7.2. Using the tables.NetCDF module

The module tables.NetCDF emulates the Scientific.IO.NetCDF API using PyTables. It presents the data in the form of objects that behave very much like arrays. A tables.NetCDF file contains any number of dimensions and variables, both of which have unique names. Each variable has a shape defined by a set of dimensions, and optionally attributes whose values can be numbers, number sequences, or strings. One dimension of a file can be defined as unlimited, meaning that the file can grow along that direction. In the sections that follow, a step-by-step tutorial shows how to create and modify a tables.NetCDF file. All of the code snippets presented here are included in examples/netCDF_example.py. The tables.NetCDF module is designed to be used as a drop-in replacement for Scientific.IO.NetCDF, with only minor modifications to existing code. The differences between tables.NetCDF and Scientific.IO.NetCDF are summarized in the last section of this chapter.

7.2.1. Creating/Opening/Closing a tables.NetCDF file

To create a tables.NetCDF file from Python, you simply call the NetCDFFile constructor. This is also the method used to open an existing tables.NetCDF file. The object returned is an instance of the NetCDFFile class, and all future access must be done through this object. If the file is open for write access ('w' or 'a'), you may write any type of new data, including new dimensions, variables and attributes. The optional history keyword argument can be used to set the history global file attribute of the NetCDFFile. Closing the tables.NetCDF file is accomplished via the close method of the NetCDFFile object.

Here's an example:


>>> import tables.NetCDF as NetCDF
>>> import time
>>> history = 'Created ' + time.ctime(time.time())
>>> file = NetCDF.NetCDFFile('test.h5', 'w', history=history)
>>> file.close()
	    

7.2.2. Dimensions in a tables.NetCDF file

NetCDF defines the sizes of all variables in terms of dimensions, so before any variables can be created the dimensions they use must be created first. A dimension is created using the createDimension method of the NetCDFFile object. A Python string is used to set the name of the dimension, and an integer value is used to set the size. To create an unlimited dimension (a dimension that can be appended to), the size value is set to None.


>>> import tables.NetCDF as NetCDF
>>> file = NetCDF.NetCDFFile('test.h5', 'a')
>>> file.createDimension('level', 12)
>>> file.createDimension('time', None)
>>> file.createDimension('lat', 90)
	    

All of the dimension names and their associated sizes are stored in a Python dictionary.


>>> print file.dimensions
{'lat': 90, 'time': None, 'level': 12}
	    

7.2.3. Variables in a tables.NetCDF file

Most of the data in a tables.NetCDF file is stored in netCDF variables (the exception is global attributes). To create a netCDF variable, use the createVariable method of the NetCDFFile object. The createVariable method has three mandatory arguments: the variable name (a Python string); the variable datatype, given as a single-character Numeric typecode string; and a tuple containing the variable's dimension names (defined previously with createDimension). The typecode can be one of f (Float32), d (Float64), i (Int32), l (Int32), s (Int16), c (CharType - length 1), F (Complex32), D (Complex64) or 1 (Int8). The dimensions themselves are usually defined as variables, called coordinate variables. The createVariable method returns an instance of the NetCDFVariable class whose methods can be used later to access and set variable data and attributes.


>>> times = file.createVariable('time','d',('time',))
>>> levels = file.createVariable('level','i',('level',))
>>> latitudes = file.createVariable('latitude','f',('lat',))
>>> temp = file.createVariable('temp','f',('time','level','lat',))
>>> pressure = file.createVariable('pressure','i',('level','lat',))
	    

All of the variables in the file are stored in a Python dictionary, in the same way as the dimensions:


>>> print file.variables
{'latitude': <tables.NetCDF.NetCDFVariable instance at 0x244f350>,
 'pressure': <tables.NetCDF.NetCDFVariable instance at 0x244f508>,
 'level': <tables.NetCDF.NetCDFVariable instance at 0x244f0d0>,
 'temp': <tables.NetCDF.NetCDFVariable instance at 0x244f3a0>,
 'time': <tables.NetCDF.NetCDFVariable instance at 0x2564c88>}
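
The way dimension names determine a variable's shape can be sketched in plain Python. This is only an illustration and not how tables.NetCDF is implemented: the helper variable_shape is hypothetical, and treating the unlimited dimension as having length 0 until data is appended is an assumption based on the temp.shape output shown later in this chapter.

```python
# Illustrative sketch only; variable_shape is a hypothetical helper,
# not part of tables.NetCDF.

# Dimension table, as printed by file.dimensions in the tutorial.
dimensions = {'lat': 90, 'time': None, 'level': 12}

def variable_shape(dim_names, dimensions, unlimited_length=0):
    """Resolve a tuple of dimension names to a shape tuple.

    A size of None marks the unlimited dimension; its current length
    is passed in separately (assumed 0 before any data is appended).
    """
    shape = []
    for name in dim_names:
        if name not in dimensions:
            raise KeyError('dimension %r has not been created' % name)
        size = dimensions[name]
        shape.append(unlimited_length if size is None else size)
    return tuple(shape)

temp_shape = variable_shape(('time', 'level', 'lat'), dimensions)
pressure_shape = variable_shape(('level', 'lat'), dimensions)
```

Here temp_shape comes out as (0, 12, 90) and pressure_shape as (12, 90), which is why every dimension must exist before a variable that uses it can be created.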

	    

7.2.4. Attributes in a tables.NetCDF file

There are two types of attributes in a tables.NetCDF file, global (or file) and variable. Global attributes provide information about the dataset, or file, as a whole. Variable attributes provide information about one of the variables in the file. Global attributes are set by assigning values to NetCDFFile instance variables. Variable attributes are set by assigning values to NetCDFVariable instance variables.

Attributes can be strings, numbers or sequences. Returning to our example,


>>> file.description = 'bogus example to illustrate the use of tables.NetCDF'
>>> file.source = 'PyTables Users Guide'
>>> latitudes.units = 'degrees north'
>>> pressure.units = 'hPa'
>>> temp.units = 'K'
>>> times.units = 'days since January 1, 2005'
>>> times.scale_factor = 1
	    

The ncattrs method of the NetCDFFile object can be used to retrieve the names of all the global attributes. This method is provided as a convenience, since the built-in Python function dir also returns many private methods and attributes that cannot (or should not) be modified by the user. Similarly, the ncattrs method of a NetCDFVariable object returns all of the netCDF variable attribute names. These methods can be used to easily print all of the attributes currently defined, like this:


>>> for name in file.ncattrs():
...     print 'Global attr', name, '=', getattr(file,name)
Global attr description = bogus example to illustrate the use of tables.NetCDF
Global attr history = Created Mon Nov  7 10:30:56 2005
Global attr source = PyTables Users Guide
	

Note that the ncattrs function is not part of the Scientific.IO.NetCDF interface.
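
To see why a method like ncattrs is more convenient than dir, consider the following minimal sketch. The class ToyFile is a hypothetical stand-in, and the rule used here (skip names that start with an underscore) is an assumption made for illustration; it is not taken from the tables.NetCDF source.

```python
# Illustrative sketch only: ToyFile is a hypothetical stand-in, and this
# filtering rule is an assumption, not the tables.NetCDF implementation.

class ToyFile(object):
    def __init__(self):
        # Private bookkeeping that a user should not modify.
        self._handle = None

    def ncattrs(self):
        """Return the names of user-set attributes, skipping private ones."""
        return sorted(name for name in vars(self)
                      if not name.startswith('_'))

f = ToyFile()
f.description = 'bogus example'
f.source = 'PyTables Users Guide'

names = f.ncattrs()
for name in names:
    print('Global attr %s = %s' % (name, getattr(f, name)))
```

In this sketch dir(f) would also list ncattrs, _handle and all the inherited special methods, while names holds only description and source.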

7.2.5. Writing data to and retrieving data from a tables.NetCDF variable

Now that you have a netCDF variable object, how do you put data into it? If the variable has no unlimited dimension, you just treat it like a Numeric array object and assign data to a slice.


>>> import numarray
>>> levels[:] = numarray.arange(12)+1
>>> latitudes[:] = numarray.arange(-89,90,2)
>>> for i, lev in enumerate(levels[:]):
...     pressure[i,:] = 1000.-100.*lev
>>> print 'levels = ',levels[:]
levels =  [ 1  2  3  4  5  6  7  8  9 10 11 12]
>>> print 'latitudes =\n',latitudes[:]
latitudes =
[-89. -87. -85. -83. -81. -79. -77. -75. -73. -71. -69. -67. -65. -63.
 -61. -59. -57. -55. -53. -51. -49. -47. -45. -43. -41. -39. -37. -35.
 -33. -31. -29. -27. -25. -23. -21. -19. -17. -15. -13. -11.  -9.  -7.
  -5.  -3.  -1.   1.   3.   5.   7.   9.  11.  13.  15.  17.  19.  21.
  23.  25.  27.  29.  31.  33.  35.  37.  39.  41.  43.  45.  47.  49.
  51.  53.  55.  57.  59.  61.  63.  65.  67.  69.  71.  73.  75.  77.
  79.  81.  83.  85.  87.  89.]

       

Note that retrieving data from the netCDF variable object works just like a Numeric array too. If the netCDF variable has an unlimited dimension, and there is not yet an entry for the data along that dimension, the append method must be used.


>>> for n in range(10):
...     times.append(n)
>>> print 'times = ',times[:]
times =  [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9.]

	    

The data you append must have either the same number of dimensions as the NetCDFVariable, or one less. The shape of the data you append must be the same as the NetCDFVariable for all of the dimensions except the unlimited dimension. The length of the data along the unlimited dimension controls how many entries along the unlimited dimension are appended. If the data you append has one fewer dimension than the NetCDFVariable, it is assumed that you are appending one entry along the unlimited dimension. For example, if the NetCDFVariable has shape (10,50,100) (where the dimension of length 10 is the unlimited dimension), and you append an array of shape (50,100), the NetCDFVariable will subsequently have a shape of (11,50,100). If you append an array with shape (5,50,100), the NetCDFVariable will have a new shape of (15,50,100). Appending an array whose last two dimensions do not have a shape (50,100) will raise an exception. This append method does not exist in the Scientific.IO.NetCDF interface; instead, entries are appended along the unlimited dimension one at a time by assigning to a slice. This is the biggest difference between the tables.NetCDF and Scientific.IO.NetCDF interfaces.
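
The shape rules for append can be summarized in a few lines of plain Python. This is a sketch of the rules as described above, not the module's own code: the function appended_shape is hypothetical, it assumes the unlimited dimension is the first one (as in the (10,50,100) example), and the exception type is chosen only for illustration.

```python
# Illustrative sketch only; appended_shape is a hypothetical helper.

def appended_shape(var_shape, data_shape):
    """Return the variable's shape after appending data of data_shape.

    Assumes the unlimited dimension is the first dimension.
    """
    var_shape = tuple(var_shape)
    data_shape = tuple(data_shape)
    fixed = var_shape[1:]
    if len(data_shape) == len(var_shape) - 1:
        # One dimension fewer: a single entry along the unlimited dimension.
        entries, rest = 1, data_shape
    elif len(data_shape) == len(var_shape):
        # Same rank: the leading length is the number of entries appended.
        entries, rest = data_shape[0], data_shape[1:]
    else:
        raise ValueError('data must have %d or %d dimensions'
                         % (len(var_shape) - 1, len(var_shape)))
    if rest != fixed:
        raise ValueError('data shape %r does not match %r' % (rest, fixed))
    return (var_shape[0] + entries,) + fixed

one_entry = appended_shape((10, 50, 100), (50, 100))
five_entries = appended_shape((10, 50, 100), (5, 50, 100))
```

With these inputs one_entry is (11, 50, 100) and five_entries is (15, 50, 100), matching the example in the text; a data shape such as (50, 99) raises an error.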

Once data has been appended to any variable with an unlimited dimension, the sync method can be used to synchronize the sizes of all the other variables with an unlimited dimension. This is done by filling in missing values (given by the default netCDF _FillValue, which is intended to indicate that the data was never defined). The sync method is automatically invoked when a NetCDFFile object is closed. Once the sync method has been invoked, the filled-in values can be assigned real data with slices.


>>> print 'temp.shape before sync = ',temp.shape
temp.shape before sync =  (0, 12, 90)
>>> file.sync()
>>> print 'temp.shape after sync = ',temp.shape
temp.shape after sync =  (10L, 12, 90)
>>> import numarray.random_array as random_array
>>> for n in range(10):
...     temp[n] = 10.*random_array.random(pressure.shape)
...     print 'time, min/max temp, temp[n,0,0] = ',\
...           times[n],min(temp[n].flat),max(temp[n].flat),temp[n,0,0]
time, min/max temp, temp[n,0,0] = 0.0 0.0122650898993 9.99259281158 6.13053750992
time, min/max temp, temp[n,0,0] = 1.0 0.00115821603686 9.9915933609 6.68516159058
time, min/max temp, temp[n,0,0] = 2.0 0.0152112031356 9.98737239838 3.60537290573
time, min/max temp, temp[n,0,0] = 3.0 0.0112022599205 9.99535560608 6.24249696732
time, min/max temp, temp[n,0,0] = 4.0 0.00519315246493 9.99831295013 0.225010097027
time, min/max temp, temp[n,0,0] = 5.0 0.00978941563517 9.9843454361 4.56814193726
time, min/max temp, temp[n,0,0] = 6.0 0.0159023851156 9.99160385132 6.36837291718
time, min/max temp, temp[n,0,0] = 7.0 0.0019518379122 9.99939727783 1.42762875557
time, min/max temp, temp[n,0,0] = 8.0 0.00390585977584 9.9909954071 2.79601073265
time, min/max temp, temp[n,0,0] = 9.0 0.0106026884168 9.99195957184 8.18835449219
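
The effect of sync can be pictured with ordinary Python lists. The sketch below is only a model of the behaviour described above: sync_lengths is a hypothetical helper, each variable is reduced to a list of entries along the unlimited dimension, and FILL is a placeholder rather than the actual default netCDF _FillValue.

```python
# Illustrative sketch only: FILL is a placeholder, not the actual
# default netCDF _FillValue, and sync_lengths is a hypothetical helper.
FILL = None

def sync_lengths(variables):
    """Pad every record list to the length of the longest one.

    variables maps a name to the list of entries stored so far along
    the unlimited dimension.
    """
    target = max(len(entries) for entries in variables.values())
    for entries in variables.values():
        entries.extend([FILL] * (target - len(entries)))
    return target

records = {'time': list(range(10)), 'temp': []}
length = sync_lengths(records)

# Assigning to an existing entry replaces a fill value without
# changing the length, just as slice assignment does in the tutorial.
records['temp'][3] = 273.15
```

After the call both lists have ten entries, and the entries of temp that have not been assigned yet still hold the fill placeholder.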

	    

Note that appending data along an unlimited dimension always increases the length of the variable along that dimension. Assigning data to a variable with an unlimited dimension with a slice operation does not change its shape. Finally, before closing the file we can get a summary of its contents simply by printing the NetCDFFile object. This produces output very similar to running 'ncdump -h' on a netCDF file.


>>> print file
test.h5 {
dimensions:
    lat = 90 ;
    time = UNLIMITED ; // (10 currently)
    level = 12 ;
variables:
    float latitude('lat',) ;
        latitude:units = 'degrees north' ;
    int pressure('level', 'lat') ;
        pressure:units = 'hPa' ;
    int level('level',) ;
    float temp('time', 'level', 'lat') ;
        temp:units = 'K' ;
    double time('time',) ;
        time:scale_factor = 1 ;
        time:units = 'days since January 1, 2005' ;
// global attributes:
        :description = 'bogus example to illustrate the use of tables.NetCDF' ;
        :history = 'Created Wed Nov  9 12:29:13 2005' ;
        :source = 'PyTables Users Guide' ;
}
	    

7.2.6. Efficient compression of tables.NetCDF variables

Data stored in NetCDFVariable objects is compressed on disk by default. The parameters for the default compression are determined from a Filters class instance (see section 4.17.1) with complevel=6, complib='zlib' and shuffle=1. To change the default compression, simply pass a Filters instance to createVariable with the filters keyword. If your data only has a certain number of digits of precision (say, for example, it is temperature data that was measured with a precision of 0.1 degrees), you can dramatically improve compression by quantizing (or truncating) the data using the least_significant_digit keyword argument to createVariable. The least significant digit is the power of ten of the smallest decimal place in the data that is a reliable value. For example, if the data has a precision of 0.1, then setting least_significant_digit=1 will cause the data to be quantized using numarray.around(scale*data)/scale, where scale = 2**bits, and bits is determined so that a precision of 0.1 is retained (in this case bits=4).
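
The quantization formula can be tried out on a single value in plain Python. This is a sketch under stated assumptions: the function quantize is hypothetical, the built-in round stands in for numarray.around, and the rule used to choose bits (the smallest number of bits for which 2**-bits is no coarser than the requested precision) is an assumption that reproduces bits=4 for a precision of 0.1. It may differ in detail from what tables.NetCDF does.

```python
import math

# Illustrative sketch only; quantize is a hypothetical helper and the
# rule for choosing bits is an assumption consistent with bits=4 above.

def quantize(value, least_significant_digit):
    """Quantize value so the requested decimal precision is retained."""
    # Smallest number of bits whose step 2**-bits is no coarser than
    # the requested precision 10**-least_significant_digit.
    bits = int(math.ceil(math.log(10.0 ** least_significant_digit, 2)))
    scale = 2.0 ** bits
    # Built-in round stands in for numarray.around here.
    return round(scale * value) / scale, bits

quantized, bits = quantize(20.37, 1)
```

With least_significant_digit=1 the scale is 16, so 20.37 is stored as 20.375. The error stays well below 0.1, and because every stored value is a multiple of 1/16, the low-order bits of the data carry no information, which gives the compressor more regular input to work with.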

In our example, try replacing the line


>>> temp = file.createVariable('temp','f',('time','level','lat',))
	    

with


>>> temp = file.createVariable('temp','f',('time','level','lat',),
                               least_significant_digit=1)
	    

and see how much smaller the resulting file is.

The least_significant_digit keyword argument is not allowed in Scientific.IO.NetCDF, since netCDF version 3 does not support compression. The flexible, fast and efficient compression available in HDF5 is the main reason I wrote the tables.NetCDF module - my netCDF files were just getting too big.

The createVariable method has one other keyword argument not found in Scientific.IO.NetCDF: expectedsize. The expectedsize keyword can be used to set the expected number of entries along the unlimited dimension (default 10000). If you expect that your data will have an order of magnitude more or less than 10000 entries along the unlimited dimension, you may consider setting this keyword to improve efficiency (see section 5.1 for details).