Chapter 1. Introduction

 

La sabiduría no vale la pena si no es posible servirse de ella para inventar una nueva manera de preparar los garbanzos. (Wisdom isn't worth anything if you can't use it to come up with a new way to cook garbanzos.)

 —A wise Catalan in "Cien años de soledad" Gabriel García Márquez

The goal of PyTables is to let the end user easily manipulate data tables and array objects in a hierarchical structure. The underlying hierarchical data organization is built on the excellent HDF5 library (see []).

Note that this package is not intended to be a complete wrapper for the entire HDF5 API. Instead, it provides a flexible, very Pythonic tool for dealing with (arbitrarily) large amounts of data (typically bigger than available memory) in tables and arrays organized in a hierarchical, persistent disk storage structure.

A table is defined as a collection of records whose values are stored in fixed-length fields. All records have the same structure and all values in each field have the same data type. Fixed-length fields and strict data types may seem a strange requirement for an interpreted language like Python, but they serve a useful function if the goal is to save very large quantities of data (such as those generated by many data acquisition systems, Internet services or scientific applications) efficiently, reducing the demand on CPU time and I/O.
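To make the benefit of fixed-length fields concrete, here is a minimal sketch that uses only the standard library `struct` module. It is not how PyTables or HDF5 store data internally; the record layout (a 16-byte name, a signed 64-bit integer and an unsigned 16-bit integer) and the helper names `pack_records` and `read_record` are assumptions made for illustration. Because every record has the same size, the position of record `i` is simply `i * record_size`, so it can be read without scanning the records before it.

```python
import struct

# Hypothetical layout: 16-byte name, signed 64-bit id, unsigned 16-bit count.
# "<" requests little-endian byte order with no padding between fields.
RECORD = struct.Struct("<16sqH")


def pack_records(rows):
    """Serialize (name, idnumber, count) tuples into one contiguous buffer."""
    return b"".join(RECORD.pack(name, idnumber, count)
                    for name, idnumber, count in rows)


def read_record(buffer, index):
    """Read record number `index` directly from its computed offset."""
    name, idnumber, count = RECORD.unpack_from(buffer, index * RECORD.size)
    # Short names are padded with NUL bytes up to the fixed field length.
    return name.rstrip(b"\x00"), idnumber, count


rows = [(b"alpha", 1, 10), (b"beta", 2, 20), (b"gamma", 3, 30)]
buffer = pack_records(rows)
third = read_record(buffer, 2)
```

The same idea, applied to data on disk, is what lets a library jump to any row of a very large table at a fixed cost.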

To emulate in Python the records that map to HDF5 C structs, PyTables implements a special class that makes it easy to define all their fields and other properties. PyTables also provides a powerful interface for mining data in tables. In the HDF5 naming scheme, the records in a table are also known as compound data types.

For example, you can define arbitrary tables in Python simply by declaring a class with the names and types of its fields:


from tables import (IsDescription, StringCol, Int64Col, UInt16Col, UInt8Col,
                    Int32Col, IntCol, Float32Col, FloatCol)

class Particle(IsDescription):
    name      = StringCol(16)   # 16-character string
    idnumber  = Int64Col()      # signed 64-bit integer
    ADCcount  = UInt16Col()     # unsigned short integer
    TDCcount  = UInt8Col()      # unsigned byte
    grid_i    = Int32Col()      # 32-bit integer
    grid_j    = IntCol()        # integer (equivalent to Int32Col)
    class Properties(IsDescription):  # a sub-structure (nested data type)
        pressure = Float32Col(shape=(2,3)) # 2-D float array (single-precision)
        energy   = FloatCol(shape=(2,3,4)) # 3-D float array (double-precision)

You then pass this class to the table constructor, fill its rows with your values, and save (arbitrarily large) collections of them to a file for persistent storage. After that, the data can be retrieved and post-processed quite easily with PyTables or even with another HDF5 application (in C, Fortran, Java or any other language that provides a library to interface with HDF5).
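The workflow just described (create a table from a description class, fill its rows, save it, read it back) might look like the following minimal sketch. It assumes a recent PyTables release with the snake_case API (`open_file`, `create_group`, `create_table`); older releases spell these names differently. The file name, the group and table names, and the reduced `Particle` description are hypothetical and chosen only to keep the example short.

```python
import os
import tempfile

import tables
from tables import IsDescription, StringCol, Int64Col, UInt16Col, Float32Col


class Particle(IsDescription):
    # Reduced, hypothetical description used only for this sketch.
    name = StringCol(16)      # 16-character string
    idnumber = Int64Col()     # signed 64-bit integer
    ADCcount = UInt16Col()    # unsigned short integer
    pressure = Float32Col()   # single-precision float


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "particles.h5")  # hypothetical file name

    # Create the file, a group and a table, then fill the table row by row.
    with tables.open_file(path, mode="w", title="Example file") as h5file:
        group = h5file.create_group("/", "detector", "Detector information")
        table = h5file.create_table(group, "readout", Particle, "Readout example")
        row = table.row
        for i in range(10):
            row["name"] = ("particle-%d" % i).encode("ascii")
            row["idnumber"] = i * (2 ** 34)
            row["ADCcount"] = i * 256
            row["pressure"] = float(i * i)
            row.append()
        table.flush()  # write any buffered rows to disk

    # Reopen the file and run a simple in-kernel selection.
    with tables.open_file(path, mode="r") as h5file:
        table = h5file.root.detector.readout
        nrows = table.nrows
        selected = [float(r["pressure"]) for r in table.where("ADCcount > 1024")]
```

The selection string is evaluated by PyTables itself, so only the matching rows are handed back to Python.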

Other important entities in PyTables are array objects, which are analogous to tables except that all of their components are homogeneous. They come in different flavors: generic (a quick way to deal with numerical arrays), enlargeable (arrays that can be extended along any single dimension) and variable length (each row in the array can have a different number of elements).
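The three array flavors can be sketched as follows. This example assumes a recent PyTables release with the snake_case API (`create_array`, `create_earray`, `create_vlarray`) and NumPy as the in-memory array package; the file and node names are hypothetical.

```python
import os
import tempfile

import numpy as np
import tables

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "arrays.h5")  # hypothetical file name

    with tables.open_file(path, mode="w") as h5file:
        # Generic array: shape and type are taken from the object saved.
        data = np.arange(12, dtype=np.int32).reshape(3, 4)
        h5file.create_array("/", "generic", data)

        # Enlargeable array: the 0 in the shape marks the extensible dimension.
        earr = h5file.create_earray("/", "enlargeable",
                                    atom=tables.Float64Atom(), shape=(0, 3))
        earr.append(np.ones((2, 3)))
        earr.append(np.zeros((4, 3)))

        # Variable-length array: each row may hold a different number of items.
        vlarr = h5file.create_vlarray("/", "ragged", atom=tables.Int32Atom())
        vlarr.append([1, 2, 3])
        vlarr.append([4])
        vlarr.append([5, 6])

    with tables.open_file(path, mode="r") as h5file:
        generic_shape = tuple(h5file.root.generic.shape)
        enlargeable_shape = tuple(h5file.root.enlargeable.shape)
        ragged_lengths = [len(r) for r in h5file.root.ragged]
```

After the two `append` calls the enlargeable array holds six rows of three values, while the variable-length array keeps rows of three, one and two elements.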

The next section describes the most interesting capabilities of PyTables.

1.1. Main Features

PyTables takes advantage of the object orientation and introspection capabilities offered by Python, the powerful data management features of HDF5, and the flexibility and high performance of numarray in manipulating large sets of objects organized in a grid-like fashion, to provide these features: