... durch planmässiges Tattonieren. [... through systematic, palpable experimentation.]
-- Johann Karl Friedrich Gauss [asked how he came upon his theorems]
In this chapter, you will gain deeper knowledge of PyTables internals. PyTables offers several places where you can improve the performance of your application. If you plan to deal with really large data, read this section carefully to learn how to get a significant efficiency boost for your code. But if your dataset is small or medium-sized (say, up to 10 MB), you need not worry about this, because the default parameters in PyTables are already tuned to handle that case well.
The underlying HDF5 library used by PyTables allows certain datasets (chunked datasets) to take the data in bunches of a certain length, so-called chunks, and to write them to disk as a whole. That is, the HDF5 library treats chunks as atomic objects, and disk I/O is always done in terms of complete chunks. This allows the application to define data filters that perform tasks such as compression, encryption, or checksumming on entire chunks.
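To make the idea of chunks as atomic units concrete, here is a minimal toy model in plain Python. It is not how HDF5 is implemented; the chunk size, the function names, and the use of zlib as the "filter" are all assumptions chosen for illustration. It shows that each chunk is compressed and checksummed independently, and that reading even a single byte means decoding the whole chunk that holds it.

```python
import zlib

CHUNK_BYTES = 4096  # hypothetical chunk size, chosen only for this sketch


def write_chunks(data, chunk_bytes=CHUNK_BYTES):
    """Split data into chunks and apply the filters to each chunk as a unit."""
    stored = []
    for start in range(0, len(data), chunk_bytes):
        chunk = data[start:start + chunk_bytes]
        # Checksum and compression both operate on the entire chunk.
        stored.append((zlib.crc32(chunk), zlib.compress(chunk)))
    return stored


def read_byte(stored, offset, chunk_bytes=CHUNK_BYTES):
    """Read one byte; the complete chunk containing it is decoded first."""
    checksum, payload = stored[offset // chunk_bytes]
    chunk = zlib.decompress(payload)
    if zlib.crc32(chunk) != checksum:
        raise ValueError("chunk checksum mismatch")
    return chunk[offset % chunk_bytes]


data = bytes(range(256)) * 64  # 16384 bytes, i.e. 4 chunks of 4096 bytes
stored = write_chunks(data)
print(len(stored), read_byte(stored, 5000) == data[5000])
```

Because a filter only ever sees complete chunks, the chunk size sets the smallest unit of I/O and of filter work in this model.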
An in-memory B-tree is used to map chunk structures on disk. The more chunks that are allocated for a dataset, the larger the B-tree. Large B-trees take memory and cause file storage overhead, as well as more disk I/O and higher contention for the metadata cache. Consequently, it is important to strike a balance between low memory and I/O overhead (small B-trees, which result from fewer, larger chunks) and short data access time (big B-trees, which result from more, smaller chunks).
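The trade-off can be sketched with simple arithmetic. The helper below is hypothetical (it is not part of PyTables or HDF5) and only counts how many chunks, and therefore how many entries the B-tree has to index, a table needs for a given chunk size. The row count, row size, and chunk sizes are made-up figures.

```python
def chunk_count(nrows, row_bytes, chunk_bytes):
    """Number of chunks needed to hold nrows rows, whole rows per chunk."""
    rows_per_chunk = max(1, chunk_bytes // row_bytes)
    # Integer ceiling division: a partially filled last chunk still counts.
    return (nrows + rows_per_chunk - 1) // rows_per_chunk


nrows = 10_000_000  # assumed table length
row_bytes = 16      # assumed row size

for chunk_bytes in (4 * 1024, 64 * 1024, 1024 * 1024):
    print(chunk_bytes, chunk_count(nrows, row_bytes, chunk_bytes))
# 4096 39063
# 65536 2442
# 1048576 153
```

Small chunks mean tens of thousands of entries to index, while large chunks keep the index small at the price of moving more data for each access.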
PyTables can determine an optimum chunk size that makes B-trees adequate to your dataset size if you help it by providing an estimate of the number of rows in the table. This must be done at table creation time, by passing this value to the expectedrows keyword of the createTable method (see 4.2.2).
When your table size is bigger than 10 MB (take this figure only as a rough reference), providing this guess of the number of rows will optimize access to your data. When the table size is larger than, say, 100 MB, you are strongly advised to provide such a guess; failing to do so may cause your application to perform very slow I/O operations and to demand huge amounts of memory. You have been warned!
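As a rough sketch of how the row estimate is passed, the following example creates an empty table with a given expectedrows value and reports the chunk length PyTables picked for it. It assumes PyTables 3.x, where the createTable method named above is spelled create_table and files are opened with open_file. The Reading description, the node name, and the chunk_rows_for helper are hypothetical, and the values printed depend on the installed PyTables version, so none are shown.

```python
import os
import tempfile

import tables  # PyTables; this sketch assumes version 3.x


class Reading(tables.IsDescription):
    """Hypothetical row layout: two 8-byte columns, 16 bytes per row."""
    timestamp = tables.Int64Col()
    value = tables.Float64Col()


def chunk_rows_for(expectedrows):
    """Create an empty table with the given row estimate and return
    the number of rows per chunk that PyTables selected for it."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "demo.h5")
        with tables.open_file(path, mode="w") as h5file:
            table = h5file.create_table(
                "/", "readings", Reading,
                expectedrows=expectedrows,  # the size hint discussed above
            )
            return table.chunkshape[0]


if __name__ == "__main__":
    # Compare a small estimate with a very large one.
    for estimate in (1_000, 100_000_000):
        print(estimate, chunk_rows_for(estimate))
```

The estimate does not need to be exact; it is only a hint used when the table is created, and no space is reserved for rows that are never written.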