Suppose you have a file in which you have deleted many rows from one or more tables, or removed many leaves or even entire subtrees. These operations can leave holes (i.e. space that is no longer used) in your files, which may affect not only their size but, more importantly, I/O performance. This is because when you delete many rows from a table, the space is not automatically reclaimed on the fly. In addition, if you add many more rows to a table than specified in the expectedrows keyword at creation time, this may hurt performance as well, as explained in section 5.1.
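The situation can be reproduced with a few lines of code. What follows is a minimal sketch using the PyTables 1.x API (openFile, createTable, removeRows); the file and table names are just placeholders:

    import tables

    class Particle(tables.IsDescription):
        id    = tables.Int32Col()
        value = tables.Float64Col()

    fileh = tables.openFile("holey.h5", mode="w")
    # expectedrows is only a hint, but a badly wrong one leads to
    # suboptimal chunk and buffer sizes (see section 5.1).
    table = fileh.createTable(fileh.root, "particles", Particle,
                              expectedrows=1000000)
    row = table.row
    for i in xrange(1000000):
        row["id"] = i
        row["value"] = float(i)
        row.append()
    table.flush()
    # Removing half of the rows leaves holes in the file; the space
    # is not reclaimed on the fly.
    table.removeRows(0, 500000)
    fileh.close()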
To cope with these issues, PyTables ships with a handy utility called ptrepack. It is useful not only for compacting fragmented files, but also for adjusting some internal parameters (both in memory and on disk) so as to obtain adequate buffer sizes and chunk sizes for optimum I/O speed. Please check appendix C.2 for a brief tutorial on its use.
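For example, compacting the file created above is just a matter of repacking it into a fresh destination (the full option set is described in appendix C.2; the file names here are placeholders):

    $ ptrepack -o holey.h5:/ compacted.h5:/

Since every node is copied anew into the destination file, the holes disappear, and buffer and chunk sizes can be recomputed from the actual amount of data.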
Another use for ptrepack is changing the compression filters or compression levels of your existing data, whether to check how this affects both final size and I/O performance, or to get rid of optional compressors like LZO, UCL or bzip2 in files that you want to read with generic HDF5 tools lacking support for these filters.
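As a sketch (again, see appendix C.2 for the authoritative list of options, and double-check the flag names against your installed version), the following would recompress everything with plain zlib, which any HDF5 tool understands, or turn compression off altogether:

    $ ptrepack --complevel=1 --complib=zlib source.h5:/ portable.h5:/
    $ ptrepack --complevel=0 source.h5:/ uncompressed.h5:/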