5.3. Compression issues

One of the beauties of PyTables is that it supports compression on tables and arrays[1], although it is not used by default. Compressing large amounts of data may seem a controversial feature, because compression has a reputation for being a heavy consumer of CPU time. However, if you are willing to check whether compression can help not only by reducing the size of your dataset files but also by improving I/O efficiency, especially when dealing with very large datasets, keep reading.

There is a common scenario where users need to save duplicated data in some record fields, while the other fields hold varying values. In a relational database approach, such redundant data can normally be moved to other tables and a relationship established between the rows of the separate tables. But that takes analysis and implementation time, and makes the underlying libraries more complex and slower.

PyTables' transparent compression frees users from having to find the optimum normalisation strategy for their data tables: you can use fewer, not directly related, tables with a larger number of columns while still not cluttering the database too much with duplicated data (compression takes care of that). As a side effect, data selections become easier, because more fields are available in a single table and they can all be referred to in the same loop. This normally results in a simpler, yet powerful, way to process your data (although you should still consider carefully in which scenarios the use of compression is convenient and in which it is not).
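
To make the idea concrete, here is a minimal sketch of such a "wide" table, assuming the 1.x API used throughout this chapter. The record fields, file name and query value are hypothetical, chosen only for illustration: the duplicated descriptive field lives in the same table as the varying measurements, and compression absorbs the redundancy.

from tables import IsDescription, StringCol, IntCol, FloatCol, Filters, openFile

# Hypothetical record: 'plant' is repeated verbatim in many rows instead of
# being normalised out into a separate lookup table.
class Reading(IsDescription):
    plant  = StringCol(length=16)   # duplicated descriptive field
    sensor = IntCol()               # varying fields
    value  = FloatCol()

fileh = openFile("readings.h5", mode="w")
table = fileh.createTable(fileh.root, "readings", Reading, "Sensor readings",
                          filters=Filters(complevel=1))
# ... rows are appended here ...
# Selections can refer to both kinds of fields in the same loop,
# with no join against a separate table:
results = [r['value'] for r in table if r['plant'] == 'turbine-3']
fileh.close()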

The compression library used by default is Zlib (see []). Since HDF5 requires it, you can safely use it and expect that your HDF5 files will be readable on any other platform that has the HDF5 libraries installed. Zlib provides good compression ratios, although it is somewhat slow at compressing and reasonably fast at decompressing. Because of that, it is a good candidate for compressing your data.

However, in some situations it is critical to have very good decompression speed (at the expense of lower compression ratios or more CPU spent on compression, as we will see soon). In other situations, the emphasis is on achieving the maximum compression ratio, no matter what the resulting reading speed will be. This is why support for two additional compressors has been added to PyTables: LZO (see []) and bzip2 (see []). According to the author of LZO (and checked by the author of this section, as you will see soon), LZO offers pretty fast compression (though with a small compression ratio) and extremely fast decompression. In fact, LZO is so fast when compressing/decompressing that it may well happen (depending on your data, of course) that writing or reading a compressed dataset is sometimes faster than not compressing it at all (especially when dealing with extremely large datasets). This fact is very important if you have to deal with very large amounts of data. Regarding bzip2, it has a reputation for achieving excellent compression ratios, but at the price of spending much more CPU time, which results in very low compression/decompression speeds.

Be aware that the LZO and bzip2 support in PyTables is not standard in HDF5, so if you are going to use your PyTables files in contexts other than PyTables you will not be able to read them. Still, see appendix C.2 (where the ptrepack utility is described) for a way to free your files from LZO or bzip2 dependencies, so that you can use these compressors locally with the guarantee that you can replace them with Zlib (or even remove compression completely) if you later want to use these files with other HDF5 tools or on other platforms.
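
If you are unsure which of these optional compressors your PyTables installation can actually use, you can ask at run time. The following is a minimal sketch, assuming the whichLibVersion() helper of the PyTables 1.x series, which returns None when the corresponding library is not available:

import tables

# Build a dict mapping each compressor name to True/False depending on
# whether whichLibVersion() could find it (None means not available).
available = {}
for lib in ("zlib", "lzo", "bzip2"):
    available[lib] = tables.whichLibVersion(lib) is not None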

In order to let you grasp what amount of compression can be achieved, and how this affects performance, a series of experiments has been carried out. All the results presented in this section (and in the next one) have been obtained with synthetic data and using PyTables 1.3. The tests have been conducted on an IBM OpenPower 720 (e-series) with a PowerPC G5 at 1.65 GHz and a hard disk spinning at 15K RPM. As your data and platform may be totally different, take these results just as a guide, because your mileage may vary. Finally, in order to be able to play with tables with as large a number of rows as possible, the record size has been chosen to be small (16 bytes). Here is its definition:


from tables import IsDescription, StringCol, IntCol, FloatCol

class Bench(IsDescription):
    var1 = StringCol(length=4)   # 4-character string (4 bytes)
    var2 = IntCol()              # 32-bit integer (4 bytes)
    var3 = FloatCol()            # 64-bit float (8 bytes)
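
For reference, the code used to fill one of the benchmark tables would look roughly like the sketch below. It reuses the Bench class defined above; the file name, the row count and the choice of LZO at level 1 are arbitrary illustrations, not the exact parameters behind the plots.

from tables import openFile, Filters

# Write synthetic Bench records into a compressed table (LZO, level 1 here).
fileh = openFile("bench-lzo.h5", mode="w")
table = fileh.createTable(fileh.root, "bench", Bench, "Synthetic data",
                          filters=Filters(complevel=1, complib="lzo"),
                          expectedrows=1000000)
row = table.row
for i in range(1000000):
    row['var1'] = str(i)[-4:]   # 4-character string
    row['var2'] = i             # integer
    row['var3'] = float(i)      # float
    row.append()
table.flush()
fileh.close()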

With this setup, you can look at the compression ratios that can be achieved in plot 5.4. As you can see, LZO is the compressor that performs worst in this regard but, curiously enough, there is not much difference between Zlib and bzip2.

Figure 5.4. Comparison between different compression libraries.

Also, PyTables lets you select different compression levels for Zlib and bzip2, although you may be a bit disappointed by the small improvement that these compressors show when dealing with a combination of numbers and strings, as in our example. As a reference, see plot 5.5 for a comparison of the compression achieved by selecting different levels of Zlib. Rather oddly, the best compression ratio corresponds to level 1 (!). It is difficult to explain this, but it serves to reaffirm that there is no replacement for experiments with your own data. In general, it is recommended to select the lowest level of compression in order to achieve the best performance with a decent (if not the best!) compression ratio. See later for more figures in this regard.

Figure 5.5. Comparison between different compression levels of Zlib.

Also have a look at graph 5.6. It shows how the speed of writing rows evolves as the size (the number of rows) of the table grows. Even though in these graphs the size of a single row is 16 bytes, you can most probably extrapolate these figures to other row sizes.

Figure 5.6. Writing tables with several compressors.

In plot 5.7 you can see how compression affects the reading performance. In fact, what you see in the plot is the speed of an in-kernel selection, but provided that this operation is very fast (see section 5.2.1), we can accept it as an actual read test. Compared with the reference line without compression, the general trend here is that LZO does not affect the reading performance too much (and at some points it is actually better), Zlib makes the speed drop to half, while bzip2 performs very slowly (up to 8x slower).
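
To make the timed operation concrete, such a read test amounts to iterating over the rows that satisfy a condition evaluated inside the PyTables kernel. The one-liner below is a sketch of such a selection, assuming a Bench-like table and the column-condition style of the 1.x where() call; the threshold value is arbitrary.

# In-kernel selection: the condition on var2 is evaluated by the PyTables
# kernel, so the cost is dominated by chunk decompression and scanning.
results = [row['var1'] for row in table.where(table.cols.var2 < 20)]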

Also, in the same figure 5.7 you can notice some strange peaks in the speed that we might be tempted to attribute to the libraries on which PyTables relies (HDF5, compressors...), or to PyTables itself. However, graph 5.8 reveals that, if we put the file in the filesystem cache (by reading it several times beforehand, for example), the evolution of the performance is much smoother. So the most probable explanation is that such peaks are a consequence of the underlying OS filesystem, rather than a flaw in PyTables (or any other library behind it). Another conclusion that can be drawn from the above plot is that LZO decompression performance is much better than Zlib's, allowing an improvement in overall speed of more than 2x, and, perhaps more importantly, the read performance for really large datasets (i.e. when they do not fit in the OS filesystem cache) can actually be better than not using compression at all. Finally, one can see that reading performance is affected very badly when bzip2 is used (it is 10x slower than LZO and 4x slower than Zlib), but this is not too surprising anyway.

Figure 5.7. Selecting values in tables with several compressors. The file is not in the OS cache.

Figure 5.8. Selecting values in tables with several compressors. The file is in the OS cache.

So, generally speaking and looking at the experiments above, you can expect LZO to be the fastest at both compressing and decompressing, but also the one that achieves the worst compression ratio (although that may be just fine for many situations, especially when used with the shuffle filter described in section 5.4). bzip2 is by far the slowest at both compressing and decompressing and, besides, it does not achieve any better compression ratio than Zlib. Zlib represents a balance between them: it is somewhat slower than LZO at compressing (2x) and decompressing (3x), but it normally achieves fairly good compression ratios.

Finally, by looking at plots 5.9, 5.10, and the aforementioned 5.5, you can see why the recommended compression level for all compression libraries is 1. This is the lowest level of compression, but if you take the approach suggested above, the redundant data is normally found in the same row, making the locality of the redundancy very high, so a low level of compression should be enough to achieve a good compression ratio on your data tables, saving CPU cycles for other things. Nonetheless, in some situations you may want to check for yourself how the different compression levels affect your application.

You can select the compression library and level by setting the complib and complevel keywords of the Filters class (see 4.17.1). A compression level of 0 disables compression completely (the default), 1 is the least CPU-demanding level, while 9 is the maximum level and the most CPU-intensive. Finally, bear in mind that LZO does not accept a compression level at the moment, so, when using LZO, 0 means that compression is not active and any other value means that LZO is active.
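
As an illustration, here are some typical Filters instances following the recommendations above (the variable names are hypothetical, just for this example):

from tables import Filters

zlib1  = Filters(complevel=1, complib="zlib")    # portable, decent ratio
lzo1   = Filters(complevel=1, complib="lzo")     # fastest; needs LZO installed
bzip21 = Filters(complevel=1, complib="bzip2")   # slow; rarely worth it
nocomp = Filters(complevel=0)                    # compression disabled

# The instance is passed at creation time, for example:
# table = fileh.createTable(group, 'bench', Bench, "Title", filters=lzo1)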

So, in conclusion, if your ultimate goal is writing and reading as fast as possible, choose LZO. If you want to reduce your data size as much as possible while retaining acceptable read speed, choose Zlib. Finally, if portability is important for you, Zlib is your best bet. So, when would you want to use bzip2? Well, looking at the results, it is difficult to recommend its use in general, but you may want to experiment with it in those cases where you know that it is well suited to your data pattern (for example, when dealing with repetitive string datasets).

Figure 5.9. Writing in tables with different levels of compression.

Figure 5.10. Selecting values in tables with different levels of compression. The file is in the OS cache.

Notes

[1]

More precisely, it is supported in CArray, EArray and VLArray objects, but not in Array objects.