PyTables User's Guide: Hierarchical datasets in Python. Release 1.3.2

Chapter 5. Optimization tips
If you are going to perform many searches like the following one, and you want to improve the time taken to run them, keep reading:

    result = [ row['var2'] for row in table
               if row['var1'] <= 20 ]

For future reference, we will call this the standard selection mode.
PyTables provides a way to accelerate data selections when they are simple, i.e. when only one column is involved in the condition, through the use of the where iterator (see 4.6.2). We will call this mode of selecting data in-kernel. Let's see an example of an in-kernel selection based on the standard selection mentioned above:
    result = [ row['var2'] for row in table.where(table.cols.var1 <= 20) ]

This simple change of selection mode can account for an improvement in search times of up to a factor of 10 (see figure 5.1).
Figure 5.1. Times for different selection modes over Int32 values. Benchmark made on a machine with Itanium (IA64) @ 900 MHz processors with SCSI disk @ 10K RPM.
Figure 5.2. Times for different selection modes over Float64 values. Benchmark made on a machine with Itanium (IA64) @ 900 MHz processors with SCSI disk @ 10K RPM.
So, where is the trick? It's easy. In the standard selection mode, the data for column var1 has to be carried up to Python space in order to evaluate the condition and decide whether the var2 value should be added to the result list. By contrast, in the in-kernel mode the condition is passed to the PyTables kernel (hence the name), written in C, and evaluated there at C speed (with some help from the numarray package), so the only values brought up to Python space are the references for the rows that fulfilled the condition.
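To make the difference concrete, here is a minimal, self-contained sketch that builds a small table and times both modes. The file name, table name, row count and data are illustrative assumptions, not taken from this manual:

    from time import time
    from tables import openFile, IsDescription, IntCol, FloatCol, StringCol

    class Record(IsDescription):
        var1 = IntCol(pos=1)
        var2 = FloatCol(pos=2)
        var3 = StringCol(length=4, dflt="", pos=3)

    fileh = openFile("selection_demo.h5", mode="w")
    table = fileh.createTable(fileh.root, "demo", Record)
    row = table.row
    for i in range(100000):
        row['var1'] = i % 100
        row['var2'] = float(i)
        row['var3'] = ["foo", "bar"][i % 2]
        row.append()
    table.flush()

    # Standard selection: the condition is evaluated in Python space.
    t0 = time()
    standard = [ row['var2'] for row in table if row['var1'] <= 20 ]
    print "standard:  %.3f s" % (time() - t0)

    # In-kernel selection: the condition is evaluated by the PyTables kernel.
    t0 = time()
    inkernel = [ row['var2'] for row in table.where(table.cols.var1 <= 20) ]
    print "in-kernel: %.3f s" % (time() - t0)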
You should note, however, that currently the where method only accepts conditions along a single column[1]. Fortunately, you can mix the in-kernel and standard selection modes for evaluating arbitrarily complex conditions along several columns at once. Look at this example:
    result = [ row['var2'] for row in table.where(table.cols.var3 == "foo")
               if row['var1'] <= 20 ]

Here, we have used an in-kernel selection to filter the rows whose var3 field is equal to the string "foo". Then, we apply a standard selection to complete the query.
Of course, when you mix the in-kernel and standard selection modes you should pass the most restrictive condition to the in-kernel part, i.e. to the where iterator. In situations where it is not clear which is the most restrictive condition, you might want to experiment a bit in order to find the best combination.
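The simplest such experiment is to time both orderings. Continuing with the table built in the sketch above:

    # Put the var3 condition in the kernel and filter var1 in Python space...
    t0 = time()
    r1 = [ row['var2'] for row in table.where(table.cols.var3 == "foo")
           if row['var1'] <= 20 ]
    print "var3 in-kernel: %.3f s" % (time() - t0)

    # ...then swap the roles and keep whichever combination is faster.
    t0 = time()
    r2 = [ row['var2'] for row in table.where(table.cols.var1 <= 20)
           if row['var3'] == "foo" ]
    print "var1 in-kernel: %.3f s" % (time() - t0)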
When you need more speed than in-kernel selections can offer, PyTables provides a third selection method, the so-called indexed mode. In this mode, you have to decide which column(s) you are going to do your selections on, and index them. Indexing is essentially a sort operation, so that subsequent searches along a column can look at the sorted information using a binary search, which is much faster than a sequential search.
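Conceptually, the gain comes from binary search over sorted data. Here is a toy illustration using Python's standard bisect module; this is just the idea, not how PyTables implements its indexes internally:

    from bisect import bisect_right

    data = sorted([57, 3, 20, 42, 11, 8, 99, 15])
    # All values <= 20 can be located with an O(log n) binary search
    # instead of an O(n) sequential scan of the whole column.
    upto = bisect_right(data, 20)
    print data[:upto]          # prints [3, 8, 11, 15, 20]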
You can index your selected columns in two ways:

First, you can declare a column as indexed at table creation time by passing the indexed parameter to the column descriptor. That is:
    class Example(IsDescription):
        var1 = StringCol(length=4, dflt="", pos=1, indexed=1)
        var2 = BoolCol(0, indexed=1, pos=2)
        var3 = IntCol(0, indexed=1, pos=3)
        var4 = FloatCol(0, indexed=0, pos=4)
In this case, we are declaring that the var1, var2 and var3 columns will be indexed automatically when you add rows to a table created with this description.
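For completeness, the sketch below shows such a descriptor in use; the file and table names are illustrative assumptions:

    from tables import openFile, IsDescription, StringCol, BoolCol, IntCol, FloatCol

    class Example(IsDescription):
        var1 = StringCol(length=4, dflt="", pos=1, indexed=1)
        var2 = BoolCol(0, indexed=1, pos=2)
        var3 = IntCol(0, indexed=1, pos=3)
        var4 = FloatCol(0, indexed=0, pos=4)

    fileh = openFile("indexed_demo.h5", mode="w")    # illustrative file name
    table = fileh.createTable(fileh.root, "example", Example)
    row = table.row
    for i in range(1000):
        row['var1'] = ["foo", "bar"][i % 2]
        row['var2'] = i % 2
        row['var3'] = i
        row['var4'] = float(i)
        row.append()
    table.flush()
    # var1, var2 and var3 are indexed automatically as rows are added.
    fileh.close()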
Second, you can create an index on an already existing table. For example:
    indexrows = table.cols.var1.createIndex()
    indexrows = table.cols.var2.createIndex()
    indexrows = table.cols.var3.createIndex()
will create indexes for the var1, var2 and var3 columns and, after that, they will behave as regular indexes.
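As a usage sketch, carrying over the assumed file and table names from the previous example, you could index the remaining var4 column a posteriori like this (judging from the indexrows variable above, createIndex() returns the number of rows indexed):

    from tables import openFile

    fileh = openFile("indexed_demo.h5", mode="a")    # illustrative file name
    table = fileh.root.example
    # Index a column that was not declared indexed at creation time.
    indexrows = table.cols.var4.createIndex()
    print "indexed %d rows" % indexrows
    fileh.close()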
After you have indexed a column, you can use it through the Table.where method:
    result = [ row['var2'] for row in table.where(table.cols.var1 == "foo") ]

or, if you want to add more conditions, you can mix the indexed selection with a standard one:
    result = [ row['var2'] for row in table.where(table.cols.var3 <= 20)
               if row['var1'] == "foo" ]

Remember to pass the most restrictive condition to the where iterator.
You can see in figures 5.1 and 5.2 that indexing can accelerate your data selections in tables quite a lot. For moderately large tables (more than one million rows), you can get speedups on the order of 100x with respect to in-kernel selections, and on the order of 1000x with respect to standard selections.
One important aspect of indexing in PyTables is that it has been implemented with the goal of effectively managing very large tables. In figure 5.3, you can see that the time to index columns in tables always grows linearly. In particular, the time to index a couple of columns with 1 billion rows each is 40 minutes (roughly 20 minutes each), which is quite a reasonable figure. This is because PyTables has chosen an algorithm that does a partial sort of the columns in order to ensure that the indexing time grows linearly. By contrast, most relational databases try to do a complete sort of the columns, and this makes the indexing time grow much faster with the number of rows.

The fact that relational databases use a complete sorting algorithm for their indexes means that their indexes are somewhat more effective for searching purposes than the PyTables approach, but not by a large margin. However, for relatively large tables (more than 10 million rows), the time required to complete such a sort can be so long that indexing is not normally worth the effort. In other words, PyTables indexing scales much better than that of relational databases. So don't worry if you have extremely large columns to index: PyTables is designed to cope with that perfectly.
[1] PyTables Pro will address this shortcoming.