Partition pruning is a performance optimization that limits the number of files that Drill reads for file system queries. Drill only reads a subset of the files that reside in a file system when a query matches certain filter criteria. Typically, table data in a file system is organized by directories and subdirectories. Queries on table data may contain
WHERE clause filters on specific (sub) directories.
Drill’s query planner evaluates the filters as part of a Filter operator. If no partition filters are present, the underlying Scan operator reads all files in all (sub) directories and then sends the data to operators downstream, such as Filter.
When partition filters are present, the query planner determines if it can push the filters down to the Scan such that the Scan only reads the (sub) directories that match the partition filters, thus reducing disk I/O.
Partition Pruning Example
Users/max/data/logs directory in a file system contains (sub) directories that span a few years.
The following image shows the hierarchical structure of the
…/logs directory and (sub) directories:
The following query requests log file data for 2013 from the
…/logs directory in the file system:
If you run the
EXPLAIN PLAN command for the query, you can see that the
…/logs directory is filtered by the scan operator.
The following image shows a portion of the physical plan when partition pruning is applied:
The following queries include examples of the types of filters eligible for partition pruning optimization:
Example 1: Partition filters ANDed together
Example 2: Partition filter ANDed with regular column filter
Example 3: Combination of AND, OR involving partition filters