MapR 5.0 Documentation : Partition Pruning

Partition pruning is a performance optimization that limits the number of files that Drill reads for file system queries. Drill only reads a subset of the files that reside in a file system when a query matches certain filter criteria. Typically, table data in a file system is organized by directories and subdirectories. Queries on table data may contain WHERE clause filters on specific (sub) directories.   

Drill’s query planner evaluates the filters as part of a Filter operator. If no partition filters are present, the underlying Scan operator reads all files in all (sub) directories and then sends the data to operators downstream, such as Filter.   

When partition filters are present, the query planner determines if it can push the filters down to the Scan such that the Scan only reads the (sub) directories that match the partition filters, thus reducing disk I/O.

Partition Pruning Example

The /Users/max/data/logs directory in a file system contains (sub) directories that span a few years.

The following image shows the hierarchical structure of the …/logs directory and (sub) directories:

 

 

The following query requests log file data for 2013 from the …/logs directory in the file system:

SELECT * FROM dfs.`/Users/max/data/logs` WHERE cust_id < 10 and dir0 = 2013 limit 2;

If you run the EXPLAIN PLAN command for the query, you can see that the …/logs directory is filtered by the scan operator.

EXPLAIN PLAN FOR SELECT * FROM dfs.`/Users/max/data/logs` WHERE cust_id < 10 and dir0 = 2013 limit 2;

The following image shows a portion of the physical plan when partition pruning is applied:

Filter Examples

The following queries include examples of the types of filters eligible for partition pruning optimization:

Example 1: Partition filters ANDed together

SELECT * FROM dfs.`/Users/max/data/logs` WHERE dir0 = '2014' AND dir1 = '1'

Example 2: Partition filter ANDed with regular column filter

SELECT * FROM dfs.`/Users/max/data/logs` WHERE cust_id < 10 AND dir0 = 2013 limit 2;

Example 3: Combination of AND, OR involving partition filters

SELECT * FROM dfs.`/Users/max/data/logs` WHERE (dir0 = '2013' AND dir1 = '1') OR (dir0 = '2014' AND dir1 = '2')