Accelerating SQL Queries in MapR Database JSON Using Apache Drill and Secondary Indexes

Contributed by

17 min read

Introduction

MapR Database JSON introduced and later expanded the support for native secondary indexes in the 6.x releases (see Secondary Indexes). These indexes are accessible both through a programmatic API such as OJAI and via SQL queries processed by Apache Drill. The combination of Drill plus MapR Database with secondary indexes is a powerful combination that brings the full breadth of ANSI SQL to a NoSQL database. It accelerates queries by orders of magnitude by avoiding full table scans, making this software stack an ideal platform to develop analytic applications on an operational data store.

Please see the section on ‘Analytic Engines’ in the blog Enterprise-Ready Database for Analytics the top level blog for an understanding of where an analytic engine fits into the broader database platform.

In this blog post, we first give a brief overview of Drill + MapR Database architecture and where secondary indexes fit into the picture. We then describe specific query patterns that benefit from secondary indexes. Queries with indexed columns in the WHERE clause are analyzed, including those involving CAST functions. Index intersection of 2 or more indexes is described. Additional concepts, such as statistics and index selection, are discussed to give the reader an insight into how Drill chooses the appropriate index.

A future blog will cover indexed columns in ORDER BY, GROUP BY, and Join conditions and also will describe additional optimization for row key join pushdown.

Drill with MapR Database

Drill is a distributed SQL query engine and has an extensible plugin-based architecture to handle a variety of data sources through storage or format plugins. For an overview of Drill’s query execution, please see query execution.

The above diagram shows a logical view of the interaction of Drill’s planner and executor frameworks with MapR Database format plugin. The plugin invokes MapR Database JSON Client APIs for table and index metadata, including statistics during query planning and the OJAI DocumentReader and related APIs during execution. These Java APIs are the primary way Drill works with the underlying MapR Database tables and indexes.

When a SQL query is received at a Drillbit, that node acts as a ‘Foreman’ node (coordinator) for that particular query (different queries may have different Foreman nodes). The query goes through several planning stages:

  • Parsing and validation
  • Logical plan generation
  • Physical plan generation
  • Parallelization

The logical and physical plan generation stages include both the core Drill planning rules and the plugin specific planning rules. The index planning rule is triggered in the core Drill planner during the physical planning phase and is followed up by the MapR Database plugin-specific filter pushdown and projection pushdown rules.

The MapR Database plugin also exposes the characteristics of the underlying primary table – for instance, number of tablets (which influences parallelization decisions) and their host affinity (needed to exploit data locality). Additionally, for index planning, it calls metadata APIs for the list of valid indexes for that table, and for each such index, it retrieves various properties associated with the index, including statistics.

Once the physical plan (with or without indexes) is created, the plan is serialized and sent by the Foreman to all Drillbits in the cluster for execution. On each node, Drill’s Scan operator encapsulates the record reader specific to the plugin – in this case, the reader for MapR Database’s JSON documents. The record reader for each tablet creates batches of data organized in Drill’s in-memory columnar format and the Scan operator sends these to downstream operators for further processing (joins, aggregations, sort, etc.). The diagram below depicts this pipelined execution model. An exchange operator forms a ‘major fragment’ boundary. Each major fragment is a logical phase of execution within which multiple minor fragments (threads) are run in parallel.

Secondary Indexing in MapR Database

Tables in MapR Database JSON are organized by the document ID, identified by the _id field, which serves as the primary key of the table. Document retrieval by this field is extremely fast, since it allows reading only a subset of the primary table, minimizing disk I/O. However, if the query’s filter conditions do not involve the _id field but instead are on secondary fields, MapR Database will have to do an expensive full table scan, unless it can leverage a secondary index on that field. For an understanding of the motivations for secondary indexing as well as related concepts, please see ‘What Are Secondary Indexes?

Filter Conditions in the WHERE Clause

Index-Only Scans for Covering Index

We will use the TPC-H lineitem table for the example queries. Consider the following query:

Query 1:

SELECT SUM(L_QUANTITY) AS total_quantity
FROM   lineitem
WHERE  L_SHIPMODE IN ( 'AIR', 'SHIP' )
       AND L_SHIPDate BETWEEN DATE '1995-01-01' AND DATE '1995-01-31';

Suppose there are several other queries in the workload having filter conditions on either L_SHIPMODE or L_SHIPDate. These columns would be good candidates for indexing. Before creating the indexes, it is advisable to check the number of distinct values (NDV) of the columns. This can be an estimate, but if the estimates are not readily available, you may need to run a simple COUNT(DISTINCT ) on the table, with the caveat that such queries may be long running on a large table. In the lineitem table, the NDV(L_SHIPMODE) = 7. Creating a single column index on a column with such small NDV is not recommended. The NDV(L_SHIPDate) = 2518. Thus, creating a composite index on both columns is recommended such that the combined NDV is large.

Before running the query again, it is recommended to check the EXPLAIN plan to see if the index is being used. In the EXPLAIN output, look for the property ‘indexName’ in the Scan node. This indicates the index is getting used for that table.

In this example, the index is a covering index because all the columns needed in the query are present in the index, either as indexed fields or as included fields. Thus, Drill will create an index-only scan plan. Note that the filter conditions on L_SHIPDate and L_SHIPMODE have been pushed into the index scan. They are evaluated by the MapR Database server and only the qualifying rows are returned over the wire to Drill via the MapR Database client.

Now, suppose the query was as follows:

Query 2:

SELECT SUM(L_QUANTITY) AS total_quantity
FROM   lineitem
WHERE  L_SHIPDate BETWEEN DATE '1995-01-01' AND DATE '1995-01-31';

For planning/execution of this query, the index that was created for Query 1 will NOT be used. The composite index keys {L_SHIPMODE, L_SHIPDate} impose a sort order with the sort key being the pair of columns. Drill will only consider the index if range pruning can be done by the MapR Database server, when doing the index scan. Since the WHERE clause only contains L_SHIPDate, which is not the leading prefix of the index fields, range pruning is not possible; hence, the index is not eligible. An alternative is to make L_SHIPDate the leading key of another index.

What if we want to select additional columns that are not present in the index? This is a more general case, since creating covering indexes for a diverse workload is not always feasible or even recommended (due to extra storage and index maintenance costs). This scenario is discussed below.

Join-Back to Primary Table for Non-Covering Index

Consider the following query, which references the additional column L_DISCOUNT, which is not present in the index.

Query 3:

SELECT SUM(L_QUANTITY) AS total_quantity,
       MAX(L_DISCOUNT) AS max_discount,
FROM   lineitem
WHERE  L_SHIPMODE IN ( 'AIR', 'SHIP' )
       AND L_SHIPDate BETWEEN DATE '1995-01-01' AND DATE '1995-01-31';

To process this query, Drill planner will generate a ‘non-covering index’ plan, as shown below:

Note again that the index l_shipmode_shipdate is used for the index scan, similar to the covering index plan, except in this case we only produce the row key field (_id field) from the index as shown by the highlighted Project node. This feeds into a RowKeyJoin plan, which does a ‘join-back’ to the primary table because we need to project all the columns needed in the query. Note that this is not a real join, such as inner or outer join, but rather a lookup of rows based on a set of row keys. In general, the values of the row keys will be spread randomly throughout the primary table; i.e., it is very likely that all tablets of the primary table have rows corresponding to L_SHIPMODE IN (‘AIR’, ‘SHIP’). Therefore, doing this lookup is a random I/O operation from the primary table. This is achieved through a special type of scan called Restricted Scan (shown as RestrictedJsonTableGroupScan in the plan).

Since Drill is a distributed query engine, such a lookup will be performed on multiple minor fragments. The row keys fetched from the index are sent to many destination minor fragments via a partitioning exchange operation that takes into account data locality. The receiving side performs the row lookup. For simplicity, the parallelized plan is not shown above.

Views with CAST Functions

Analytic workloads often rely on views where the column is CAST to a certain data type. In particular, since MapR Database JSON tables don’t have a schema defined in the metadata, defining a view by specifying the CAST becomes a useful mechanism to provide a schema (assuming the schema of the fields is not changing). Here’s an example view:

Consider the following query on the view:

Query 4:

SELECT SUM(L_QUANTITY)
FROM   lineitem_view
WHERE  L_SHIPINSTRUCT LIKE 'DELIVER%';

Here the predicate is on the view column L_SHIPINSTRUCT, which is actually an alias to CAST(L_SHIPINSTRUCT as varchar(20)). MapR Database supports creating functional indexes on CAST functions, an extremely useful feature for this use case. Let’s create such an index:

Drill will examine the filter conditions on the view and leverage the underlying functional index wherever possible. This is shown in the following plan, where the LIKE predicate is pushed into the index scan, corresponding to the CAST functional index. The ‘$0 MATCHES..’ indicates the 0-th position in the composite key, corresponding to the CAST(L_SHIPINSTRUCT).

Index Intersection

Consider a filter condition 'WHERE a > 10 AND b < 20.' Suppose a single composite key index does not exist on both columns, but two separate indexes exist on 'a' and 'b.' Drill will generate an index intersect plan, where row keys from each index are retrieved and intersected and only the common row keys are used for the join-back to primary table.

It is important to remember that the index intersection plan is costed along with the single index plans, and whichever is cheaper is chosen. The intersection is done through a hash join operator, so this adds cost (CPU, memory, and network I/O if the build side of the hash join has to be broadcasted), so it is possible that in some cases the single index plan may turn out cheaper – this depends on how much reduction in selectivity happens after intersection.

Statistics and Index Selection

A table may have several indexes, including single and composite key indexes, range indexes, and hash indexes. Several of these indexes may have an overlapping list of index fields or included fields. For a given query, it is important to pick the right index(es) to get the optimal performance. Statistics, such as estimated number of rows matching a filter condition and the average row size, are important parameters for the cost-based index selection. Drill will ask MapR Database for the estimated row count for a single-column filter condition and then compute the combined row count based on ANDs and ORs in the condition. For example, suppose the filter condition is:

WHERE a > 10 AND b < 20 AND c LIKE ‘xyz%’

Drill will consult MapR Database for the individual selectivities of a, b, and c predicates. Then it will multiply the three, since they are ANDed together. For OR-ed predicates, the selectivities are added and capped by a maximum value.

Once the statistics are available, Drill planner in conjunction with Apache Calcite provides a cost-based index selection. In addition, the Drill planner employs a heuristic to reduce the search space of planning, as described below in 'Selectivity and Thresholds.' For each candidate index, it estimates the total cost of the index access plus join-back to primary table cost (for non-covering index). Based on these, the candidate indexes are ranked, based on leading selectivity, collation (sortedness) property, and whether it is a covering or non-covering index. The top five indexes per table are chosen (this number is configurable) to be considered for plan generation. These may also include index intersection. Thus, it is possible that index i1 and i2 may individually not qualify, based on selectivity, but their combined selectivity after intersection could be low enough to qualify. The planner compares the cumulative cost of the possible index plans along with the original full table scan plan and picks the cheapest plan for execution.

Selectivity and Thresholds

For covering index, the Drill planner will generate a covering index plan, even up to 100% estimated selectivity. The expectation is that an index-only plan is going to be cheaper compared to a full table scan, due to smaller row widths of the index. For non-covering indexes, due to the random I/O nature of the rowkey join-back to primary table, the default selectivity threshold is small: 2.5%. This is configurable through the setting _planner.index.noncovering_selectivity_threshold_. If the estimated selectivity of the filter condition is above this threshold, the non-covering index plan is not generated for that index; the rationale for this is that each new plan adds to the optimizer search space and increases planning time, so if the estimated row count is already high, it is unlikely to be chosen and it is better to prune it out early. In addition, a global configuration setting: _planner.enable_index_planning_ (default is TRUE) enables or disables index planning altogether.

Summary

Analytic workloads running complex SQL queries via Drill on MapR Database JSON tables can get substantial performance boost by leveraging secondary indexes in MapR Database. Drill’s query optimizer performs a cost-based index selection based on statistics gathered from MapR Database. It compares the cost of a full table scan with candidate index plans and picks the cheapest plan. The index plans may include covering index (index-only scan) plans or non-covering index involving join-back to the primary table to retrieve columns that are not in the index. Index intersection of two or more indexes and queries on functional indexes involving CAST functions is also supported.

These capabilities make the software stack of Drill + MapR Database with Secondary Indexes an ideal platform to develop SQL analytic applications on an operational data store. In a future blog, we will describe additional capabilities related to this stack.


This blog post was published January 04, 2019.
Categories

50,000+ of the smartest have already joined!

Stay ahead of the bleeding edge...get the best of Big Data in your inbox.


Get our latest posts in your inbox

Subscribe Now