Dataware for data-driven transformation

Drill 1.14 Released on MapR

Contributed by

6 min read

We just released yet another version of Drill, our wildly popular, open source, SQL-on-anything query engine. Apache Drill 1.14 is now available in MapR 6.1 as part of MEP 6.0 (MapR Expansion Pack). As we continue to execute on the core themes for Drill, you'll see improved query performance and resource utilization with the following new features and improvements:

  • A new SQL construct: LATERAL .. UNNEST
  • Leveraging of indexes on complex nested types
  • More granularity in spill to disk
  • Ability to configure batch sizes to reduce memory usage
  • Faster reads of Parquet files
  • An improved Web UI

  • Ability to provide storage plug-in definitions at boot time

  • A Calcite upgrade for broader SQL coverage
  • 129 bugs squashed! (See the release notes for a complete list of bugs.)

New SQL construct LATERAL .. UNNEST

LATERAL .. UNNEST is an ANSI SQL construct that efficiently performs operations on nested data.

UNNEST is a table function applied on arrays. UNNEST acts like a SQL for each loop because the function iterates over each array element, applying any functions, limits, or aggregations on the array elements before associating them with the parent row through LATERAL.

Consider the following example:

Without LATER JOIN .. UNNEST

With LATERAL .. UNNEST

Using LATERAL … UNNEST improves performance by an order of magnitude. Intelligently pushing down aggregate, filter, and limit conditions onto child elements of a complex type reduces the need to explode the array elements into separate records.

Low latency queries now possible on nested MapR Database types

Starting in MapR v6.1, secondary indexes can be created on complex types, like ARRAYs and MAPs, stored in MapR Database. Index keys can be defined on entire ARRAYs themselves or on elements of arrays, even if they are MAPs. Drill can leverage these indexes to get low-latency document lookups when querying these elements.

Consider the following example:

How good is this, anyway? Testing shows significant performance improvements, as well as a significantly reduced memory footprint since fewer rows need to be processed in memory.

More granular spill to disk capability to support high concurrency

Drill processes data in-memory to provide high performance; however, memory is a limited resource that must be made available to all tenants in a cluster. If you run Drill in a multi-tenant cluster, you can configure Drill to use a fixed amount of memory such that any additional memory allocation requests from operations spill to disk. Spilling data to disk prevents Drill queries from failing with OOM errors when running multiple queries in a crowded cluster.

Starting in Drill 1.14, the Hash Join operator can spill to disk. Previously, only the Sort and Hash Aggregate operators spilled data to disk. The Hash Join operator is one of the most frequently seen operators in any query with a join. By adding spill to disk support for the Hash Join operator, you have more fine-grained control over the memory for in-memory processing and spill to disk.

You can run all queries in memory, limiting the level of concurrency in memory-constrained clusters, or you can configure the amount of memory spilling operators can use before they spill data to disk, which frees up system memory to support higher concurrency.

Ability to configure batch size to reduce memory usage

Batch sizing – introduced in Drill 1.13 (see Drill 1.13 blog) – specifies the size of the batches that Drill processes internally. Increasing the batch size allocates more memory up front, which reduces operator back and forth and increases overall performance, but at a higher memory utilization. Decreasing the batch size reduces the memory utilization of a query, but at a slightly higher compute cost. With dynamic batch sizing, you have more control over how much memory queries use. In this release, the Hash Join, Hash Aggregate, Project, and Union operators can be configured for a variable batch size.

Drill has an improved Parquet Reader

The Parquet Scan operator has been enhanced to:

  • push down filters on columns with nested types to the Parquet scan library. Examples:

  • push down filters on inner joined tables that have equality predicates only on one side of the join. Examples:

In these examples, Drill pushes down the predicates from one side of the join to the other side of the join. This now works regardless of how many joins there are in the query. Performance improvements observed for certain queries are shown in the tables below:

Other Improvements

  • Drill's underlying SQL parsing and planning library, Calcite, has been upgraded, providing expanded ANSI SQL compliance.
  • Storage plug-in handling improvements (DRILL-4580)

  • SP definitions now read through a file provided during Drillbit startup

  • Ability to export SP definitions from Web UI

  • DECIMAL is now a supported data type for reads and writes
  • Ability to export result set from Web UI to CSV with a user-specified delimiter

  • New aggregate function – ANY_VALUE()

Ex: SELECT col1, ANY_VALUE(col2) from <table> group by col1

picks 1 random value of col2 for every group of col1

And you'll find numerous more improvements. Download the latest version of Drill from our website.

The MapR Drill Product Team


This blog post was published December 18, 2018.
Categories

50,000+ of the smartest have already joined!

Stay ahead of the bleeding edge...get the best of Big Data in your inbox.


Get our latest posts in your inbox

Subscribe Now