We release updates to our core data platform every few months to give customers a fast path to the latest community-driven innovation through an incremental release vehicle called MEP (MapR Ecosystem Pack). As the community issues new releases, the MEP consolidates and delivers them to MapR customers in a timely manner. Last week, we announced the release of MEP 6.2, which includes updates to several major ecosystem components, described below:
Spark 2.4.0 now includes support for Hive metastore versions 2.2 and 2.3, new SQL functions that make working with arrays and pivoting data easier, and Avro, ORC, and Parquet library updates. The complete list of improvements can be found at https://mapr.com/docs/61/EcosystemRN/SparkRN-2.4.0-1904-MEP6.2.0.html
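To give a feel for the new array and pivot capabilities, here is a brief sketch of the kind of SQL that Spark 2.4 enables: a higher-order function applied to an array, and the PIVOT clause. The `sales` table and its columns (`year`, `quarter`, `amount`) are hypothetical examples, not from the release notes.

```sql
-- Higher-order function (new in Spark 2.4): apply a lambda to each array element
SELECT transform(array(1, 2, 3), x -> x * x) AS squares;

-- PIVOT clause (new in Spark 2.4 SQL): turn quarter values into columns.
-- 'sales' is a hypothetical table with columns year, quarter, amount.
SELECT * FROM sales
PIVOT (
  SUM(amount) FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4')
);
```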
As always, this revision of Drill (1.16) comes loaded with numerous performance improvements, usability enhancements in the Web UI, support for statistics on Parquet data, the ability to define schemas for text data, and much more. Some of these are described below.
Statistics collection and usage - When queries are planned and optimized for execution, the query optimizer must make numerous decisions in order to pick the most efficient plan possible. Without detailed statistics, it has to fall back on hard-wired rules and hardcoded cost thresholds, which are often incorrect. With statistics, the optimizer can make more informed decisions about join order, hash table sizes, and degree of parallelism. Almost all of this decision-making happens automatically; the end user only needs to collect statistics after a table is loaded, which is standard practice in most other RDBMS engines. With this release, statistics can optionally be collected for Parquet-backed tables and are automatically used by the query optimizer to pick the best query plan.
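Collecting statistics is a single command in Drill 1.16. The sketch below uses a hypothetical Parquet table `dfs.tmp.orders`; the optional `SAMPLE` clause trades accuracy for collection speed.

```sql
-- Collect statistics on a hypothetical Parquet-backed table
ANALYZE TABLE dfs.tmp.`orders` COMPUTE STATISTICS;

-- Optionally compute statistics from a sample of the rows to speed up collection
ANALYZE TABLE dfs.tmp.`orders` COMPUTE STATISTICS SAMPLE 10 PERCENT;
```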
Parquet metadata caching overhaul - Drill's query optimizer uses the metadata available in Parquet footers and in the MapR File System to make intelligent decisions about query execution: which file should be read by which drillbit so that remote reads are minimized, which Parquet row groups can be filtered out completely, how many rows are expected to be processed so that the degree of parallelism can be chosen more efficiently, and so on. It extracts this metadata from both the MapR File System and the Parquet footers and loads it into a JSON cache file via the REFRESH TABLE METADATA command. When the number of Parquet files, directories, partitions, or row groups is very high, this JSON file becomes very large, and a significant amount of planning time is spent just reading it. This release includes a major overhaul of how this metadata is captured and used. First, it provides the ability to collect Parquet column metadata only for "interesting" columns rather than all columns. This is explained using the picture below.
Interesting columns are columns that are ordered sequentially across row groups in such a manner that the minimum and maximum values for the column do not overlap across row groups. For example, if a table has 200 columns and only 1 column is actually ordered, metadata for the remaining 199 columns is kept in the cache file unnecessarily, since the values of those columns are likely spread randomly throughout the table and the Parquet statistics do not help prune them. This reduction in cache file size substantially reduces the time spent scanning the metadata cache file, as the picture below shows.
Second, the query optimizer decides whether or not to read the cache file based on user-configurable thresholds. Third, the directory-level, file-level, and column-level metadata are written to separate files, so even if column-level row group pruning is skipped, file- and directory-level metadata can still be used for locality assignment, row counts for degree-of-parallelism determination, and so on. Fourth, there are now two additional options that control how and when row-group pruning is done: one to fall back from planning-time row-group pruning to run-time row-group pruning, and another to disable it completely for tables that do not have any "interesting" columns.
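The "interesting" columns feature surfaces as a COLUMNS clause on the REFRESH TABLE METADATA command in Drill 1.16. A sketch, again using a hypothetical table `dfs.tmp.orders` and hypothetical column names:

```sql
-- Cache column-level metadata only for the two "interesting" (ordered) columns
REFRESH TABLE METADATA COLUMNS (order_date, region) dfs.tmp.`orders`;

-- Cache no column-level metadata at all; directory- and file-level
-- metadata is still written and used for locality and row counts
REFRESH TABLE METADATA COLUMNS NONE dfs.tmp.`orders`;
```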
Hive views can now be queried from Apache Drill 1.16 through the Hive storage plug-in.
select * from hive.`hive_view`;
Web UI enhancements - Numerous usability enhancements have been made to Drill's wildly popular Web UI. Result set sizes can now be restricted through a radio button in the Web UI. This offers the user some protection against queries that accidentally return a very large result set and overwhelm a slower browser. The user can, at their option, turn this limit off, as the following picture shows.
When query results are displayed through the Web UI, there's also a link to the profile of the query that was just executed.
You can now sort ascending or descending on any column of the Profiles page to order query profiles by any criterion you choose. You can use this to find the longest-running queries, group queries by the user ID that submitted them, and surface other useful information.
And this is one of my favorite features. When you're debugging query-performance issues for a query that runs with a large number of fragments on a large cluster, you're confronted with large tables full of numbers, making it hard to spot the largest or smallest values and identify the bottleneck. Often, the query bottleneck hides in plain sight. In this release, query threads that run slower than others are marked with a special icon so they are easily identifiable. Consider the following snippet taken from a long-running query.
There are a few other improvements and numerous bug fixes. The full list of improvements can be found at https://mapr.com/docs/61/EcosystemRN/drill-1.16.0-1904-release-notes.html
In Hue 4.3, Django has been upgraded to 1.11, which means Hue's already robust backend gets even stronger with numerous Django enhancements. Metadata embedded in the Hive metastore can now be intelligently searched, and a drag-and-drop interface builds the query text from the search results, as explained in this link. The complete list of improvements in this revision of Hue can be found at https://mapr.com/docs/61/EcosystemRN/HueRN-4.3.0-1904-MEP6.2.0.html