It's been 3 months since the last Drill release, and here we are again with another one! We are pleased to announce the release of Apache Drill 1.15 as part of the MEP 6.1 (MapR Ecosystem Pack) release in early February. Drill continues to gain in popularity and adoption, both in the open source community and with MapR customers. There have been several improvements in this version of Drill that are described in detail below.
Highlights include:
- Query planning optimizations that leverage MapR Database's secondary indexes, now open sourced
- A new FILES table in INFORMATION_SCHEMA
You can now use Drill to query data in S3 through the new cloud storage plugin for S3. This is a major step forward for Drill in integrating with cloud applications. With the ability to run ANSI SQL queries on preexisting data in S3, Drill provides interoperability for cloud applications that write to S3 in formats that Drill can read.
Here's how it works: you configure the AWS credentials in the S3 plugin or in Drill's core-site.xml as documented here, and you can then use Drill to access any file in the S3 bucket you just configured. The plugin is secure and fully read/write – you can read from tables using SELECT or write to them using CTAS (CREATE TABLE AS) – and it is fully integrated with the system tables in INFORMATION_SCHEMA.
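As a sketch of that setup (the key values below are placeholders, not working credentials), the AWS credentials can go into Drill's core-site.xml using the standard Hadoop s3a properties:

```xml
<configuration>
  <!-- Placeholder credentials: substitute your own AWS access/secret keys -->
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR-ACCESS-KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR-SECRET-KEY</value>
  </property>
</configuration>
```

With the credentials in place, point the S3 storage plugin's connection at your bucket (for example, s3a://my-bucket) in the Drill Web UI, and the files in that bucket become queryable.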
If you want to know what capabilities this provides to your big data environment, keep reading.
MapR already has storage tiering solutions built into the platform by providing the ability to transparently offload or recall data between a MapR volume, an erasure coded volume, and S3. This was released as part of MapR v6.1, and this article explains its workings in great detail. But at a high level, when you have storage tiering at the file system level, the MapR distributed file system offers a single unified view of your data, transparently moving the data between the different storage tiers as needed to serve files. The data that's offloaded to S3 as a result of this is in a format only MapR can read. When you need the ability for other cloud applications to be able to process this data in S3, that's when you use Drill. When you use Drill to offload data from a MapR cluster to S3, it is readable by any reader who can read from S3!
The following diagram shows how the data flows between a MapR cluster and S3.
You could also use Drill to pull previously archived data back into a Drill cluster to perform any necessary ad hoc analysis. I'd like to share a real-world example from a prior experience. I was working for a large bank a few years ago, when we were tasked with performing an ad hoc analysis of a large data set that had already been backed up to tape to satisfy a regulatory reporting requirement. This required physically transporting the tapes from the backup data center to the production data center and mounting them as devices on the application server. These would then be restored, one at a time, onto specially installed RDBMS instances, where the ad hoc analysis would be performed. The whole initiative took about 3 months. Had the data been offloaded to S3 storage instead of tape, which offers about the same level of reliability if not higher, the ad hoc analysis would have been just one Drill query on a MapR cluster away:
CREATE TABLE dfs.archive.my_restore_of_dec2008 AS
SELECT * FROM s3.archive_dec2008.db_bkp_parquet;
There are numerous use cases like this one that can be limited only by your imagination. If you want to get going by running SQL against S3, here's a cool video demo to get you started: Apache Drill accessing JSON tables in Amazon S3 video demo.
Parquet files are made up of one or more row groups, and each row group carries metadata in the file footer, such as the number of rows and, for each column in the row group, the minimum and maximum values and the number of null values. Example footer statistics for one column are shown below.
[order_id] INT32  14.1% of all space  [PLAIN, RLE, BIT_PACKED]
  min: 24.582M  max: 24.582M  average: 24.582M  total: 24.582M  (raw data: 24.579M saving 0%)
  values: min: 6.144M  max: 6.144M  average: 6.144M  total: 6.144M
  uncompressed: min: 24.579M  max: 24.579M  average: 24.579M  total: 24.579M
  column values statistics: min: 1006, max: 70000995, num_nulls: 0
Starting with Drill 1.8, the column minimum and maximum values in the footers are used by the query optimizer to prune out row groups that do not match the query's filter criteria. For example, if the planner is considering the row group shown above for a query with the following predicate,
WHERE t.order_id between 200 and 999 ;
the row group can be eliminated as the minimum and maximum values for the column are outside the range predicate used in the query. This is done during planning so the run time does not waste any time trying to read this row group. This release includes support for row group pruning for STRING and DECIMAL data types.
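To illustrate the idea (this is a toy sketch in Python, not Drill's planner code), pruning a row group for a range predicate boils down to an interval-overlap check against the footer's min/max values:

```python
# Toy sketch of min/max row group pruning (not Drill internals).

def can_prune(rg_min, rg_max, pred_lo, pred_hi):
    """A row group can be skipped if its [min, max] range
    cannot overlap the predicate's [lo, hi] range."""
    return rg_max < pred_lo or rg_min > pred_hi

# Row group from the footer above: order_id spans [1006, 70000995].
# Predicate: WHERE order_id BETWEEN 200 AND 999
print(can_prune(1006, 70000995, 200, 999))   # True: skip the row group
print(can_prune(1006, 70000995, 500, 2000))  # False: ranges overlap, must read
```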
To get the best results from this feature, the data should be sorted on the column for which row group pruning is desired. Sorting effectively range-partitions the row groups: the minimum and maximum values in each row group stay within a narrow band, so predicates can prune many row groups outright, significantly improving performance. For this to work, the table must be loaded sorted on the columns used in the query's WHERE clause.
Until Drill 1.14, if you wanted to do a Cartesian product of two tables in Drill, you would have to join on a constant column in both tables. If one didn't exist, you could apply functions that convert a column's value to a constant and join on that constant column as the following example shows:
0: jdbc:drill:drillbit=10.250.100.31> select count(*) from `cp`.`employee.json` t1 inner join `cp`.`employee.json` t2 on t1.employee_id*0 = t2.employee_id*0;
+----------+
|  EXPR$0  |
+----------+
| 1334025  |
+----------+
1 row selected (0.813 seconds)
This causes performance issues: the hash join applies a hash function to a constant value, so every row hashes to the same bucket, resulting in severe processing skew. I'm happy to say that in this version of Drill, the workaround is no longer needed, because Drill now natively supports the ANSI SQL CROSS JOIN construct. The native implementation takes into account the compute resources available cluster-wide and is fully parallelized in the query plan. CROSS JOIN is frequently used by BI tools to produce a denormalized extract of an underlying data model, so that optimizations such as caching in a reporting format can be applied. This is a major milestone for integrating with many BI tools.
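To see why the old workaround skews, here is a toy Python sketch (the bucket count and row values are made up, and this is not Drill's hash join code): a constant join key sends every row to a single hash bucket, while a natural key spreads rows across buckets and workers.

```python
# Toy illustration of hash join skew on a constant join key.
from collections import Counter

NUM_BUCKETS = 8
rows = list(range(1000))  # stand-in for employee_id values

# Workaround-era key: employee_id * 0 is always 0, so one bucket gets everything.
skewed = Counter(hash(r * 0) % NUM_BUCKETS for r in rows)
# A key with distinct values spreads rows across all buckets.
spread = Counter(hash(r) % NUM_BUCKETS for r in rows)

print(len(skewed))  # 1  (all 1000 rows land in one bucket)
print(len(spread))  # 8  (rows are spread over all buckets)
```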
Another cool feature in Drill 1.15 is a small change in how in-memory hash tables are built that yields big improvements in join processing. Consider a query of the following form:
SELECT c.customer_id, c.customer_name
FROM customers c
WHERE c.status_id IN (SELECT cs.status FROM c_status cs WHERE cs.type = 'S');
When the plan is generated, a SELECT DISTINCT is applied to the c_status table so that the in-memory hash table is not bloated by duplicates. This deduplication can be an expensive operation, consuming both memory and CPU cycles. This version includes an optimization that skips the upfront DISTINCT and instead deduplicates transparently while the hash table is built. This not only improves performance but also reduces memory usage. The table below shows the improvement this brings to some TPC-DS benchmark queries.
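The effect of that change can be sketched in a few lines of Python (a hypothetical illustration, not Drill's actual hash join code): dropping duplicate build-side keys as they are inserted means no separate DISTINCT pass and no duplicate entries in the table.

```python
# Toy sketch: deduplicate build-side keys while constructing the hash table.

def build_hash_table(build_rows):
    """One pass over the build side: duplicates are dropped as they
    arrive, so the table never bloats and no upfront DISTINCT is needed."""
    table = {}
    for key in build_rows:
        table.setdefault(key, True)  # insert only if the key is unseen
    return table

statuses = ['S1', 'S2', 'S1', 'S3', 'S2', 'S1']  # duplicate-heavy build side
table = build_hash_table(statuses)
print(sorted(table))  # ['S1', 'S2', 'S3']
```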
You can now include positional and column aliases in GROUP BY and ORDER BY. ANSI SQL allows a shorthand notation for referring to columns in GROUP BY and ORDER BY clauses by their position in the SELECT list or by the names they are aliased to. Drill now supports both column aliases and positional aliases in GROUP BY, HAVING, and ORDER BY clauses. The examples below show queries that work in Drill 1.15 but failed in prior versions.
-- Positional aliases in GROUP BY
select length(n_name), n_regionkey from cp.`tpch/nation.parquet` group by 1, 2;

-- Column aliases in GROUP BY
select length(n_name) as len, n_regionkey as key from cp.`tpch/nation.parquet` group by len, key;

-- Column aliases in HAVING
select length(n_name) as len, count(*) as cnt from cp.`tpch/nation.parquet` group by length(n_name) having cnt > 1;

-- Positional aliases in ORDER BY
select n_regionkey, n_name from cp.`tpch/nation.parquet` order by 1, 2;

-- Column aliases in ORDER BY
select n_regionkey as r, n_name as n from cp.`tpch/nation.parquet` order by r, n;
A couple of new functions have been added that make manipulation of TIMESTAMP values easier. The following function allows an interval to be added to either a DATE or a TIMESTAMP:
SELECT TIMESTAMPADD(DAY, 3, DATE '2018-12-31')
The other lets you find the difference between two dates or two timestamps or between a date and a timestamp and displays them in any date or timestamp component – DAYS, HOURS, MINUTES, SECONDS, MONTHS, YEARS:
SELECT TIMESTAMPDIFF(HOUR, TIME '18:00:03.600', time_col)
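For reference, here is the same arithmetic in plain Python (the timestamp values below are made-up examples, not from the queries above): TIMESTAMPADD shifts a date by an interval, and TIMESTAMPDIFF counts whole units between two timestamps.

```python
# Plain-Python illustration of the arithmetic behind the two SQL functions.
from datetime import date, datetime, timedelta

# Equivalent of TIMESTAMPADD(DAY, 3, DATE '2018-12-31')
result = date(2018, 12, 31) + timedelta(days=3)
print(result)  # 2019-01-03

# Equivalent of TIMESTAMPDIFF(HOUR, ...) between two example timestamps
delta = datetime(2018, 12, 31, 18, 0, 3) - datetime(2018, 12, 31, 6, 0, 0)
hours = int(delta.total_seconds() // 3600)
print(hours)  # 12
```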
The Web UI keeps getting loaded with features! This release packs in the following enhancements:
In keeping with MapR's open source commitment, the query planning and optimization enhancements to support MapR Database's secondary indexes, which were previously available only to MapR customers, have been open sourced. With this, we encourage the developer and open source community's participation in improving usability and rolling out new features that leverage secondary indexes.
A new system table, FILES, has been added to the INFORMATION_SCHEMA system tables. It shows the files and directories in all workspaces accessible to the current user. An option, disabled by default, makes it descend recursively into directory structures to display nested directories and files. It can also discover and display files from the new S3 storage plugin.
The MapR Drill Product Team