In Case You Missed My Release Pitch
We just released Apache Drill 1.12 on MapR 6.0 as part of MEP 4.1 (MapR Expansion Pack). Continuing with the Drill 1.11 theme that I outlined in my previous post here in late November, we have made improvements in the most recent release.
Here are the highlights:
Data Exploration on Operational Data on JSON Tables in MapR Database and Historical Data on Parquet in MapR XD
One of the key features of MapR Database and MapR XD is that they allow data scientists to reuse the same data for advanced analytics, such as machine learning, AI, or predictive analytics, without the need to export the data. Prototyping is critical to designing new algorithms, and during prototyping the focus is on exploring the data while running experiments. In MapR 6.0, we launched a new product called the MapR Data Science Refinery, an easy-to-deploy and scalable data science toolkit with native access to all platform assets and superior out-of-the-box security. To enable data exploration with Drill while prototyping algorithms, data scientists can use the same notebook in Apache Zeppelin to run in-place ad hoc SQL queries (as shown in Figure 1) and visualize the results.
From a technical standpoint, we enhanced the performance of exploratory queries on JSON tables in MapR Database by:
With this feature, Drill on JSON tables on MapR Database can leverage secondary indexes to improve performance for exploratory queries (that require no filters) and highly selective queries (that have filters) that require sorting, aggregation and joins.
Figure 1: Sample of modified TPCH exploratory SQL queries on JSON tables in MapR Database that would benefit from the performance feature.
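To make the idea of an index helping sorted and selective queries more concrete, here is a toy sketch in Python. It is not how MapR Database or Drill implement secondary indexes; the rows, the `price` field, and the `top_n_by_price` helper are all hypothetical. It only illustrates why a structure kept in sorted order can answer an ORDER BY ... LIMIT query, with or without a range filter, without sorting the whole table.

```python
import bisect

# Toy rows standing in for documents in a JSON table (hypothetical data).
rows = [
    {"id": 1, "price": 30.0},
    {"id": 2, "price": 10.0},
    {"id": 3, "price": 50.0},
    {"id": 4, "price": 20.0},
    {"id": 5, "price": 40.0},
]

# A secondary index modeled as (indexed value, row position) pairs in sorted order.
price_index = sorted((row["price"], pos) for pos, row in enumerate(rows))
index_keys = [key for key, _ in price_index]


def top_n_by_price(n, min_price=None):
    """Emulate SELECT * ... [WHERE price >= min_price] ORDER BY price LIMIT n."""
    # Binary search locates the first qualifying index entry instead of scanning all rows.
    start = 0 if min_price is None else bisect.bisect_left(index_keys, min_price)
    # Index entries are already in price order, so no separate sort step is needed.
    return [rows[pos] for _, pos in price_index[start:start + n]]


print([r["id"] for r in top_n_by_price(2)])                  # no filter
print([r["id"] for r in top_n_by_price(2, min_price=25.0)])  # selective filter
```

In this toy model the unfiltered query reads only the first `n` index entries, and the filtered query starts reading at the position found by the binary search.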
Parquet, the columnar file format, is considered a standard among our customers for historical analytics on the MapR Platform. To improve Drill's performance on Parquet, we conducted an investigation last year into the scanner itself that revealed the following:
Our tests, described below, show that we are able to get about a 2-4x improvement in the scan performance of Parquet files. The performance gain will be most pronounced for SELECT * queries that need to scan the entire table.
Test Results of Running Exploratory Queries on Operational Data in MapR Database JSON Tables
This was the test setup that we put together to see how the combination of the above performance optimizations would benefit the sample queries:
The results are shown in Figure 2. All the queries show significant improvement in performance except for two queries that require retrieving the maximum values. Such a query would require scanning the entire index table to ensure that the maximum value was identified.
Figure 2: Performance test of a sample of modified TPCH exploratory SQL queries on JSON tables in MapR Database.
Test Results of Running Exploratory Queries on Historical Data on Parquet Files in MapR XD
This was the test setup that we put together to see how the combination of Parquet scanner optimizations would impact the performance of a SELECT * type of query:
The results of the testing are shown in Figure 3. We measured the performance gain of an individual scan fragment across all fragments for multiple runs of a query and observed the 2X improvement. However, as predicted, the overall query performance gain (30% in Figure 3) depends on other factors such as filter complexity, aggregations, joins, or sorting operations.
Figure 3: Performance test of an exploratory query on Parquet in MapR XD.
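The gap between a 2X scan gain and a 30% overall gain is what Amdahl's law predicts when the scan is only part of the total runtime. The short calculation below is a back-of-the-envelope sketch, not a measurement: it assumes that "30%" means the whole query ran 1.3x faster and that everything other than the scan was unchanged. The function names are illustrative only.

```python
def overall_speedup(scan_fraction, scan_speedup):
    """Amdahl's law: only the scan share of the runtime gets faster."""
    return 1.0 / ((1.0 - scan_fraction) + scan_fraction / scan_speedup)


def implied_scan_fraction(overall, scan_speedup):
    """Invert Amdahl's law: the scan share consistent with an observed overall speedup."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / scan_speedup)


# Assumption: "30%" is read as a 1.3x overall speedup, with the scan itself 2x faster.
fraction = implied_scan_fraction(overall=1.3, scan_speedup=2.0)
print(f"scan share of original runtime: {fraction:.2f}")
print(f"check: overall speedup = {overall_speedup(fraction, 2.0):.2f}x")
```

Under these assumptions the scan would account for a little under half of the original runtime, which is consistent with the remaining operators limiting the end-to-end gain.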
Wild Card Text Search Performance on Parquet Files
Unknown to many customers, Drill, much like standard SQL, has the ability to search text (see Figure 4) within a document as part of a filter. A "regular expression," specified as a grammar in the filter, can help detect text patterns. We introduced several improvements to this search in Drill 1.11 but tested them in this release. Prior to Drill 1.11, Drill used the Java Regular Expressions library for pattern matching. For each record, the library required the data to be copied from direct memory (the area of memory that Drill controls) into the heap (which is managed by the garbage collector and is not under Drill's control) before the regular expression was evaluated, which hurt performance.
To improve upon this feature, we did the following:
We carried out tests in the same cluster as the tests for exploratory queries in MapR Database JSON tables described above. The queries ran on the TPCH dataset with a scale factor of 1000 and Parquet files. The results are shown in Figure 4. We see an increase in performance for regular expressions that had one wild card per word. As more % wild cards (i.e., any text) were present in the query, a full text scan had to be done, which hurt performance. Addressing this is part of our roadmap for the next phase of this project.
Figure 4: Performance test of wild card text search queries on Parquet in MapR XD.
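To illustrate why simple wild card patterns can be cheaper to evaluate than a general regular expression, here is a minimal Python sketch. It is an illustration of the general idea only, not Drill's implementation (which is written in Java and works on direct-memory buffers). The function names are hypothetical, and LIKE escape characters are deliberately not handled. Patterns that only have % at the ends are served by plain prefix, suffix, or substring checks; anything more complex falls back to a regular expression.

```python
import re


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into a compiled regular expression (general path)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # % matches any run of characters
        elif ch == "_":
            parts.append(".")            # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is a literal
    return re.compile("".join(parts), re.DOTALL)


def make_like_matcher(pattern):
    """Return a predicate for the pattern, using plain string checks when possible."""
    body = pattern.strip("%")
    if "%" in body or "_" in body:
        # Wild cards inside the text: fall back to the general regex path.
        regex = like_to_regex(pattern)
        return lambda s: regex.fullmatch(s) is not None
    leading = pattern.startswith("%")
    trailing = pattern.endswith("%")
    if leading and trailing:
        return lambda s: body in s          # '%abc%'  -> contains
    if trailing:
        return lambda s: s.startswith(body)  # 'abc%'   -> starts with
    if leading:
        return lambda s: s.endswith(body)    # '%abc'   -> ends with
    return lambda s: s == body               # 'abc'    -> exact match


comment = "carefully packed special requests"
print(make_like_matcher("%special%")(comment))
print(make_like_matcher("%special%requests%")(comment))
```

Both calls match the sample string; the first is answered by a substring check, while the second has a wild card in the middle and goes through the regular expression path.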
Community Contribution Highlights
I am happy to report that the Apache Drill community has ramped up its activity in the last several months. In September of last year, we organized a Drill Developer Day that attracted users and developers around the Bay Area. I thought it would be worthwhile to highlight some of the contributions, as these are available in the current release as well. Note that we have not subjected these features to our internal testing and hence do not support them. But that should not deter you from trying them out and suggesting improvements through the dev and user mailing lists.
Here are the highlights:
It's Your Turn to Try
No release is ever complete without you giving it a try and seeing for yourself if it delivers the value you are looking for. With that, I invite you to give the new Apache Drill 1.12 on MapR 6.0 a try, and let me know your feedback.
The MapR Community pages for Apache Drill also have a lot of good material. Check them out here.