11 min read
In this week's Whiteboard Walkthrough video, Neeraja Rentachintala, Senior Director of Product Management at MapR Technologies, explains how Apache Drill optimization achieves interactive performance for low latency SQL queries on very large data sets when working with familiar BI tools such as Tableau, Microstrategy or Qlikview and includes techniques used for successful optimization using Drill in production. Neeraja describes Drill optimization capabilities based on Apache Calcite that include projection pruning, filter push down, partition pruning, cost-based optimization and meta-data caching.
For more on Apache Drill go to:
The full video transcription follows:
Hi everybody, my name is Neeraja Rentachintala; I am in product management team at MapR-. Today, I'm going to briefly talk about how Apache Drill achieves interactive performance on large-scale data sets.
For those of you who are not familiar with Drill: Drill is an open-source, low-latency SQL- on-Hadoop query engine, designed to provide data exploration as well BI and Ad-Hoc queries capabilities directly on top of big data, using very familiar SQL tools such as Tableau, MicroStrategy, and QlikView. In the next few minutes, I am going to briefly introduce some of the key optimization capabilities in Drill.
The foundation of Drill optimization comes from Apache Calcite, which is open-source project which allows you to parse the SQL queries, as well as do a lot of optimization on top of the queries. This serves as the foundation for Drill logical as well as physical planning of a given SQL query. In addition to taking Calcite and applying it in the Drill, we also extended it to make it more suitable for distributed environment, distributed query execution, which is what Drill provides.
On top of it, Drill does a variety of rule-based as well as cost-based optimizations. I listed some of the examples here. The first one is projection pruning, so the idea here is: if I have a table with thousand columns, and if my query is retrieving only three columns out of those thousand columns, projection pruning allows Drill to query only those three columns, rather than trying to retrieve the thousand columns from the disk. This is a huge savings in terms of the IO, disk IO, that you would otherwise have to spend a lot of time querying.
The second aspect is filter pushdown. Drill is a distributed query engine on top of data sources, so if the underlying system is capable of handling certain operations, Drill has the ability to push down those capabilities. Filter is a great example of it. For example, if Drill is querying on top of Parquet, Drill is querying on top Fishbase, on MongoDB, on an RDBMS system ... Drill has the ability to understand the query, take the filters, and push it down to the underlying storage systems, so that the data gets filtered at the storage layer, rather than bringing over the network and trying to do the filtering on the Drill side, which again is a huge savings in terms of IO as well as your network bandwidth.
The next aspect is partition pruning. In the past one to one-and-a-half years, after Drill has been widely used by a variety of customers, this is one of the key techniques that I have seen customers very much benefiting in order to achieve interactive performance on using Drill. Hadoop is all about large-scale data, so think about, you're an organization trying to store huge amounts of sales data in Hadoop. For example, I'm storing 30 years’ worth of data here. Let me give an example of how partition pruning can actually help you with respect to achieving performance for Drill queries.
My data is organized and the sales data is organized in this fashion. I have, first of all, organized data as years, followed by months; so I have year 1 data, year 2 data, going all the way to, let's say, year 30 data. Within a year, I have organized the data as months, and within months, I have organized the data as days. This is my billions of records of data that is residing in Hadoop.
Now, my user, whether this could be a Tableau user, MicroStrategy user, my data scientist, a QA engineer, or developer comes in and they're trying to do queries on this particular data, ad-hoc queries. If for example, if they're querying only the last month worth of data, there is no point in going after this 30 years’ worth of data and querying it. Drill has ability to understand that the query only requires one month worth of data, so at the planning time, Drill prunes out the entire partitions except this particular year worth of data. If a month worth of query is coming, all Drill has to do is a month worth of data. For example, if particular user is querying for a specific day, then I have to read only one day worth of data.
This is huge. We have seen customers coming all the way from minutes to literally seconds and sub-second response time with this kind of an architecture. The important thing to think about it, is, how do I organized my data? What is the right data layout so that I can help Drill to achieve performance. This is not something very unique to Drill, but this is a very standard way to achieve performance in large-scale data sets, and Drill, of course, can do that as well. This is partition pruning, again, a very important technique when you are trying to achieve performance on using Drill on huge amounts of data.
Continuing here, so we just talked about projection pruning, filter pushdown, partition pruning; Drill also has the ability to do cost-based optimization. Cost-based optimization ... Again, we are talking about thousand node clusters, hundred node clusters, so this is not a single node cost-based optimization. We are talking about optimizations which are in a distributed environment. The idea is to leverage the statistics of the underlying data sources to determine what is the right query plan.
For example, I might have to reorder certain joins. If I have an eight-table join query, the way you reorder the joins in the query is very important to the performance. The way most of the systems do, is using a cost-based optimization technique. Drill when querying Hive, can leverage the Hive statistics; when querying file system, it can leverage the statistics that are available with the file, such as file size, number of rows ... Whatever are the statistics that are available with underlying storage, Drill can leverage it. Cost-based optimization is a very, very important technique during the planning phase to achieve performance.
Another aspect is metadata caching; so, the combination of metadata caching and the partition pruning can really do magic. We have seen customer situations where, they are trying to query hundreds of millions of files using Drill. This could be Parquet files, mainly Parquet files from performance standpoint ... At the planning time, if you don't have metadata catching enabled, because Drill is a schema-less system, it has to go and read the footers all these files trying to plan the query, which again, could be a pretty time-consuming task.
The good thing with Drill is, if you don't have any idea about the data that you're querying, it allows you to do that; but if you know what the data is and what the metadata is, you have the ability to optimize it. This is one such example. If you know what metadata you're querying, you can go and enable this meta data caching. This is very easy to do; it is simply a SQL statement ... I believe it's called refresh metadata table command ... that you would execute and run this cache. What it allows you to do is, it caches the metadata rather than asking Drill to reach out to the underlying storage system and retrieve the metadata each time a query comes in.
This metadata is a distributed cache, so it is available to all the Drill nodes in the cluster. It is very fast, highly consistent metadata cache. Again, I would like to repeat that the caching and the partition pruning are the fundamental things that you want to make sure when you are targeting performance in Drill. This, in combination with a variety of the other things, but from a data layout perspective, these are the things to focus on.
There are obviously a lot more techniques. Drill is a RDBMS engine, at the core of it is a distributed query engine, so it does a variety of other optimizations, but based on the past customer installations and customer deployments who are successfully using Drill in production, these are some of the things to pay attention to.
Thank you for watching, we have another video on talking about the Drill query execution capabilities, I highly recommend watching it on mapr.com.
Stay ahead of the bleeding edge...get the best of Big Data in your inbox.