11 min read
In this week’s Whiteboard Walkthrough, Neeraja Rentachintala, Senior Director of Product Management at MapR Technologies, gives an overview of how open source Apache Drill achieves low latency for interactive SQL queries carried out on large datasets. With Drill, you can use familiar ANSI SQL BI tools, such as Tableau or MicroStrategy, plus do exploration directly on big data.
For additional material:
Here's the unedited transcription:
Hello, my name is Neeraja Rentachintala, and I’m part of the product management team at MapR. Today, I will be briefly talking about how Apache Drill achieves low latency performance on large-scale datasets.
For those of you who are not familiar with it, Apache Drill is an open source, interactive SQL-on-Hadoop query engine, with which you can do data exploration as well as BI and ad-hoc queries directly on top of big data, using familiar ANSI SQL tools such as Tableau and MicroStrategy.
In the next few minutes, I will be talking about some of the core architectural elements, as well as features, which enable Drill to provide this interactive performance experience.
At the heart of it, Drill is a distributed SQL query engine. The core daemon in Drill is called "Drillbit." This is a service that you would install on all the data nodes in the Hadoop cluster. It's not a requirement to install Drill on all the data nodes. You can do certainly a partial set of nodes, but installing on all the nodes gives Drill the ability to achieve data locality at the query execution time.
Data locality is the ability to push down the processing to the node where the data lives, rather than trying to bring the data over the network at the query execution time.
So let's take a look at the query execution flow. The SQL where it comes in from a client - this could be a JDBC client, ODBC, CLI, or a REST endpoint. The query comes in, and the query is accepted by a Drillbit. The Drillbit that accepts the request acts as a foreman or coordinator for that specific request. As a client application, you could directly submit to a Drillbit, or you could talk to your ZooKeeper quorum, which in turn would draw the request to a specific Drillbit.
Once a Drillbit receives the request, it passes the query and then it determines what is the best way to execute this particular query, or what is the most optimal way to execute the query.
So Drill allows us to do a variety of rule-based as well as cost-based optimizations, in addition to being aware of the data locality during the query planning.
Once the best query plan is determined, Drill splits the query plan into a number of pieces called fragments, query plan fragments. So the Drillbit, the coordinator Drillbit, talks to the ZooKeeper, determines what are the other Drillbits that are available in the cluster. Then it gets the location of the data. By combining that information, it determines what are other Drillbits that can handle this particular list of query plan fragments. Then it distributes the work to the other Drillbits in the cluster. Each Drillbit does its own processing of the query plan fragment. They return the results back to the original Drillbit, and the Drillbit returns the results to the client.
So during the execution, each of the Drillbits would be interacting with the underlying storage systems. So, this could be files. (As I have written here, this is DFS, which represents a distributed file system). It could be HBase, it could be HIive, it could be MongoDB, it could be MapR Database, it could be any other storage system Drill is configured to work with.
An important thing to realize here is that each of these Drillbits are the same. So this is a completely scaleout MPP architecture. So the request from the client can go to any Drillbit; that is how Drill scales. There is no master-slave architecture. So today, I could deploy Drill on one node that is sufficient for my needs. I can grow it to ten nodes tomorrow. I can grow it to one hundred nodes, a thousand nodes. Depending on the amount of data that you want to process, depending on the number of users you want to support, depending on the performance that you want to achieve, that is the goals that you want to meet. You can scale the Drill cluster, you can shrink the Drill cluster without hitting any bottlenecks, either on the networking side or on the processing side. So this is a completely scaleout MPP query engine. So this is the foundational element during the Drill execution which allows Drill to provide performance.
In addition to the core distributed execution, there are a variety of other execution and architectural elements which allow Drill to provide this performance. I will briefly describe each of these characteristics in detail.
The first aspect here is columnar execution. Drill is columnar in storage as well as in memory. So what does that mean? Drill is optimized for columnar formats such as Parquet. So if I am querying Parquet data files with Drill, it saves on disk I/O by reading only the specific columns that are required to satisfy the query. It's not just that. Once the data is read into the memory, Drill continues to keep the data in a columnar format. So the internal data model for Drill is an in-memory, columnar, hierarchical data model. So the benefit is with the columnar execution, Drill doesn't need to materialize the data into a new format. So it is doing joins, aggregation, sorting, all the SQL operations directly on the columnar data, without having to change the structure. So this not only just gives performance, it also saves the memory - the memory footprint that Drill has to occupy at the query execution time. So columnar execution is a very foundational element in achieving performance.
The next aspect, which goes very much hand-in-hand with the columnar execution is vectorized processing. This is actually a pretty common technique in database systems, in the traditional MPP database systems. But this is fairly new in the context of Hadoop. The idea of vectorized processing is, rather than operating on a single-value, from a single record at a given time, Drill allows the CPU to operate on what is called "vectors." These are also called record batches. The vector is basically an area of values from a set of records in a table. Think about doing a null check on a billion records. The vectorized processing, the technical premise for it, is to make sure that you can really take advantage of the modern CPU designs and deeply-pipelined CPU architectures. Keeping these CPU pipelines busy, to achieve the performance as well as the CPU efficiencies, is not something that you would traditionally get with traditional RDBMS kind of systems. That’s because of the core complexity, which is where the performance benefits of Drill really shine.
Some of you might have heard about a recently announced project called Apache Arrow. The in-memory data model for Drill, as I was mentioning, is a columnar hierarchical data model. Now this is actually spun out as a separate project and contributed to the community as a new open source project called Apache Arrow. The goal of that project is to really standardize the data representation across a variety of storage systems as well as processing frameworks, such a Kudu, Impala, Spark, etc. So this is kind of the standard for the big data analytics.
Moving on, the next aspect here is optimistic/in-memory execution. First of all, one thing to realize is that Drill does not use MapReduce, and Drill does not use Spark. This is an engine that is built from the ground up for performance on complex as well as hierarchical datasets. What does it mean by "optimistic"? This means that Drill assumes that the chance of getting a failure, such as hardware failures or node failures, is uncommon during the short execution of a query. It tries to do as much execution as possible in-memory without writing anything to disk for checkpointing and recovery purposes. The idea is that the model that is used from a technical term is called "pipeline execution." Rather than scheduling one task at a time, Drill, in a pipeline fashion, schedules all the tasks at once. And these tasks are getting executed, data is getting moved between the various pipelines, trying to do everything in in-memory, and writing to disk only if there isn't sufficient memory. This again is a very, very core aspect of achieving seconds performance, rather than a minutes kind of performance that you would see with traditional batch-based paradigms.
So columnar execution, vectorized processing, optimistic in-memory execution - they all compliment the distributed query execution capabilities. There are a lot more. One example is runtime code compilation. Again, runtime code generation is very efficient, compared to the interpreted kind of query execution. Drill allows you to actually generate very, very highly efficient custom code for every query and for every operator dynamically. This again is a very efficient way to execute the query. There is a lot more, and you can certainly read the documentation for more information, but I will say that the distributed execution, and the combination of columnar execution, vectorized processing, and in-memory execution, form the foundation for Drill execution around performance.
There is a lot that Drill does around optimization in order to achieve performance. We will cover that in a separate video. Please check out mapr.com. Thank you for watching!
Stay ahead of the bleeding edge...get the best of Big Data in your inbox.