Lessons Learned from Apache Drill Production Deployments

It’s been a little over a year since Apache Drill first went GA. A great infographic from a few months back highlights the key product milestones for Drill and customer use cases during the last year. In the process of helping customers get Drill into production, the larger Apache Drill team has developed a number of best practices. One common customer question is how to determine the degree of parallelism used for the various operators in a query. The rest of this blog post addresses that question and the best practice that grew out of it.

Before we get started, be sure to watch the on-demand Drill webinar, where you’ll learn about best practices and guidelines for data layout, tuning queries for performance, security, and other topics.

In Drill, a query is broken at logical boundaries into major fragments, each of which is a collection of one or more minor fragments. A minor fragment executes a particular slice of the data for its major fragment. Each minor fragment consists of one or more operators, which are executed sequentially over that slice of data.

To find out how parallel a query’s execution was, look at the number of minor fragments in each major fragment. Let’s illustrate this with an example, using the information provided in the query profile.

[Image: fragment table from the Drill query profile]

In the above table, the displayed fragment ID is the major fragment ID (that is, the 00 in 00-xx-xx). Fragment 00 consists of only one minor fragment, but for fragments 01 and 02 the work is broken down into 42 and 4 minor fragments, respectively.
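
The counting step can be sketched in a few lines of Python. This is only an illustration: the function name and the input list are hypothetical, and it assumes minor fragment IDs have been copied out of the query profile as strings of the form MM-mm-xx (major fragment, minor fragment, placeholder).

```python
from collections import Counter


def minor_fragments_per_major(fragment_ids):
    """Count distinct minor fragments under each major fragment.

    Assumes each id looks like 'MM-mm-xx' (hypothetical input format).
    """
    seen = set()
    for fid in fragment_ids:
        major, minor = fid.split("-")[:2]
        seen.add((major, minor))  # de-duplicate repeated ids
    counts = Counter(major for major, _ in seen)
    return dict(sorted(counts.items()))


# Made-up ids that reproduce the counts discussed in the text.
ids = (
    ["00-00-xx"]
    + [f"01-{m:02d}-xx" for m in range(42)]
    + [f"02-{m:02d}-xx" for m in range(4)]
)
parallelism = minor_fragments_per_major(ids)
print(parallelism)  # {'00': 1, '01': 42, '02': 4}
```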

To find out how parallel an executed operator in a physical plan was, you need to figure out which major fragment the operator belongs to in the plan.

Let’s say there is a hash join in the plan, and you want to know how parallel its execution was.

01-xx-03 - HASH_JOIN

[Image: Drill query profile]

The hash join’s number (01-xx-03) shows that it is located in major fragment 01. Because major fragment 01 ran as 42 minor fragments, the hash join was 42-way parallel.
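
The same lookup can be written as a short Python sketch. The helper name and the dictionary of minor fragment counts are hypothetical; the only assumption about Drill is that an operator ID such as 01-xx-03 starts with its major fragment ID.

```python
def operator_parallelism(operator_id, minor_counts):
    """Return how many minor fragments ran the given operator.

    operator_id: string such as '01-xx-03' (major fragment id first).
    minor_counts: hypothetical mapping of major fragment id -> minor fragment count.
    """
    major = operator_id.split("-")[0]
    return minor_counts[major]


# Counts taken from the example in the text.
minor_counts = {"00": 1, "01": 42, "02": 4}
width = operator_parallelism("01-xx-03", minor_counts)
print(width)  # 42
```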

One caveat is that some major fragments can themselves execute in parallel with one another, so the maximum parallelism for the whole query is very hard to compute. It can range from the lowest parallelism of any single major fragment to the sum of the parallelism of all major fragments, depending on which major fragments execute in parallel.
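
For the example profile, the two ends of that range work out as follows. This is a rough sketch with a hypothetical helper; it only computes the bounds described above and says nothing about which major fragments actually overlapped.

```python
def parallelism_bounds(minor_counts):
    """Return (lowest, highest) possible query-wide parallelism.

    lowest: the smallest minor fragment count of any major fragment.
    highest: the sum of all counts, i.e. every major fragment running at once.
    """
    widths = list(minor_counts.values())
    return min(widths), sum(widths)


# Counts taken from the example in the text.
low, high = parallelism_bounds({"00": 1, "01": 42, "02": 4})
print(low, high)  # 1 47
```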

If you’re interested in more best practices for taking Apache Drill into production, be sure to watch the on-demand webinar on July 27th, which covers best practices and guidelines for data layout, tuning queries for performance, security, and other topics.

This blog post was published July 20, 2016.
