Top 10 Reasons for Using Apache Drill - Now as Part of MapR Distribution Including Hadoop


Since Apache Drill 0.4 was released in August for experimentation on the MapR Distribution, there has been tremendous interest among customers and partners in the promise and potential of Drill to unlock the new types of data in their Hadoop/NoSQL systems for interactive analysis throughout the organization.

Today, we are excited to announce that Apache Drill 0.5 (recently released by the Apache community) is available and supported as part of the MapR Distribution, alongside the 15+ mainstream ecosystem projects already included.

Users can download Apache Drill on the MapR site.

A sandbox environment with Apache Drill pre-configured on MapR, along with sample datasets, is available for experimenting with Drill.

There are several projects for SQL-on-Hadoop. What makes Drill different? What are the top 10 reasons why Drill is a valuable and innovative technology in your tool belt for interactive data exploration on big data?


1. Quick and easy ramp up

First and foremost, it takes just minutes to start working with Apache Drill. Install it on a local Windows or Mac machine and run queries right away; you don't even need Hadoop.

Here are three simple steps to do your first query with Drill.

// Install, launch SQLLine CLI and query a JSON file on local file system
$ tar -xvf apache-drill-0.5.0-incubating.tar                   
$ apache-drill-0.5.0-incubating/bin/sqlline -u jdbc:drill:zk=local
0: jdbc:drill:zk=local> SELECT * FROM cp.`employee.json` limit 5;
| employee_id | full_name       | first_name | last_name | position_id | position_title         | store_id | department_id |
|-------------|-----------------|------------|-----------|-------------|------------------------|----------|---------------|
| 1           | Sheri Nowmer    | Sheri      | Nowmer    | 1           | President              | 0        | 1             |
| 2           | Derrick Whelply | Derrick    | Whelply   | 2           | VP Country Manager     | 0        | 1             |
| 4           | Michael Spence  | Michael    | Spence    | 2           | VP Country Manager     | 0        | 1             |
| 5           | Maya Gutierrez  | Maya       | Gutierrez | 2           | VP Country Manager     | 0        | 1             |
| 6           | Roberta Damstra | Roberta    | Damstra   | 3           | VP Information Systems | 0        | 2             |

2. Supports ANSI SQL – as you know it

Apache Drill is compatible with the ANSI SQL standard. This means that users don't need to learn a new query language or the nuances of a "SQL-like" dialect to work with Drill or to migrate existing workloads to it.

Drill supports SQL 2003 syntax and provides all the key SQL data types (such as DATE, INTERVAL, TIMESTAMP, VARCHAR, DECIMAL) and query constructs (such as correlated sub-queries, joins in WHERE clause) to provide a smooth and familiar analytics experience.

Here is an example of a TPC-H standard query that runs in Drill “as is”.

# TPC-H query 4
SELECT o.o_orderpriority, count(*) AS order_count
FROM orders o
WHERE o.o_orderdate >= date '1996-10-01'
      AND o.o_orderdate < date '1996-10-01' + interval '3' month
      AND EXISTS (
                 SELECT * FROM lineitem l
                 WHERE l.l_orderkey = o.o_orderkey
                 AND l.l_commitdate < l.l_receiptdate
      )
GROUP BY o.o_orderpriority
ORDER BY o.o_orderpriority;

3. Works with your BI tools

Apache Drill integrates with BI/SQL tools such as Tableau, MicroStrategy, Pentaho and Jaspersoft using JDBC/ODBC drivers. This means that users can keep the BI/analytics tools they are deeply familiar with and point them at the more raw, more up-to-date and new types of data available in Hadoop/NoSQL stores, at significantly lower cost and with faster time to market.

Here is a quick look at the Drill ODBC driver DSN setup UI, the Drill Explorer data exploration environment for browsing Drill data and creating views, and a BI visualization in Tableau using Drill as a data source.

MapR Drill ODBC Driver DSN Setup

Drill Explorer data exploration environment

Tableau example

4. Supports self-describing data with no ETL

Self-describing data is data whose schema is embedded in the data itself. File formats such as Parquet, JSON, Protobuf, XML and Avro, as well as NoSQL databases, are all examples of self-describing data. Some of these formats are also dynamic and complex: every record can have its own set of columns/attributes, and each column can be semi-structured or nested.

Think about a JSON document with multiple levels of nesting and optional/repeated elements at each level, or a wide HBase table with hundreds to thousands of columns and varying schema across rows. What about third-party data that you want to leverage in BI/analytics but whose schema evolution you have no control over?

Drill supports querying self-describing data without defining or managing any centralized schema definitions in the Hive metastore. The schema is discovered dynamically, on the fly, as queries come in.

Dynamic schema discovery with no upfront modeling or schema management means that companies can eliminate the weeks or months of ETL delay before data becomes available for exploration. Users get more up-to-date data and can make informed, timely decisions.

Here are a few quick examples on querying files and directories using Drill.

//clicks.json is a file and logs is a partitioned directory by year & month on MapR file system

0: jdbc:drill:> select * from `clicks/clicks.json` limit 2;
0: jdbc:drill:> select cust_id, dir1 month_no, count(*) month_count from logs
where dir0=2014 group by cust_id, dir1 order by cust_id, month_no limit 10;

5. Handles Complex Data Types

Drill comes with a flexible JSON-like data model to natively query and process complex, multi-structured data. The data doesn't need to be flattened or transformed at design time or at runtime, which keeps queries fast. Drill provides intuitive extensions to SQL for working with nested data using MAP and ARRAY data types.

Here is an example showing how Drill queries a JSON file and accesses its nested map and array fields.

// prod_id is an array field in clicks.json file  

select * from (select t.trans_id, t.trans_info.prod_id[0] as prodid,
t.trans_info.purch_flag as purchased
from `clicks/clicks.json` t) sq
where sq.prodid between 700 and 750 and sq.purchased='true' order by sq.prodid;

6. Plays Well with Hive

Apache Drill lets you reuse the investments made in existing Hive deployments. You can query Hive tables and access 100+ Hive input/output formats (including custom SerDes) with no rework. Drill complements Hive deployments by offering low-latency queries.

Here is what a sample Hive storage plugin configuration looks like in Drill, followed by a query on a Hive table.

//Storage plugin configuration for Hive

 "type": "hive",
 "enabled": true,
 "configProps": {
   "hive.metastore.uris": "thrift://localhost:9083",
   "hive.metastore.sasl.enabled": "false"

//Query on a Hive table 'orders'
0: jdbc:drill:> select `month`, state, sum(order_total) as sales from hive.orders
group by `month`, state order by 3 desc limit 5;

7. Works with Hadoop and Beyond

Drill is designed with extensibility in mind. It provides out-of-the-box connectivity to file systems (local or distributed file systems such as S3, HDFS, MapR XD), HBase and Hive. However, the storage plugin interface can be extended to other NoSQL stores (such as Couchbase, Elasticsearch, MongoDB), to relational databases (such as Postgres or MySQL), or to your own custom store. Drill can also combine data from all of these sources in a single query, on the fly, without any central metadata definitions.

Here is an example Drill query that combines data from Hive, HBase and JSON.

// Hive table 'orders', HBase view 'custview' and JSON file 'clicks.json' are joined together

select custview.membership, sum(orders.order_total)
as sales from hive.orders, custview, dfs.`/mapr/` c
where orders.cust_id=custview.cust_id and orders.cust_id=c.user_info.cust_id
group by custview.membership order by 2;

8. Ease of UDFs

Drill exposes an easy and high performance Java API to build custom functions (UDFs and UDAFs) and extend SQL for the data and the business logic that is specific to your organization. If you have already built UDFs in Hive, you can reuse them with Drill with no modifications. Refer to Developing Custom Functions for more information.
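As a sketch of what such a custom function might compute, here is the core logic of a simple value-masking function in plain Java. The class and method names here are hypothetical; in a real Drill UDF this logic would live inside a class implementing Drill's `DrillSimpleFunc` interface, with the inputs and output declared via annotated holder fields, as described in the Drill documentation.

```java
// Hypothetical example: core logic of a "mask" custom function that a
// Drill UDF could wrap, e.g. SELECT mask(ssn, '*', 4) FROM customers;
public class MaskFunc {

    // Replace all but the last `keep` characters of s with maskChar.
    public static String mask(String s, char maskChar, int keep) {
        if (s == null || s.length() <= keep) {
            return s;  // nothing to mask
        }
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length() - keep; i++) {
            sb.append(maskChar);
        }
        sb.append(s.substring(s.length() - keep));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(mask("123-45-6789", '*', 4));  // prints "*******6789"
    }
}
```

The business logic stays ordinary Java; registering it with Drill is a matter of packaging and the `DrillSimpleFunc` boilerplate covered in the Developing Custom Functions documentation.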

9. Provides low latency queries

Drill is built from the ground up for low-latency queries on large datasets. Drill doesn't use MapReduce; instead it comes with a distributed MPP SQL engine that executes queries in parallel across a cluster. Any Drillbit (the core service in Drill) can receive requests from users. Drill's sophisticated optimizer leverages rule-based and cost-based techniques, the optimization capabilities of the underlying data sources, and data locality to determine the most efficient query plan, and then distributes execution across the nodes of the cluster. Drill also provides a columnar, vectorized execution engine for high memory and CPU efficiency and fast performance on a wide variety of analytic queries.

10. Supports large datasets

Drill is built to scale to big data needs and is not restricted by the memory available on the cluster nodes. For performance, Drill executes queries in memory when possible, using an optimistic, pipelined model, and spills to disk only when the working dataset doesn't fit in memory.

For more examples on how to use Drill, download Apache Drill sandbox and try out the sandbox tutorial. Refer to Apache Drill web site for additional information.

We also have an on-demand webinar that drills into details of Drill.

Congratulations and thanks to the rapidly growing Drill community on this groundbreaking 0.5 release!

This blog post was published September 16, 2014.