Apache Drill 1.6 and the MapR Data Platform - The Emergence of a New Generation Stack for “JSON-Enabled” Big Data Analytics

Contributed by

12 min read

Today we are very excited to announce the release of Apache Drill 1.6 on the MapR Data Platform. Drill has been on the path of rapid iterative releases for one and a half years now, gathering amazing traction with customers and OSS community users on the way. The community has delivered 13 releases since the beta launch in September 2014, and they will continue to move forward at a fast pace.

Here are the highlights of the Drill 1.6 release.

  • New storage plugin for MapR Database (document database)
  • Enhanced stability and scale with a new and improved memory allocator
  • Enhanced query performance
    • Query planning speedups via early application of partition pruning
    • Faster query planning on Hive table queries
    • Optimized reading of Parquet metadata cache
    • Row count-based pruning to speed up Limit N queries
    • Optimized Tableau experience with limit 0 performance improvements
  • End-to-end security from BI tool to MapR/Drill with client impersonation (this is supported only with JDBC; a new ODBC driver with this feature will follow shortly).
  • New SQL Window function frame syntax
    • RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    • ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    • RANGE BETWEEN CURRENT ROW AND CURRENT ROW
  • JDK 1.8 support
  • Many more bug fixes and enhancements

As you can see from the above list, the Drill 1.6 release is another critical stepping stone in driving ANSI SQL support and performance/scale to the next phase.

At MapR, we are also very excited about this release due to the unique value it brings to you by combining Drill with the recently announced MapR Database document database capabilities as part of the MapR 5.1 release. A new MapR Database document database format plugin is introduced in Drill 1.6, which enables you to query JSON tables in MapR Database directly. This means that no ETL and no transformation is required at any layer. This combination gives you end-to-end flexibility when it comes to JSON in order to store, update, and query the data in its natural form and fidelity at Hadoop scale in global environments using familiar ANSI SQL capabilities. The result is that you now have operational analytics capabilities, and most importantly, are able to quickly adapt to changes to data models in the underlying applications. This agility means you’ll realize much faster time to value.

Drill is designed with JSON and schemaless/semi-structured data at its core, and it already is able to query and manipulate raw JSON files in MapR XD. This is now extended to MapR Database for operational and fast-changing data. We will be extending the power of Drill very soon to MapR Event Store, a global publish-subscribe messaging framework, so SQL can be used to query real-time streaming events. Essentially, Drill becomes the unified, high- performance, and flexible SQL access layer across files, tables and streams in the MapR Data Platform.

You can try out the Drill+MapR Database JSON capabilities for yourself by downloading the MapR Data Platform community edition from https://mapr.com/download/

Let me demonstrate a few Drill queries on MapR Database JSON tables. For these examples, I am using a Yelp dataset.

The first step is to ingest the JSON data into MapR Database, which you can do using the OJAI API or using an out-of-box importJSON utility that comes with MapR.

  • Using OJAI API

    • Bulk load from file
Table table = MapRDB.getTable(“/maprdb/json/yelp");
RecordStream jsonStream = Json.newDocumentStream(fs, "/business.json");
table.insertOrReplace(jsonStream);
  • Single record inserts from apps
Table table = MapRDB.createTable(“/my_app/user_profiles”);
Document record = MapRDB.newDocument()
                 .set("firstName", "John")
                 .set("lastName", "Doe")
                 .set("age", 50);
table.insert("jdoe", record);

  • Through ImportJSON utility
mapr importJSON -src /maprdb/json_files/yelp/business -dst /maprdb/json/yelp/business  -mapreduce false -bulkload false -idfield business_id

From the MapR administration console, you can see the MapR Database JSON tables. MapR Database now has multiple data models: binary-like HBase, and JSON-like MongoDB, and you can configure the “type” on a per table level.

Let’s quickly ensure that the data is indeed loaded into DB correctly using the MapR Database shell. This is not a prerequisite to query with Drill, but I am using it to simply demonstrate different ways of interacting with MapR Database.

**maprdb root:> find '/maprdb/json/yelp/business' --limit 1**
{"_id":"--1emggGHgoG6ipd_RMb-g","business_id":"--1emggGHgoG6ipd_RMb-g","full_address":"3280 S Decatur Blvd\nWestside\nLas Vegas, NV 89102","hours":{},"open":true,"categories":["Food","Convenience Stores"],"city":"Las Vegas","review_count":4,"name":"Sinclair","neighborhoods":["Westside"],"longitude":-115.2072382,"state":"NV","stars":4,"latitude":36.1305306,"attributes":{"Parking":{"garage":false,"street":false,"validated":false,"lot":true,"valet":false},"Accepts Credit Cards":true,"Price Range":1},"type":"business"}

1 document(s) found.

**maprdb root:> findbyid /maprdb/json/yelp/business --id "--1emggGHgoG6ipd_RMb-g"**
{"_id":"--1emggGHgoG6ipd_RMb-g","business_id":"--1emggGHgoG6ipd_RMb-g","full_address":"3280 S Decatur Blvd\nWestside\nLas Vegas, NV 89102","hours":{},"open":true,"categories":["Food","Convenience Stores"],"city":"Las Vegas","review_count":4,"name":"Sinclair","neighborhoods":["Westside"],"longitude":-115.2072382,"state":"NV","stars":4,"latitude":36.1305306,"attributes":{"Parking":{"garage":false,"street":false,"validated":false,"lot":true,"valet":false},"Accepts Credit Cards":true,"Price Range":1},"type":"business"}

Now the data looks like it’s loaded in the DB, so let’s query this through Drill.

The first step to query DB data from Drill is to configure a storage plugin from the Drill web UI.

Below is the sample storage plugin configuration. Note that the JSON tables in MapR Database are just another storage format on the MapR Distributed File and Object Store, so you simply create a format plugin in Drill as shown below.

"yelp_maprdb": {
     "location": "/maprdb/json/yelp",
     "writable": true,
     "defaultInputFormat": "maprdb"
   },
.

Let’s check the structure of the business table using a simple Drill query.

**0: jdbc:drill:drillbit=10.10.103.32> SELECT * FROM mfs.yelp_maprdb.business LIMIT 1;**
+-----+------------+-------------+------------+------+--------------+-------+----------+-----------+------+---------------+------+--------------+-------+-------+------+
| _id | attributes | business_id | categories | city | full_address | hours | latitude | longitude | name | neighborhoods | open | review_count | stars | state | type |
+-----+------------+-------------+------------+------+--------------+-------+----------+-----------+------+---------------+------+--------------+-------+-------+------+
**|** --1emggGHgoG6ipd_RMb-g **|** {"Accepts Credit Cards":true,"Parking":{"garage":false,"lot":true,"street":false,"valet":false,"validated":false},"Price Range":1.0,"Ambience":{},"Good For":{},"Music":{},"Hair Types Specialized In":{},"Payment Types":{},"Dietary Restrictions":{}} **|** --1emggGHgoG6ipd_RMb-g **|** ["Food","Convenience Stores"] **|** Las Vegas **|** 3280 S Decatur Blvd Westside
Las Vegas, NV 89102 **|** {"Friday":{},"Monday":{},"Saturday":{},"Sunday":{},"Thursday":{},"Tuesday":{},"Wednesday":{}} **|** 36.1305306 **|** -115.2072382 **|** Sinclair **|** ["Westside"] **|** true **|** 4.0 **|** 4.0 **|** NV **|** business **|**
**+-----+------------+-------------+------------+------+--------------+-------+----------+-----------+------+---------------+------+--------------+-------+-------+------+**

Let’s see how many records there are in the business table.

**0: jdbc:drill:drillbit=10.10.103.32> SELECT count(*) FROM mfs.yelp_maprdb.business;**
+---------+
| EXPR$0  |
+---------+
**|** 42153  ** |**
**+---------+**

Drill allows you to use its full ANSI SQL power directly on MapR Database JSON tables, including all operators and functions.

Here is an example of using a SQL window function on the JSON tables.

**0: jdbc:drill:drillbit=10.10.103.32> WITH X**
**. . . . . . . . . . . . . . . . . .> AS**
**. . . . . . . . . . . . . . . . . .> (SELECT name, city, review_count,**
**. . . . . . . . . . . . . . . . . .> RANK() OVER (PARTITION BY city ORDER BY review_count DESC)**
**                                    AS review_rank**
**. . . . . . . . . . . . . . . . . .> FROM mfs.yelp_maprdb.business)**
**. . . . . . . . . . . . . . . . . .> SELECT X.name, X.city, X.review_count**
**. . . . . . . . . . . . . . . . . .> FROM X**
**. . . . . . . . . . . . . . . . . .> WHERE X.review_rank =1 ORDER BY review_count DESC LIMIT 10;**
+-------------------------------------------+-------------+---------------+
|                   name                                     |    city     | review_count  |
+-------------------------------------------+-------------+---------------+
**|** Mon Ami Gabi                                          **|** Las Vegas   **|** 4084.0       ** |**
**|** Studio B                                                    **|** Henderson  **|** 1336.0       ** | **
**|** Phoenix Sky Harbor International Airport **|** Phoenix       **|** 1325.0       ** |**
**|** Four Peaks Brewing Co                            **|** Tempe         **|** 1110.0       ** |**
**|** The Mission                                               **|** Scottsdale   **|** 783.0        ** |**
**|** Joe's Farm Grill                                         **|** Gilbert          **|** 770.0        ** |**
**|** The Old Fashioned                                    **|** Madison       **|** 619.0        ** |**
**|** Cornish Pasty Company                            **|** Mesa           **|** 578.0        ** |**
**|** SanTan Brewing Company                        **|** Chandler      **|** 469.0        ** |**
**|** Yard House                                                **|** Glendale      **|** 321.0        ** |**
**+-------------------------------------------+-------------+---------------+**

You can query nested JSON data directly from DB without flattening it, and you can use Drill’s power of complex data capabilities and all ANSI SQL extensions.

**0: jdbc:drill:drillbit=10.10.103.32> SELECT name, stars, b.hours.Friday friday, categories**
**. . . . . . . . . . . . . . . . . .>   FROM mfs.yelp_maprdb.business b**
**. . . . . . . . . . . . . . . . . .>   WHERE b.hours.Friday.`open` < '22:00' AND**
**. . . . . . . . . . . . . . . . . .>                  b.hours.Friday.`close` > '22:00' AND**
**. . . . . . . . . . . . . . . . . .>                  city = 'Las Vegas'**
**. . . . . . . . . . . . . . . . . .>   ORDER BY stars DESC**
**. . . . . . . . . . . . . . . . . .>   limit 5;**
+------------------------------------+--------+-----------------------------------+------------------------------------------------------------------------------------------+
|                name                | stars  |              friday               |                                        categories                                        |
+------------------------------------+--------+-----------------------------------+------------------------------------------------------------------------------------------+
**|** Peachy Keen Unions By Angie Kelly **|** 5.0   **|** {"close":"23:30","open":"06:00"} **|** ["Officiants","Wedding Planning","Event Planning & Services"]                           ** |**
**|** Concierge Makeup                  **|** 5.0   **|** {"close":"23:30","open":"00:30"} **|** ["Makeup Artists","Beauty & Spas"]                                                      ** |**
**|** Hermès Encore                     **|** 5.0   **|** {"close":"23:00","open":"10:00"} **|** ["Shopping"]                                                                            ** |**
**|** Hague Quality Water               **|** 5.0   **|** {"close":"23:00","open":"06:00"} **|** ["Plumbing","Home Services","Contractors"]                                              ** |**
**|** The Sci-Fi Center                 **|** 5.0   **|** {"close":"23:00","open":"11:00"} **|** ["Comic Books","Arts & Entertainment","Cinema","Shopping","Books, Mags, Music & Video"] ** |**
**+---------------------------****---------+--------+-----------------------------------+------------------------------------------------------------------------------------------+**

A few of my earlier blogs covered how Drill works with JSON, and all these queries can be now tried on MapR Database directly. Drill does a variety of optimizations (such as projection pushdown, filter pushdown) to make sure the queries are faster. This is a significant performance benefit of storing data in MapR Database compared to files.

Here is a quick example on how pushdown works. Let’s take the same nested data query that has a number of filter predicates.

**SELECT** **name, stars, b.hours.Friday friday, categories** **FROM mfs.yelp_maprdb.business b**
**WHERE** **b.hours.Friday.`open` < '22:00' AND b.hours.Friday.`close` > '22:00' AND city = 'Las Vegas'** **ORDER BY stars DESC LIMIT  5**

To see how Drill optimizes the query, you can look into the EXPLAIN PLAN. As highlighted below, Drill does projection pushdown and retries only name, starts, hours, and categories columns from DB rather than all columns. It also does filter pushdown by passing all the filters to the underlying DB, rather than bringing it to Drill and processing it.

There are many additional exciting features in Drill 1.6. Download the MapR release and try it out!

How to get started with Drill:

For full documentation, please refer to http://drill.apache.org/docs. Additional resources can be found at https://mapr.com/apachedrill

If you have any additional questions about Drill 1.6, please ask them in the comments section below.


This blog post was published April 06, 2016.
Categories

50,000+ of the smartest have already joined!

Stay ahead of the bleeding edge...get the best of Big Data in your inbox.


Get our latest posts in your inbox

Subscribe Now