Apache Drill 301: Discover Schema on the Fly

Contributed by

7 min read

This is the third and final entry in our three-part series focused on building basic skill sets for use in data analysis.

The series is aimed at those who have some familiarity with using SQL to query data but limited or no experience with Apache Drill. The following information can be used in corporate training programs to help users better understand how to work with Drill. Even those who will not be directly querying data can benefit from this material, as it can help them to work more productively with data analysis team members.

In Part 1, we walked through a basic process of querying data that enables any user with basic SQL skills to work with Drill. In Part 2, we focused on more complex queries and working with semi-structured data. In Part 3, below, we look at how Drill discovers the schema of data on the fly.

The Drill Sandbox

Even if Drill is already deployed on your network, you may find that users (and IT staff) are more comfortable if beginners learn Drill in a sandbox environment. The easiest way to do this is with MapR Sandbox with Apache Drill, a free and fully functional single-node cluster.

To access the materials and data to follow along with the steps in this blog, register (if you have not already done so for part 1 of this series) for the Drill Essentials course and download the content from Course Materials sheet. You will also find a link to download the Drill Sandbox in there.

Synopsis of Parts 1 and 2

We are, for the purposes of this walkthrough, employees of The Big Office Supply Company, a large retail business.

In Part 1, we supported our marketing department in a new sales promotion by analyzing data to help them decide when and where to focus their promotional efforts. We determined the top grossing month of sales, the rank of countries based on their gross sales in that month, and the top 10 products based on total sales volume.

In Part 2, The Big Office Supply Company’s data analysts decided that the marketing team could benefit from more in-depth information. They worked with search, click, and log data collected from our company website to determine what products people showed the most interest in, who looked at but did not purchase a product, and who looked at and did purchase a product.

Those who looked and did not buy will receive a special discounted offer for the product they were interested in. Those who did make a purchase will receive an offer to upgrade or purchase consumables for their products.

Our data analysts have done a great job using Drill to easily query a variety of data sources to help us target our promotions to get a much higher rate of return for our marketing dollars. Yet we have just begun to learn how Drill can be used to access our company’s data.

Discovering Data Schema on the Fly

SQL on hadoop

In addition to writing SQL queries with the command line, Drill provides full JDBC/ODBC access to standard business intelligence (BI) tools such as Tableau, Microstrategy, and SAS. The ODBC driver can be used to access a graphical user interface (GUI) called Drill Explorer.

sql on hadoop 2

Drill Explorer can be used by our data analysts to examine and understand the metadata available in the schema-less and semi-structured data before they query it with visualization tools.

sql on hadoop c

When we first open Drill Explorer, we see all of the data sources that are available to Drill. This includes our structured HBase table, as well as our unstructured JSON and parquet data, and any other data that Drill can access.

We can give Drill access to multiple data sources; structured, unstructured, known, or even unknown.

As noted in Part 2 of the series, data comes in different formats:

  • Structured data is neatly organized in tables, further broken down into columns and rows, in a relational database.
  • Unstructured data are files such as photos, graphics, videos, webpages, and PDFs that are difficult to organize in a structured format.
  • Semi-structured data is somewhere in between. JSON, Parquet fall under this category. Metadata tags are typically used to identify the data, but it isn’t organized in a rigid structure. This a common format for big data.

Some data query tools can’t work with semi-structured data, as they rely on the structural organization provided by a schema to locate information. To query semi-structured data, analysts would need to enlist a data engineer to help them write a MapReduce program or perform a lengthy ETL process to flatten the data and create a centralized schema for it.

Drill semi-structured data 4

Drill does not need to know the schema—how the data is organized—or even the specific type of data it is working with. Our analysts can freely explore all of these different data sources without waiting for a data engineer to prepare the data for analysis, or even typing a single line of SQL.

They can quickly and easily view the structure, size, and contents of each of these data sources.

semi-structured data

Simply selecting any one of these data sources causes the associated data to appear in the browser window.

semi-structured data with drill

If they find something of interest, Drill will provide the SQL code for that section of data. And as they change what they look at in the browser, the SQL is automatically updated. They can also edit the SQL as needed

  • Switching to the SQL tab allows our analysts to view the SQL query associated with the selected data.
  • They can modify the SQL and see how it affects the displayed data.

drill and semi-structured data

With Drill Explorer, our data analysts can get a better understanding of our data through quick exploration. Drill connects to our data using the ODBC driver. When our analysts find interesting data, they can simply create a view and then easily visualize their areas of interest with the BI tools they already use.

Since Drill connects to our data using the ODBC driver, when our analysts find interesting data, they can simply create a view and then easily visualize their areas of interest with the BI tools they already use.

This blog post was published May 26, 2015.