This is the second in our three-part series focused on building basic skill sets for use in data analysis. The material is intended for those who have no prior, or very limited, experience with Apache Drill, but do have some familiarity with running SQL queries.
The following information can be utilized in corporate training programs, or provided as handouts to users, to help them better understand the possibilities provided by Drill. In Part 1, we walked through a basic process of querying data that is accessible to any user with basic SQL skills. In Part 2, we break down the steps involved in running a more complex query, working with semi-structured data.
Even if users are not expected to perform more sophisticated queries independently, this material will help them to understand the process, spark ideas about how data can be used, and enable them to have more productive conversations with their data analysis team members.
The Drill Sandbox
Even if Drill is already deployed on your network, you may find that users (and IT staff) are more comfortable if beginners learn Drill in a sandbox environment. The easiest way to do this is with the MapR Sandbox with Apache Drill, a free and fully functional single-node cluster.
To access the materials and data needed to follow along with the steps in this blog, register for the Drill Essentials course (if you have not already done so for Part 1 of this series) and download the content from the Course Materials sheet. You will also find a link to download the Drill Sandbox there.
Synopsis of Part 1: Data Analysis for (Almost) Everyone
We are all, for the purposes of this walkthrough, employees of The Big Office Supply Company, a large retail business. Our marketing department wants to begin a new sales promotion. In Part 1, we analyzed data to help them decide when and where to focus their promotional efforts. We determined the top-grossing month of sales, the ranking of countries by gross sales in that month, and the top 10 products by total sales volume.
Digging Deeper with Drill
Our marketing department is happy to know the optimum times and locations to launch a product promotion. But our data analysts feel that customer purchase information is, by itself, not enough information to allow The Big Office Supply Company to confidently launch a marketing campaign.
We’re going to look at the data we’ve collected regarding customers who visit our website, to gain a better understanding of who they are and what they want.
When people view product pages on our website, we collect click data in a nested JSON format and log files as flat text.
We also collect all user product search requests, which are regularly converted into Parquet format.
Analyzing the Data
Our data analysts want to query all of the files mentioned above for additional insights that will enhance the effectiveness of our marketing promotion.
Additionally, our data analysts take a look at what products people are purchasing.
When coupled with information from our products database, we can provide targeted coupons for upgrades or consumables associated with those products.
How Drill Simplifies Querying Semi-Structured Data
Usually, analysts would not be able to use SQL to work with these types of semi-structured files.
Most data query tools don’t work well with semi-structured data, because they rely on the structural organization provided by a schema to locate information. To query semi-structured data, an analyst would need to enlist a data engineer to help write a MapReduce program, or perform a lengthy ETL process to flatten the data and create a centralized schema for it.
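To make the "flatten" step concrete, here is a minimal Python sketch of what such an ETL stage might do to one nested click record. The record, its field names, and the flatten helper are all hypothetical, invented for illustration; they are not taken from the course data.

```python
import json


def flatten(record, parent_key="", sep="_"):
    """Collapse nested dicts into one level, joining key names with sep.

    Lists are left as-is; a real ETL job would also have to decide
    how to handle them (for example, one output row per element).
    """
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical nested click record (illustrative field names and values).
raw = (
    '{"trans_id": 1001,'
    ' "user_info": {"cust_id": 42, "device": "tablet"},'
    ' "trans_info": {"prod_id": [7, 19], "purch_flag": "false"}}'
)

row = flatten(json.loads(raw))
print(row)
```

Every new nested field means revisiting code like this and the target schema, which is the overhead the paragraph above describes.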
Drill removes these steps. Analysts can query semi-structured files directly with SQL, without first defining a schema or flattening the data.
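As a rough sketch of what this looks like in practice, the Python snippet below prepares a SQL query against a nested JSON file for submission to Drill's REST interface. The file path, the field names (trans_id, user_info.cust_id), and the build_request helper are hypothetical, and the address assumes a Drill instance listening locally on its default web port of 8047. Nothing is sent until the request is explicitly opened.

```python
import json
import urllib.request

# Assumption: a local Drill instance (such as a sandbox) on the default web port.
DRILL_URL = "http://localhost:8047/query.json"

# Hypothetical file path and field names. Nested fields are reached with
# dot notation through a table alias (t), with no schema defined beforehand.
sql = (
    "SELECT t.trans_id, t.user_info.cust_id AS cust_id "
    "FROM dfs.`/tmp/clicks.json` t "
    "LIMIT 5"
)


def build_request(query, url=DRILL_URL):
    """Build (but do not send) a POST request carrying a SQL query for Drill."""
    body = json.dumps({"queryType": "SQL", "query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(sql)

# To run the query against a live Drill instance, uncomment:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

The same SQL can be pasted into the Drill shell or web console in the sandbox; the Python wrapper is only one way to submit it.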
Part 3 of this three-part series will explore how Drill discovers the schema of data on the fly.