5 min read
Big Data analysis is intimidating to some people. They assume you need a background in statistics, deep technical knowledge, and other complex skills. But you don’t need to be a data scientist to extract insights and value from Big Data with Hadoop and Apache Drill.
Drill uses standard ANSI SQL to query a variety of structured and semi-structured data types. Users can therefore leverage their existing SQL skills, and preferred business intelligence (BI) tools, to run queries. And since Drill can combine data from multiple sources on the fly in a single query, even a non-expert user can dig deep into big data. Drill can run on a laptop and scale from a single node up to a large cluster of servers, enabling data querying in its native formats. Plus, there’s no need to define schemas, which allows for self-service data exploration.
This walkthrough is intended for those who have limited or no experience with Drill. It assumes some familiarity with running SQL queries, however. The following information can be utilized in corporate training programs or distributed to users to help them build confidence in working with Drill.
This is the first of a three-part series. Each article in the series will build on the basic skill sets detailed here. The material below shows how to use Drill to perform familiar SQL queries on structured data.
The two subsequent guides will show how to use Drill to:
The Drill Sandbox and Material
Even if Drill is already deployed on your network, you may find that users (and IT staff) are more comfortable if beginners learn Drill in a sandbox environment. The easiest way to do this is with MapR Sandbox with Apache Drill, a free and fully functional single-node cluster.
To access the materials and data to follow along with the steps in this blog, register for the Drill Essentials course and download the content from Course Materials sheet. You will also find a link to download the Drill Sandbox in there.
For the purposes of this walkthrough, we are all employees of The Big Office Supply Company, a large retail business. The Big Office Supply Company uses data analysis to better understand their customers’ activities and purchasing habits.
The Big Office Supply Company marketing department wants to begin a new sales promotion. We’re going to help them decide when and where to focus their promotional efforts. To do this, we will determine:
Preparing to Analyze Data
Our IT Department has offloaded its customer data from a transactional Oracle system into a Hive table on a Hadoop cluster.
This table, called “orders,” includes columns for the order ID number, the month the order was placed, the customer ID of the person who placed the order, the country where the customer lives, the product ID of the order, and the total amount of the order.
We’ll use Drill to query this table to find the information the marketing department needs.
Querying with Drill
As the slide above shows, you begin with Query 1 to determine the top month of gross sales.
month, SUM(order_total) as sales
Query 2 provides the sales total in the specified month, broken down by state.
state, SUM(order_total) as sales
Query 3 determines the top 10 products based on volume of sales.
prod_id, SUM(order_total) as sales
We now know what products are most popular, and when customers in which states are most likely to make purchases. This information will help The Big Office Supply Company marketing department determine which products to promote at what time of year, and enable them to target the campaign to specific states.
As seen here, running queries on Drill is extremely easy and intuitive for a SQL user. Drill also works with BI tools if you prefer to use it that way.
Stay tuned for part 2 of this series.
Stay ahead of the bleeding edge...get the best of Big Data in your inbox.