ETL and Interactive Analytics with Apache Spark and Apache Drill – Whiteboard Walkthrough


In this week’s Whiteboard Walkthrough, Vinay Bhat, Solution Architect at MapR Technologies, takes you step-by-step through a widespread big data use case: data warehouse offload and building an interactive analytics application using Apache Spark and Apache Drill. Vinay explains how the MapR Data Platform provides unique capabilities to make this process easy and efficient, including support for multi-tenancy.

Here's the unedited transcription:

Hi, my name is Vinay Bhat, and I'm a Solution Architect at MapR. Data warehouse offload is the most common use case we see in the industry. We see very high adoption rates as well as high ROI for customers who build data warehouse offloads on big data platforms. Today, we'll walk through the building blocks of building an analytics application with Apache Spark and Apache Drill.

Let's start with your data sources. The most common ones are ERP data, CRM data, transaction or POS data, and third-party data that you bring in, as well as loyalty information, membership information, and so on. We provide a very easy way to ingest that data into the MapR Data Platform. You can use NFS capabilities, which make ingestion extremely easy. You can also use open source tools like Apache Sqoop and Apache Flume. We also provide tools like MapR Event Store to ingest data into the MapR Data Platform.
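Because the MapR file system can be mounted over NFS, ingestion can be a plain filesystem copy, with no special connector. The sketch below is illustrative only: in production the destination would be the NFS mount (conventionally something like `/mapr/<cluster>/...`, a hypothetical path here), but a temporary directory stands in for it so the snippet is self-contained and the data is synthetic.

```python
import shutil
import tempfile
from pathlib import Path

# In production this would be the NFS-mounted cluster path,
# e.g. /mapr/my.cluster.com/raw/pos/ (hypothetical, shown for shape).
# A temp directory stands in so the sketch runs anywhere.
landing_zone = Path(tempfile.mkdtemp())

# A source file as it might arrive from a POS export (synthetic data).
src = Path(tempfile.mkdtemp()) / "pos_2016-08-17.csv"
src.write_text("txn_id,store,amount\n1001,42,19.75\n1002,7,3.50\n")

# Ingestion is just a copy over the mounted filesystem.
shutil.copy(src, landing_zone / src.name)

print(sorted(p.name for p in landing_zone.iterdir()))
```

The same pattern applies to any tool that can write to a filesystem; tools like Sqoop or Flume would instead write through their own Hadoop-compatible sinks.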

Once the data is ingested, the next step is to make sure the data is harmonized and massaged: you transform it, you aggregate it, you calculate business metrics, and you apply business rules. To build this workflow, generally called an ETL workflow, we use Apache Spark. Apache Spark is a very powerful tool; we see very high adoption and high success rates in building ETL workflows with it. Once this is done, the data sits in your data lake, or Hadoop, or whatever you want to call it, transformed and persisted in the MapR Data Platform.
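In a real deployment this transform/aggregate step would be a Spark job; to keep the sketch self-contained, the same logic is shown in plain Python over a handful of synthetic records. The column names and the "revenue per store" metric are made up for illustration.

```python
from collections import defaultdict

# Raw transactions as they might land from the POS feed (synthetic rows).
raw = [
    {"store": "42", "amount": "19.75"},
    {"store": "42", "amount": "5.25"},
    {"store": "7",  "amount": "3.50"},
    {"store": "7",  "amount": ""},      # malformed record: missing amount
]

# Harmonize: drop malformed rows and cast types (the "massage" step).
clean = [
    {"store": r["store"], "amount": float(r["amount"])}
    for r in raw if r["amount"]
]

# Aggregate: a business metric such as revenue per store.
revenue = defaultdict(float)
for r in clean:
    revenue[r["store"]] += r["amount"]

print(dict(revenue))  # {'42': 25.0, '7': 3.5}
```

In Spark the aggregation would be roughly `df.groupBy("store").sum("amount")` over a DataFrame, running the same logic in parallel across the cluster.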

Let's talk a little bit about your analytics application. The most common requirement we see is that the analytics application needs to provide interactive analytics capabilities for end users. With interactive analysis, performance is extremely important: you are building dashboards and ad-hoc queries, which means end users sitting in front of the computer need to get the data extremely fast. Also, big data platforms are by nature multi-tenant; there are multiple tenants using the platform. It's extremely important that we provide those multi-tenancy capabilities from the ground up, so you have data isolation, controls for more secure access to the data, and the ability to build separate views on top so you can control which users can see which data.

In addition to that, it's even better if you have a self-service capability, so users can bring their own data and then derive further insight by correlating the data they upload with the data that is already in the platform. Let's walk through how that is done.

We most commonly see people using columnar data formats like Parquet, so data is stored in a columnar format. In the old days, you would have an enterprise data warehouse, and then data marts built for a specific line of business, which take data from the warehouse, build their own data models, and drive dashboards and operational reports. In a similar fashion, MapR provides what we call MapR volumes. A volume provides data isolation controls as well as data policy controls, so you can isolate data for different tenants and define separate data policies on each volume.
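Why columnar? When a query touches only one or two columns, a columnar layout lets the engine read just those values instead of every field of every record. The toy illustration below shows the idea only; it is not Parquet's actual on-disk encoding.

```python
# Row layout: each record's fields are stored together.
rows = [
    {"txn_id": 1, "store": "42", "amount": 19.75},
    {"txn_id": 2, "store": "7",  "amount": 3.50},
    {"txn_id": 3, "store": "42", "amount": 5.25},
]

# Columnar layout: each column's values are stored contiguously.
columns = {k: [r[k] for r in rows] for k in rows[0]}

# A scan like SUM(amount) touches only the one column it needs;
# with row storage it would read every field of every record.
total = sum(columns["amount"])
print(total)  # 28.5
```

Real columnar formats add encoding and compression per column (similar values compress well together), which is a large part of the interactive-query speedup.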

Now we store the individual tables for each tenant in a MapR volume. Once the data is stored, Apache Drill provides a distributed query engine with ANSI SQL support, so users can query the data with SQL over a JDBC or ODBC connection. Apache Drill is a highly distributed, highly performant query engine for data that is already in Hadoop or the MapR Data Platform.
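Because Drill speaks ANSI SQL over JDBC/ODBC, a dashboard or ad-hoc tool just issues ordinary SQL. The snippet below uses Python's built-in sqlite3 purely as a stand-in to show the query shape; against Drill you would connect through a JDBC/ODBC client, and the table name would typically be a file or directory path on the tenant's volume (hypothetical details, not shown here).

```python
import sqlite3

# sqlite3 stands in for a JDBC/ODBC connection to a Drill endpoint.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("42", 19.75), ("7", 3.50), ("42", 5.25)],
)

# The kind of ad-hoc, interactive query a BI tool would send.
results = conn.execute(
    "SELECT store, SUM(amount) AS total "
    "FROM sales GROUP BY store ORDER BY total DESC"
).fetchall()
print(results)  # [('42', 25.0), ('7', 3.5)]
```

The point is that nothing on the client side is Hadoop-specific: any tool that can issue SQL over JDBC/ODBC can drive the query engine.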

I spoke a little bit about self-service capabilities. Users can bring in their own data; once they upload it, it resides in their own work area, defined by MapR volumes. The data they upload is visible only to that tenant, and it is accessible through Apache Drill, so they can run dashboards, ad-hoc analysis, or exploratory analysis on the uploaded data alongside the data that's already in the MapR Data Platform. We most commonly see people using Tableau or other BI tools, and in some cases customers build their own custom UI with D3.js and the like. As you can see, this entire data flow is very modular in nature. You can extend it further by building a UDF or a workflow that can be deployed at the click of a button. Any new data coming in, or new custom or third-party files users want to upload, can be easily uploaded, and they can build ETL workflows and exploratory data analysis with Apache Spark and Apache Drill.
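The self-service correlation step boils down to a join between the uploaded file and data already on the platform. Here is that join sketched in plain Python over synthetic data; in practice Drill would do this with a single SQL JOIN across the tenant's volume and the shared tables, and the column names below are invented for illustration.

```python
import csv
import io

# Reference data already on the platform: store -> region (synthetic).
platform = {"42": "west", "7": "east"}

# A CSV a user uploads into their own volume (synthetic content).
uploaded = io.StringIO("store,forecast\n42,30.0\n7,4.0\n")

# Correlate: enrich each uploaded row with platform data (a hash join).
joined = [
    {**row, "region": platform.get(row["store"], "unknown")}
    for row in csv.DictReader(uploaded)
]
print(joined)
```

Unknown stores fall through to "unknown" rather than being dropped, which mirrors a LEFT JOIN; an inner join would simply filter those rows out instead.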

We'd love to hear from you. If you have any comments or feedback, leave them in the comments section below. More resources are available in the MapR Community.

This blog post was published August 17, 2016.