9 min read
Did you know you can run Apache Drill on your laptop? This is great news for business analysts who need to explore complex and semi-structured data. Let's look at a particular example.
A company has implemented a new SaaS based system. This system makes data extracts available over a RESTful API in JSON format. Before the data is loaded and standardised in the corporate data warehouse a business analyst gets tasked with exploring this new data set and a data extract is made available for analysis. A great tool for exploratory data analysis (EDA) is Tableau. Our business analyst immediately gets to work only to realise that querying JSON from Tableau (http://community.tableau.com/thread/147566) is not straight forward.
Drill to the Rescue
Our business analyst wonders what to do next. Should she involve the guys from IT to transform the data to something more easily digestible? This could take a week or more. Time is of the essence. There must be another way. After some more googling for a solution she comes across Drill. With Drill she can query JSON data using SQL, a skill she is deeply familiar with. Drill also ships an ODBC driver, which allows her to connect with Tableau.
She downloads Drill to her Windows laptop http://drill.apache.org/docs/installing-drill-on-windows/.
She checks that she has the Oracle 7 JDK installed.
She then proceeds to install Drill in embedded mode http://drill.apache.org/docs/installing-drill-on-windows/. She picks the root of her C:\ drive as the install destination.
Next she double checks that she has set the JAVA_HOME environment variable correctly.
She starts Drill http://drill.apache.org/docs/starting-drill-on-windows/.
This concludes the installation of Drill.
Next she downloads the Drill ODBC driver. (https://package.mapr.com/tools/MapR-ODBC/MapR_Drill/MapRDrill_odbc/).
Note: Always make sure that the version of Drill corresponds to the version of the ODBC driver.
She selects the 64 Bit driver as she also runs the 64 Bit version of Tableau. Once the driver has finished downloading she launches the installer.
… and waits for the install to complete.
Exploratory Data Analysis with Tableau
The business analyst is now ready to analyse the data dump. She downloads the weather data from the Open Weather http://openweathermap.org/current website. The JSON dataset we are interested in is the file http://188.8.131.52/sample/weather_14.json.gz that contains the current weather of 20,000 cities (updated hourly).
She extracts JSON file and copies it into her sample data folder in the Drill install folder C:\apache-drill-1.0.0\sample-data.
She opens the file in a text editor and copies one record into a JSON formatter.
http://jsonformatter.curiousconcept.com/ to get a better understanding of how the data is structured hierarchically.
We can see that the JSON document is split into various sections: city, time, main, wind, clouds, weather. We can also see that weather is modelled as an array in this JSON dataset. Weather is multivalued. Each city weather record may contain one or more descriptions. This discovery will become important later on, when we write queries against the data.
Next she wants to familiarise herself further with the data. Drill Explorer, a tool to visually explore Drill data ships with the Drill ODBC driver.
We can access Drill Explorer via the 64 bit ODBC Administrator in Windows. You can access the 64 bit ODBC driver via C:\WINDOWS\SysWOW64\odbcad32.exe.
Our business analyst moves to the System DSN tab and there clicks the Add… button.
Next she selects the MapR Drill ODBC driver:
She gives the new data source a name and…
…tests the connection by clicking the Test button:
She is now ready to launch Drill Explorer by clicking the Drill Explorer… button. She navigates to the JSON weather file in the sample data folder and then double clicks the weather_14.json file.
This throws an error:
At this stage our business analyst consults the documentation https://drill.apache.org/docs/json-data-model/ and finds the solution to fix this problem.
“By default, Drill does not support JSON lists of different types. For example, JSON does not enforce types or distinguish between integers and floating point values. When reading numerical values from a JSON file, Drill distinguishes integers from floating point numbers by the presence or lack of a decimal point. If some numbers in a JSON map or array appear with and without a decimal point, such as 0 and 0.0, Drill throws a schema change error. “
In the weather data set we have exactly this scenario. The field Pressure can be with or without decimal point.
The solution is to set the store.json.read_numbers_as_double property to true. In the Drill Explorer the BA switches to the SQL tab and issues the following command:
ALTER SYSTEM SET
store.json.read_numbers_as_double = true;
She confirms by clicking Preview:
Next she switches back to the Browse tab and double clicks the weather14.json file et voilá
The business analyst is now ready to further explore this data in Tableau. For her analysis she needs the following data points:
- Country Code
- City Name
- Geo Coordinates
- Temperature in Kelvin, Celcius, and Fahrenheit
- Weather Description
She creates a View over her dataset that will then be exposed to Tableau:
CREATE OR REPLACE VIEW dfs.tmp.city_weather
t.city['id'] AS city_id,
t.city['name'] AS city_name,
t.city['country'] AS country_cd,
t.city['coord']['lon'] AS city_lon,
t.city['coord']['lat'] AS city_lat,
time) AS datetime,
CAST(t.main['temp'] AS INTEGER) AS temp_k,
CAST(t.main['temp'] - 273.15 AS INTEGER) AS temp_c,
CAST((t.main['temp'] - 273.15) * 1.8 + 32 AS INTEGER) AS temp_f,
t.main['humidity'] AS humidity,
t.main['pressure'] AS pressure,
t.weather['description'] AS weather_desc_1,
t.weather['description'] AS weather_desc_2
There are a couple of things worthwhile noting:
timeso that it can be interpreted correctly.
Drill and Tableau a match made in heaven
Our business analyst finally is ready to run queries against the JSON weather dataset in Tableau.
She launches Tableau and selects Other Databases to connect to Drill.
Next she selects the Data Source Name (DSN) that she created earlier on through 64 bit ODBC Administrator.
Next she selects the Schema dfs.tmp and drags the city_weather view across.
In a last step our business analysts creates a map with world temperatures…
…and explores a particular region in more detail:
What’s NextOf course, Drill is not only drilliant to query data on a single laptop. It can be deployed in a clustered environment to query large volumes of data at scale with low latency and high concurrency.
Stay ahead of the bleeding edge...get the best of Big Data in your inbox.