Rarely do enterprise data science applications operate on data entirely contained within a single database system. Take for instance a company that wants to build a Customer 360 application using data sources across its enterprise to develop marketing campaigns or recommendation engines that more accurately target the concerns of key customer groups. To develop this kind of application, you would be faced with a set of heterogeneous datasets such as CRM data, customer support data, and payment records that need to be migrated and transformed into a common format simply to perform analysis. This can be challenging when your datasets are large, their schemas change, or they're stored as schemaless CSV or JSON files.
The solution to this challenge is to use a database technology such as Apache Drill that can quickly process SQL queries across disparate datasets without moving or transforming those datasets. In this article, I illustrate how Apache Drill can be used to query heterogeneous datasets easily and quickly.
Apache Drill is a Unix service that unifies access to data across a variety of data formats and sources. It can run on a single node or on multiple nodes in a clustered environment. It supports a variety of NoSQL databases and file systems, including HBase, MongoDB, HDFS, Amazon S3, Azure Blob Storage, and local text files such as JSON and CSV. Drill provides the following user-friendly features:
Drill supports industry standard APIs such as ANSI SQL and JDBC.
Drill enables you to perform queries without predefining schemas.
Drill enables you to join data in different formats from multiple datastores.
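For example, Drill can query a raw JSON file in place, with no schema declared up front. Here's a sketch of such a query; the file path and column names are hypothetical, and dfs is Drill's built-in file-system storage plugin:

```sql
-- Query a JSON file directly; Drill infers the schema at read time.
SELECT name, state
FROM dfs.`/data/customers.json`
LIMIT 10;
```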
Drill is primarily designed to work with non-relational datasets, but it can also work with any relational datastore through a JDBC driver. To set up a JDBC connection, follow these three steps:
I'm using a three-node MapR cluster for my development environment, so I already have Drill installed on each node. Drill is included in the standard MapR distribution, although MapR does not officially support the RDBMS plugin.
I downloaded the MySQL JDBC connector from https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.41.tar.gz and copied it to /opt/mapr/drill/drill-1.9.0/jars/3rdparty/ on each of my nodes.
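That step looks roughly like this on each node; this is a sketch assuming Drill 1.9.0 is installed under /opt/mapr/drill:

```shell
# Download and unpack the MySQL JDBC connector (version 5.1.41).
wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.41.tar.gz
tar xzf mysql-connector-java-5.1.41.tar.gz

# Copy the connector jar into Drill's third-party jar directory,
# then restart the Drill service so it picks up the new jar.
sudo cp mysql-connector-java-5.1.41/mysql-connector-java-5.1.41-bin.jar \
    /opt/mapr/drill/drill-1.9.0/jars/3rdparty/
```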
I defined the MySQL storage configuration from the Drill web UI as shown below:
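A minimal RDBMS storage plugin configuration looks like the following; the host and credentials here are hypothetical placeholders for your own MySQL service:

```json
{
  "type": "jdbc",
  "driver": "com.mysql.jdbc.Driver",
  "url": "jdbc:mysql://localhost:3306",
  "username": "drill_user",
  "password": "drill_pass",
  "enabled": true
}
```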
If you see the error "Please retry: error (unable to create/ update storage)" when you try to save the storage plugin, you've probably specified an invalid URL for the MySQL service, the credentials are invalid, or something else is preventing the Drill Unix service from connecting to your MySQL service.
After you click the button to enable the plugin, you should see your MySQL databases listed when you run show schemas in Drill, as shown below:
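If the plugin is named mysql, each MySQL database shows up as a schema named mysql.&lt;dbname&gt;. Here's hypothetical sqlline output (your schema names will depend on your own databases):

```
0: jdbc:drill:> show schemas;
+---------------------+
|     SCHEMA_NAME     |
+---------------------+
| INFORMATION_SCHEMA  |
| dfs.default         |
| dfs.tmp             |
| mysql.crm           |
| sys                 |
+---------------------+
```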
Here are a couple of examples showing how to join JSON files with MySQL tables using a SQL join in Drill:
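This sketch shows the general shape of such a join; the file path, table, and column names are all hypothetical:

```sql
-- Join a JSON file of support tickets with a MySQL customers table.
-- mysql.crm.customers is <plugin>.<database>.<table>.
SELECT c.name, t.subject, t.opened_at
FROM mysql.crm.customers c
JOIN dfs.`/data/tickets.json` t
  ON c.id = t.customer_id
LIMIT 10;
```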
To use Drill from Python, R, or any other programming language, you have to install an ODBC driver. The official Apache Drill docs describe how to install the driver on CentOS or Red Hat Linux, but they don't cover Ubuntu, so I will. Here's how to install the MapR ODBC driver on Ubuntu 14.04.
First download and install the latest MapR ODBC rpm, like this:
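Ubuntu has no native rpm support, so one approach is to convert the rpm with alien. The rpm URL and filename below are placeholders; get the real ones from MapR's download page:

```shell
# Placeholder URL; substitute the current MapR Drill ODBC rpm.
wget http://package.mapr.com/tools/MapR-ODBC/MapR_Drill/MapRDrillODBC-1.2.1.x86_64.rpm

# Convert and install the rpm on Ubuntu using alien.
sudo apt-get install -y alien
sudo alien -i MapRDrillODBC-1.2.1.x86_64.rpm
```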
The driver ships with sample .ini configuration files; update them accordingly. Here is how I set up my .ini files to connect to Drill.
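Here's a sketch of the key settings, assuming the driver installed under /opt/mapr/drillodbc, a Drillbit listening on localhost:31010, and a DSN named drill64 (all of these are placeholders for your environment, and the driver library name can vary by driver version):

```ini
; ~/.odbc.ini : defines the data source name (DSN)
[ODBC Data Sources]
drill64=MapR Drill ODBC Driver 64-bit

[drill64]
Driver=/opt/mapr/drillodbc/lib/64/libmaprdrillodbc64.so
ConnectionType=Direct
HOST=localhost
PORT=31010
AuthenticationType=No Authentication

; ~/.odbcinst.ini : registers the driver itself
[MapR Drill ODBC Driver 64-bit]
Driver=/opt/mapr/drillodbc/lib/64/libmaprdrillodbc64.so
```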
Now, you need to install the Python ODBC library. See https://github.com/mkleehammer/pyodbc/wiki/Install for thorough instructions. Here's the gist of how you install pyodbc on Ubuntu and CentOS.
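In short, pyodbc builds against the unixODBC headers, so install those first. The commands below assume pip is already present:

```shell
# Ubuntu: install unixODBC headers, then build pyodbc.
sudo apt-get install -y unixodbc-dev
sudo pip install pyodbc

# CentOS equivalent:
# sudo yum install -y gcc-c++ python-devel unixODBC-devel
# sudo pip install pyodbc
```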
Finally, if you've installed and configured the ODBC driver correctly, the command shown below will output your ODBC configuration:
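unixODBC's odbcinst tool reports its version and the configuration file paths the driver manager reads, which is a quick way to confirm your .ini files are where it expects:

```shell
# Print unixODBC's version and the config file locations it reads.
odbcinst -j
```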
Drill Explorer is the desktop GUI for Linux, Mac OS, and Windows. It's useful for browsing data sources and previewing the results of SQL queries. People commonly use it to familiarize themselves with data sources and prototype SQL queries, then use another tool for actually analyzing that data in production.
Download and install Drill Explorer and the iODBC driver for Mac OS from the Apache Drill download pages. Then follow these instructions to set up the ODBC driver:
Now you should have applications available in /Applications/iODBC/ and /Applications/DrillExplorer.app/. Before you can connect Drill Explorer to Drill, create a new connection from the iODBC Data Source Administrator. Instructions for configuring ODBC connections can be found at https://drill.apache.org/docs/testing-the-odbc-connection. Here's what my configuration looks like:
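The DSN I created in the iODBC Data Source Administrator amounts to keyword/value pairs like the following sketch; the driver path is a guess at the Mac install location, and host and DSN name are placeholders:

```ini
; Hypothetical Mac OS DSN; point Driver at wherever the
; MapR Drill ODBC driver actually installed.
[drill64]
Driver=/Library/mapr/drill/lib/libdrillodbc_sbu.dylib
ConnectionType=Direct
HOST=localhost
PORT=31010
AuthenticationType=No Authentication
```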
Once you connect Drill Explorer using the iODBC configuration you created, you should see all the available data sources and preview the results of SQL queries as shown below:
You can also use the ODBC connection you configured above to programmatically query Drill data sources. MapR blogged about how to use Drill from Python, R, and Perl. I used those instructions to set up an ODBC connection in a Jupyter notebook for data exploration with Python, as shown below:
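A sketch of that kind of notebook code, assuming a DSN named drill64 and a hypothetical JSON file of customers:

```python
import pyodbc
import pandas as pd

# Connect through the DSN defined in ~/.odbc.ini.
# Drill doesn't support transactions, so autocommit must be True.
conn = pyodbc.connect("DSN=drill64", autocommit=True)

# Load query results straight into a pandas dataframe for exploration.
sql = "SELECT name, state FROM dfs.`/data/customers.json` LIMIT 100"
df = pd.read_sql(sql, conn)
print(df.head())
```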
You can also connect to Drill from R Studio using the ODBC connection you configured above. The following shows how I set that up.
Download and install the RODBC driver:
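RODBC builds against unixODBC, so on Ubuntu install unixodbc-dev first, then from an R session:

```r
# Install the RODBC package from CRAN.
install.packages("RODBC")
```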
Set your LD_LIBRARY_PATH:
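This lets R's ODBC bindings find the Drill ODBC driver's shared libraries (the path assumes the default install location):

```shell
# Add the Drill ODBC driver libraries to the dynamic linker search path.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/mapr/drillodbc/lib/64
```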
Open R Studio and submit a SQL query:
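A sketch using RODBC, with the DSN and query names hypothetical:

```r
library(RODBC)

# Connect using the DSN from ~/.odbc.ini.
ch <- odbcConnect("drill64")

# Run a Drill query and pull the results into a dataframe.
df <- sqlQuery(ch, "SELECT name, state FROM dfs.`/data/customers.json` LIMIT 100")
head(df)

odbcClose(ch)
```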
For more information on using Drill with R, check out the Apache Drill documentation and the RODBC package documentation.
Dealing with unicode in the Python ODBC driver has caused me a lot of frustration. Fortunately, specifying unicode decoding options as shown below seems to help. Anaconda's build of pyodbc doesn't seem to like the "to=str" option, though, so I just don't use Anaconda.
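Here are the options I set right after connecting. The DSN is hypothetical, and note that the to= keyword existed in pyodbc builds of that era; current pyodbc releases drop it and accept only encoding/ctype arguments:

```python
import pyodbc

conn = pyodbc.connect("DSN=drill64", autocommit=True)

# Decode CHAR and WCHAR data from the driver into plain str values.
# Some pyodbc builds (e.g. Anaconda's) reject the to= keyword.
conn.setdecoding(pyodbc.SQL_CHAR, encoding='utf-32le', to=str)
conn.setdecoding(pyodbc.SQL_WCHAR, encoding='utf-32le', to=str)
```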
With those options set, column names and values come back as plain ASCII strings.
If you don't set the unicode options shown above, you'll get unicode objects (e.g. u'customer_id') instead.
You can convert that to ASCII with something like:
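In Python 2 that meant encoding each value to ASCII by hand. Here's a standalone sketch with hypothetical column names; note that in Python 3, encode() yields bytes rather than a native string:

```python
# Hypothetical unicode column names as returned by the ODBC driver.
cols = [u'customer_id', u'name', u'state']

# Encode each name to ASCII (str in Python 2, bytes in Python 3).
ascii_cols = [c.encode('ascii') for c in cols]
print(ascii_cols)
```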
but I find that really confusing. Setting the unicode options described above on the ODBC connection ensures that ODBC returns ASCII column names and dataframe values instead of unicode.
I like to test Python's access to Drill via the ODBC driver with a command like this:
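This one-liner connects through a hypothetical drill64 DSN and queries Drill's sys.version system table:

```shell
# Smoke-test the ODBC connection from Python.
python -c 'import pyodbc; conn = pyodbc.connect("DSN=drill64", autocommit=True); print(conn.cursor().execute("SELECT * FROM sys.version").fetchall())'
```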
If it works, the query results print without errors.
Often I'll get a "file not found" error from the unixODBC driver manager, saying it can't open the Drill ODBC driver library.
If you get that error, run this command to figure out what file is not found:
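ldd lists each shared-library dependency and whether it resolves. The path below assumes the default install location, and the exact .so name varies by driver version:

```shell
# Show which of the Drill ODBC driver's dependencies resolve.
ldd /opt/mapr/drillodbc/lib/64/libmaprdrillodbc64.so
```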
If that output shows dependencies marked as "not found", then you need to add /opt/mapr/drillodbc/lib/64 to your LD_LIBRARY_PATH, like this:
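```shell
# Make the Drill ODBC libraries visible to the dynamic linker.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/mapr/drillodbc/lib/64
```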
Keep in mind that the ODBC driver returns column names and table contents as unicode unless you set the decoding options described earlier.
What I've tried to show is that no matter how your data is formatted or where it's stored, Apache Drill enables you to easily query and combine datasets from a wide variety of databases and file systems. That makes Apache Drill an extremely valuable tool for applications such as Customer 360, which require access to disparate datasets and the ability to combine them with low-latency SQL joins.