This blog post walks through the installation and basic usage of jupyter_drill, a Python module that lets you connect to Apache Drill from a Jupyter Notebook and work with its data using IPython magic functions. If you are looking for the design goals of the project, please see my other blog post, Mining the Data Universe: Sending a Drill to Jupyter, which covers how this module came to be and the design considerations behind it.
This guide assumes you have a Jupyter Notebook server you can work with. It also assumes that you have a Drill cluster, which you can connect to, already running. Getting a Drillbit or cluster running is beyond the scope of this tutorial, but I recommend you check out the Apache Drill documentation here: https://drill.apache.org/docs/. Following these steps will make the jupyter_drill module available to your notebooks.
There are two optional things to consider when starting your Jupyter notebook server: the ENV variables and the startup scripts. Both can make for a better and easier user experience.
When you start your notebook server, there are two environment variables you can set to make life easier for you or your users: JPY_USER and DRILL_BASE_URL.
Both variables are optional, but they save time when connecting to Drill. JPY_USER is the username the jupyter_drill module will default to, and DRILL_BASE_URL is the URL of the Drill REST API to connect to.
If unset, both items will be prompted for at connect time. In addition, if you set these variables, there is an option to connect to a Drill server with an alternative context if required; instead of %drill connect, just type %drill connect alt, and it will prompt you!
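To make the "use the variable if set, otherwise prompt" behavior concrete, here is a minimal sketch of that lookup pattern. It is not the module's actual code: the resolve_setting helper, the prompt text, and the example URL are all hypothetical, and only the two variable names come from this post.

```python
import os
from typing import Callable, Mapping


def resolve_setting(
    name: str,
    prompt: Callable[[str], str] = input,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Return env[name] if it is set and non-empty, otherwise ask the user."""
    value = env.get(name, "").strip()
    if value:
        return value
    # Hypothetical prompt wording; the real module's prompt may differ.
    return prompt(f"Please enter a value for {name}: ").strip()


# Demo with a fake environment and a canned answer so nothing blocks on input.
fake_env = {"JPY_USER": "alice"}  # DRILL_BASE_URL deliberately left unset
user = resolve_setting("JPY_USER", prompt=lambda msg: "unused", env=fake_env)
url = resolve_setting(
    "DRILL_BASE_URL",
    prompt=lambda msg: "https://drill.example.com:8047",  # placeholder URL
    env=fake_env,
)
print(user, url)
```

In this demo JPY_USER is taken from the (fake) environment, while DRILL_BASE_URL falls through to the prompt because it was never set.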
You have the option to place startup Python scripts in your ~/.ipython/profile_default/startup folder. Many people prefix them with numbers (00-firstscript.py, 01-secondscript.py), so they run in the order intended.
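IPython runs the files in that folder in lexicographic order, which is why the numeric prefixes work. The sketch below only illustrates that ordering: it creates placeholder files in a temporary directory rather than touching your real startup folder, and 10-drill.py is a made-up file name.

```python
import tempfile
from pathlib import Path
from typing import List


def startup_order(startup_dir: Path) -> List[str]:
    """Return the .py file names in startup_dir sorted lexicographically."""
    return sorted(p.name for p in startup_dir.glob("*.py"))


# Build a throwaway folder with three placeholder scripts, created out of order.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("10-drill.py", "00-firstscript.py", "01-secondscript.py"):
        (Path(tmp) / name).write_text("# placeholder\n")
    order = startup_order(Path(tmp))

print(order)
```

Because the comparison is on text, not numbers, zero-padding matters: a file named 2-foo.py would sort after 10-drill.py.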
Using this feature of Jupyter Notebooks, we can make our notebooks cleaner, and the %drill magic function will work on every notebook! To do this, create a file, such as:
In this file, place the following text inside it:
# Begin code block
from drill_core import Drill
ipy = get_ipython()
Drill = Drill(ipy)
# or, if you want to set options like use beakerX on without user intervention, try:
# Other options can be set at startup here as well!
#Drill = Drill(ipy, pd_use_beaker=True)
ipy.register_magics(Drill)
# End code block
Save the file, and you will be ready to go.
This step is 100% optional; it just saves the folks running your notebooks from having to put that block of code at the top of every notebook they create in order to use the %drill magic function.
Okay, start your notebook server! Open a new notebook on the server where you have installed jupyter_drill. If you did not follow the optional Startup Script recommendation above, the first thing you will need to do is register the %drill magic function by running the same block of code from the startup script section in your first cell:
Next, let’s run %drill, and see our base help screen:
Pretty basic! The next thing we need to do is make the connection to our Drill server/cluster:
When you type %drill connect, it prompts you for a password. The user and Drill URL, for me, came from the ENV variables listed above. I typed my password, and Drill was connected! Note, the password is NOT actually stored in your notebook. This allows you to share your notebook without fear of compromising your Drill password.
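One common way to keep a password out of a notebook file is to read it at run time with Python's getpass and hold it only in kernel memory. The sketch below illustrates that pattern; it is an assumption about the general approach, not the module's actual implementation, and the DrillSession class and its attributes are hypothetical.

```python
import getpass
from typing import Callable, Optional


class DrillSession:
    """Hypothetical holder for connection state (not the jupyter_drill class)."""

    def __init__(self, user: str, base_url: str) -> None:
        self.user = user
        self.base_url = base_url
        self._password: Optional[str] = None  # lives in kernel memory only

    def connect(self, ask: Callable[[str], str] = getpass.getpass) -> None:
        # getpass reads the secret without echoing it, so it never appears
        # in the cell's source or in its saved output.
        self._password = ask(f"Password for {self.user}: ")

    def __repr__(self) -> str:
        # The password is deliberately left out of the printable form.
        return f"DrillSession(user={self.user!r}, base_url={self.base_url!r})"


# Demo with a canned answer instead of an interactive prompt.
session = DrillSession("alice", "https://drill.example.com:8047")  # placeholders
session.connect(ask=lambda prompt: "s3cret")
print(session)
```

Since the secret is typed into a prompt rather than into a cell, saving and sharing the .ipynb file does not carry it along.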
Now, if you did not get an error, you should be connected. (If you did get an error, please feel free to post an issue on https://github.com/johnomernik/jupyter_drill, and I will be happy to help.) You can see the drill status by running %drill status.
At this point, you can start using Drill! Here are a couple of queries to use a workspace called dfs.prod and show the tables I have:
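For readers curious what a statement such as SHOW TABLES looks like on the wire, Drill's REST API accepts a JSON document posted to /query.json. The sketch below only builds that request and does not send it; the build_query_request helper and the localhost URL are placeholders, authentication and session handling are omitted, and I am not claiming this is exactly how the module issues its calls.

```python
import json
import urllib.request


def build_query_request(base_url: str, sql: str) -> urllib.request.Request:
    """Build (but do not send) a POST to Drill's /query.json endpoint."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder base URL; 8047 is Drill's default web/REST port.
req = build_query_request("http://localhost:8047", "SHOW TABLES")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would submit it to a running Drillbit. A statement like USE dfs.prod changes session state, so it would need the same session carried across calls, which this sketch leaves out.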
And now, the "big deal" – running a real data query! (This is weather data from my Personal Weather Station [PWS].)
And look, the results from above are also in the Pandas DataFrame variable prev_drill!
This variable is always overwritten, every time you run a %%drill query. Thus, if you want to save some results for manipulation, assign prev_drill to a new variable.
So here I assigned prev_drill with the above results to the variable myvar and ran another %%drill query. Now, prev_drill has the results of the show tables query, and myvar has the previous results. Handy.
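The reason this works is ordinary Python name binding: each query binds a new result object to prev_drill, so a second name pointing at the old object keeps it alive. Here is a small stand-alone sketch of that behavior. The namespace dict and the run_query function are hypothetical stand-ins for the notebook and a %%drill cell, a list of dicts stands in for the DataFrame, and the rows are made-up placeholders.

```python
# Stand-in for the notebook's variables.
namespace = {}


def run_query(sql, fake_rows):
    """Hypothetical stand-in for a %%drill cell: binds fresh results to prev_drill."""
    namespace["prev_drill"] = list(fake_rows)


# First query: made-up weather rows.
run_query("SELECT * FROM some_table", [{"temp_f": 71.2}, {"temp_f": 69.8}])
myvar = namespace["prev_drill"]  # keep a handle on these results

# Second query rebinds prev_drill; myvar still points at the first result.
run_query("SHOW TABLES", [{"TABLE_NAME": "weather"}])

print(myvar)
print(namespace["prev_drill"])
```

With a real DataFrame, myvar = prev_drill.copy() is the safer choice if you plan to modify the saved results in place.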
Finally, although you don’t have to do this, you can disconnect from Drill:
This is a simple demonstration of the jupyter_drill module in action. It allows for flexible connections, lets you use the data in both a visual form AND a programmatic form, and lets you customize your connections to Drill. There are lots of improvements I can make here, but this is my initial Proof of Concept for the community. Please take a look at the repository, and feel free to post issues if you find them. Once again, to get an idea of my motivation and goals in writing this module, please see my other blog post, Mining the Data Universe: Sending a Drill to Jupyter. It also explains some of the more advanced options and lists the "to-do" items I’d love to see brought to this module. My goal is to have this be a robust tool that data scientists can use to interface with Apache Drill!