Mining the Data Universe: Sending a Drill to Jupyter


Elon Musk thinks big in his energy plans, in his space exploration plans, and in his terrestrial transportation plans. Those plans have included sending a first-generation Tesla Roadster into space with some Bowie playing on the radio. Despite the title, my current aspirations are a bit more terrestrial in nature: I want to bring Apache Drill to users of Python-based Jupyter Notebooks.

This is actually very straightforward: blog posts such as https://blog.ouseful.info/tag/apache-drill/ and https://community.mapr.com/docs/DOC-1856-how-to-query-drill-in-python-from-jupyter-notebook show how users running Jupyter Notebooks can easily query Drill, using tools such as pydrill in the code. In fact, Drill's REST API makes it easy for Python coders to import requests, push up their sleeves, and work with Apache Drill to their heart's desire.

That said, when thinking about an enterprise implementation of Jupyter and Drill, I ran into a number of challenges that really bothered me. Although some of these seem minor, minor inconveniences can make life frustrating for a data scientist, or for an administrator charged with maintaining a platform.

Some Challenges

  • When I connected to secured versions of Drill, I hit a problem unique to notebooks: the connection string contains the password to Drill. This includes REST API calls. Access passwords embedded in a shareable medium like a notebook are inherently problematic. Even if the policy is to remove passwords before sharing, practice is littered with examples of forgetting to do so. Data without accountability is data at risk, and therefore I wanted an elegant solution to this issue.

  • Thus far, working with Drill programmatically is fairly simple:

    query = "select * from table"
    results = drill.execute(query)

    Two lines, and, boom, you have results. But then it's a few more lines to explore them, a few more to visualize, etc. I know, I'm splitting hairs, but if I have to write all this "extra" code for every query, that's going to frustrate me. Also, I don't want to quote my queries as Python strings; I just want to write SQL. I don't want to iterate through a variable to see my results; I just want to see them. I don't want to hand them to Pandas with the proper data frame function; I just want them available to work with. So how can I make this happen?
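To make that "extra" code concrete, here is a sketch of the helper a notebook user typically ends up writing by hand. The /query.json endpoint and the {"queryType": "SQL"} payload follow Drill's REST API; the function names and base URL are made up for illustration.

```python
import pandas as pd
import requests

def drill_query(base_url, sql, auth=None):
    """POST a SQL string to Drill's REST API and hand back a DataFrame."""
    resp = requests.post(
        f"{base_url}/query.json",
        json={"queryType": "SQL", "query": sql},
        auth=auth,
    )
    resp.raise_for_status()
    return rows_to_frame(resp.json())

def rows_to_frame(body):
    """Flatten Drill's {'columns': [...], 'rows': [...]} response into a DataFrame."""
    return pd.DataFrame(body["rows"], columns=body.get("columns"))

# Usage (hypothetical host):
# df = drill_query("https://drillbit:8047", "select * from cp.`employee.json` limit 5")
```

That is the boilerplate the %drill magic is meant to hide: quoting, the HTTP call, iterating the rows, and the DataFrame conversion all happen behind one line of SQL.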

  • How are sessions handled in long running notebooks? Do we have an idea of sessions, so we can set sys.options? What about using specific databases? Is this something the user has to handle in their Python code?

  • Repeated code: I wanted code that I could install into my notebook server or Docker image, so that users did NOT have to remember how to "invoke" Drill. Once implemented, it should work at an enterprise level, be secure, and provide an easy interface for users. It also provides a cleaner interface in Jupyter: less interface code, more data code.

  • As a security professional, I want the ability to do things correctly. That includes authentication and encryption. To that end, this needs to be able to handle certificates on Drill nodes.

  • I wanted this to be a tool not only for Drill-based data exploration but also something that a data scientist could use to pull results from Drill and manipulate with a common tool like Pandas DataFrames. This helps with built-in visualizations and also uses tools data scientists are already familiar with.

Given those guidelines, I set out to build a Python module that could do all that and more. So here, I am introducing https://github.com/JohnOmernik/jupyter_drill, a module for interacting with Drill from Python-based Jupyter notebooks. In another post, I will walk through installation and basic usage, but this post is focused on the architecture, limitations, and to-do list.

Architecture

The module is a simple Python module containing a class that is instantiated as an IPython magic function (%drill, in our case). The magic function:

  • Uses the Python requests module and the Drill REST API to run queries and maintain sessions. It also includes some debugging tools for SSL connections, with the intent that if Drill is secured with SSL, this tool should respect that.

  • Maintains password security, even if notebooks are shared. While connections to Drill from a notebook will prompt for a password, that prompt does NOT store the password in the notebook or display it on the screen. Thus, you can create a notebook, do analysis, save the notebook, and share that saved notebook with another user. You will each connect with your own usernames, and if both of you have access to the data, the other user will be able to replicate your work without sharing credentials.

Here is a quick example of how that works: first, we load the module. (This can be done in startup scripts, which is explained in my other blog post, Drilling Jupyter: Visualizing Data by Connecting Jupyter Notebooks and Apache Drill.)

Then, we connect.

The password is not stored in the notebook, but is instead only in memory!
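A minimal sketch of how that connect step can keep the password out of the saved notebook, using Python's standard getpass module. The function name and prompt here are illustrative assumptions, not the module's actual internals.

```python
import getpass

def connect_to_drill(user, prompt=getpass.getpass):
    """Prompt for the password at connect time. The typed password is held
    only in memory: it never appears on screen, and nothing writes it into
    the .ipynb file, so the notebook stays safe to share."""
    password = prompt(f"Drill password for {user}: ")
    # ... build the authenticated session with these credentials here ...
    return (user, password)
```

Because the prompt runs fresh in each session, a colleague who opens your shared notebook simply authenticates with their own credentials.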

  • Uses the built-in Pandas Table Display for displaying results. IPython has a way to display Pandas DataFrames in tabular format. That said, there is room for improvement when it comes to display, and to that end, installing BeakerX (instructions in next post) provides a rich interactive table that is automatically displayed if BeakerX is installed!

  • Returns data both to the screen and to a Pandas DataFrame automatically. Since Pandas is a very common choice in the Jupyter world and displays well in notebooks, I decided to force all results into a DataFrame. Results are only displayed if their count is less than the max_display variable.

The results are stored (overwritten on each query) in a Pandas DataFrame variable called prev_drill.

You can "save" them by assigning them to a new variable (here, I used "myvar") and then running a new query to show prev_drill being overwritten.

The variable prev_drill is now overwritten, but myvar remains the same!
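In plain Python terms, the pattern looks like this; prev_drill here is an ordinary variable standing in for the one the module manages, and the data is made up:

```python
import pandas as pd

# Query 1: results land in prev_drill
prev_drill = pd.DataFrame({"name": ["alice", "bob"]})

# "Save" them: rebinding prev_drill later will not touch myvar
myvar = prev_drill

# Query 2: prev_drill is overwritten, but myvar still holds the old results
prev_drill = pd.DataFrame({"name": ["carol"]})
```

Since each query rebinds prev_drill rather than mutating it, a simple assignment is all it takes to keep earlier results around.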

Limitations

Since this is just a proof of concept, there are some quirky things and limitations in the setup. This is not intended to be a comprehensive list, but instead to show that I know where some improvements could be made and also to open myself up to ideas from the community.

  • Testing. This has only been tested on my current cluster, which runs Drill in distributed mode with authentication and SSL enabled. To really be universal, this needs testing in single-server embedded mode, without authentication, and without SSL.

  • In my current setup, I have my Drillbits behind a DNS-based load balancer. This works well for most things; however, I found that when DNS returned a different IP on a later request, I lost all my session information (because sessions are not shared between Drillbits). This was frustrating. I did include some hacky workarounds that helped me:

    • drill_pin_to_ip: This does a DNS lookup and uses the resulting IP as the server for the session. It works great, because the IP address the session connects to never changes; however, it can cause trouble if you use a trusted SSL certificate, which expects the name of the host, not the IP.

    • In setting up the pinning to IP, I was able to use an add-on for the requests package called requests-toolbelt. This lets me connect to an IP address but still use the host name for the SSL certificate exchange. This option is drill_rewrite_host and can be set before connecting.

    • I also included some "debugging" SSL features. One is using a different CA bundle, or not verifying the CA at all (drill_verify). Another is ignoring SSL warnings (drill_ignore_ssl_warnings). Neither is recommended if you can avoid it.

  • I wanted to keep notebooks from being bogged down by tons of results, so I used the pd_ variables to limit and control the displayed output. By default, display is limited to 1,000 rows. Anything more than that is returned to the Pandas DataFrame but not rendered in the notebook.
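The idea behind drill_pin_to_ip can be sketched with nothing but the standard library: resolve the hostname once and keep using that one IP for the rest of the session. The function name below is illustrative, not the module's actual code.

```python
import socket
from urllib.parse import urlparse, urlunparse

def pin_to_ip(url):
    """Resolve the hostname once and rewrite the URL to use that IP, so
    every request in the session lands on the same Drillbit behind the
    load balancer. Returns the pinned URL plus the original hostname,
    which is still needed for SSL certificate validation."""
    parsed = urlparse(url)
    ip = socket.gethostbyname(parsed.hostname)
    netloc = f"{ip}:{parsed.port}" if parsed.port else ip
    return urlunparse(parsed._replace(netloc=netloc)), parsed.hostname
```

The returned hostname is what the drill_rewrite_host option re-attaches for the certificate check (requests-toolbelt provides a HostHeaderSSLAdapter for exactly this connect-by-IP, verify-by-name situation).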

To-Do List

Since this is my first "release," I have a number of to-do items that I want to see improved. I am hoping that by releasing this and sharing with the community, I can get feedback on making it easy to use, easy to customize, and easy to Drill your data! I already know I want to clean up the README.md and make the formatting of help screens within the module better, but what else would make a data scientist's life easier?

Conclusion

From an interface standpoint, my personal philosophy is that the interface should be as easy as possible. I think this PoC gets pretty close, but obviously, as I learn more about Jupyter and Drill itself, I may find ways to make this module even more intuitive and easy to use for data scientists. If all of this seems great to you, head over to my other blog post, Drilling Jupyter: Visualizing Data by Connecting Jupyter Notebooks and Apache Drill, and follow the instructions to install it! Please feel free to comment here, or pass on ideas and bug reports via the GitHub repository to help improve this project.


This blog post was published August 24, 2018.
