A few months ago, I created the first XML plugin for Apache Drill. The idea behind the plugin is simple: Since Apache Drill already has great support for JSON, why not convert the XML documents to JSON, and feed the information into the JSON driver for further processing and presentation in Apache Drill? In this blog post, I’ll show you how I accomplished this.
I already had a SAX-based XML-to-JSON parser that I'd written for a demo, so armed with the source code for Apache Drill, I set out to try my ideas. One hour later, I had a first implementation that compiled and hooked into Drill the way I wanted; it worked the first time! Extending Apache Drill is simple when you can reuse a base written by genius developers. I'm not a developer myself, so if I can do it, so can you.
My code was far from perfect, however; there were lots of errors because I had not geared the parser towards the JSON format Apache Drill expects. Since it wasn't really useful, I forgot about the project and kept on with my daily work at MapR as a Systems Engineer.
A week ago, I was asked to test whether the Drill plugin could do some magic with some specific XML documents for a customer. Since I had worked a lot with Apache Spark and Apache Spark XML since writing the plugin for Apache Drill, I had some new ideas to bring into the code. For instance, keeping attributes with the @ sign, and keeping values in tags as #value. Once I looked at the code and saw how Drill reacted to my generated JSON, I decided to rewrite the XML plugin to generate better JSON and to support more XML documents. Now I can safely say that the investment was well worth the effort.
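To make the @ and #value conventions concrete, here is a minimal sketch of a SAX-based XML-to-JSON conversion. It is written in Python rather than the Java used by the plugin, and it is not the plugin's code: the class name XmlToJsonHandler, the function xml_to_json, and the choice to collect repeated sibling tags into a list are all assumptions made for illustration.

```python
import json
import xml.sax


class XmlToJsonHandler(xml.sax.ContentHandler):
    """Builds a nested dict from SAX events (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.root = {}
        # Stack of (node, text_chunks); the bottom entry wraps the document.
        self.stack = [(self.root, [])]

    def startElement(self, name, attrs):
        # Attributes keep their name, prefixed with "@".
        node = {"@" + key: value for key, value in attrs.items()}
        parent = self.stack[-1][0]
        if name in parent:
            # Assumption: repeated sibling tags are collected into a list.
            if not isinstance(parent[name], list):
                parent[name] = [parent[name]]
            parent[name].append(node)
        else:
            parent[name] = node
        self.stack.append((node, []))

    def characters(self, content):
        # SAX may deliver text in several chunks, so buffer them.
        self.stack[-1][1].append(content)

    def endElement(self, name):
        node, chunks = self.stack.pop()
        text = "".join(chunks).strip()
        if text:
            # Text inside a tag is stored under "#value".
            node["#value"] = text


def xml_to_json(xml_text):
    """Parse an XML string and return the converted structure as a dict."""
    handler = XmlToJsonHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.root


sample = '<book id="42"><title lang="en">Drill</title><tag>a</tag><tag>b</tag></book>'
print(json.dumps(xml_to_json(sample)))
# {"book": {"@id": "42", "title": {"@lang": "en", "#value": "Drill"}, "tag": [{"#value": "a"}, {"#value": "b"}]}}
```

Keeping attributes and text under distinct keys means an element can carry both without a name clash, which is the same reason Spark XML uses a similar convention.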
Based on my own tests, the plugin can query almost any XML file and return a workable JSON document that Drill understands. So far, I have tested it with network data, POS data, some European Union data, Excel XML sheet data, database logs in XML, and even the Mondial data sample in XML form, and all of them can be queried directly with Drill. There's a long way to go before it will be a central piece of Apache Drill, and I am sure the Drill engineers have more clever ways of solving many of the tasks I have dealt with in the code. However, it shows some of the capabilities that may soon be part of Apache Drill.
Since I made some small modifications to the JsonReader to be able to hook in my code, you need to run my Apache Drill version in order for the plugin to work. The code can be found here:
Compile the project using mvn (you may have to bump up the memory with MAVEN_OPTS in order for the compilation to go through):
mvn clean package -DskipTests
Once successfully compiled, move the contrib/storage-xml/target/drill-xml-storage-1.7.0-SNAPSHOT.jar into the jars/3rdparty folder of the Apache Drill distribution you just built.
In order to configure XML support you just add:
to the formats section of your storage config for dfs and off you go. If the plugin was successfully registered, you will be able to update the storage config, and after that, you can query XML documents.
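Once the format is registered, one way to try a query from a script is Drill's REST interface. The sketch below is only an illustration under a few assumptions: a Drillbit running locally with the web port at its default of 8047, and a hypothetical file /tmp/books.xml that your dfs storage plugin can see. The helper names build_query_request and run_query are mine, not part of Drill.

```python
import json
import urllib.request

# Assumption: local Drillbit with the default web port.
DRILL_URL = "http://localhost:8047/query.json"

# Hypothetical path; replace with an XML file visible to your dfs plugin.
EXAMPLE_SQL = "SELECT * FROM dfs.`/tmp/books.xml` LIMIT 10"


def build_query_request(sql, url=DRILL_URL):
    """Build (but do not send) a POST request for Drill's REST query endpoint."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, url=DRILL_URL):
    """Send the query to Drill and return the decoded JSON response."""
    with urllib.request.urlopen(build_query_request(sql, url)) as response:
        return json.load(response)


# Usage (requires a running Drillbit):
# result = run_query(EXAMPLE_SQL)
# print(json.dumps(result, indent=2))
```

The same SELECT statement can of course be run from the Drill shell or the web console instead; the REST call is just convenient for scripted checks.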
In this blog post, I summarized the steps needed to use the XML plugin for Apache Drill. If you have any questions about using this plugin, please ask them in the comments section below.