Now You Can Handle Large JSON-based Data Sets in Hadoop or Spark


Handling large JSON-based data sets in Hadoop or Spark can be a project unto itself: endless hours of complicated transformations, extractions, wrangling the nuances of database connectors, and flattening till the cows come home. The “miscellaneous pipeline massaging” needed just to get the data into a place where you can start doing meaningful things with it in Python or Java can take weeks or months to get right. In the early phases of a POC, many of us have seen big data projects forced into unnatural acts with a MySQL or SQLite database, or the overlaying of other databases onto a Hadoop cluster, just to show results quickly before thinking about what happens when things get Really Big.
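To make the flattening chore concrete, here's a minimal, self-contained Python sketch. The `flatten` helper and the sample record are purely illustrative (they aren't part of any MapR library); this is the kind of transformation a relational store forces on you before a nested document will fit into rows and columns:

```python
import json

def flatten(doc, prefix=""):
    """Recursively flatten a nested JSON document into dot-separated
    column names, the shape a relational table expects."""
    flat = {}
    for key, value in doc.items():
        name = prefix + "." + key if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

record = json.loads('{"user": {"name": "ada", "address": {"city": "SF"}}, "score": 7}')
print(flatten(record))
# {'user.name': 'ada', 'user.address.city': 'SF', 'score': 7}
```

Multiply this by arrays, optional fields, and schema drift across files, and the appeal of a database that stores the document as-is becomes clear.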

On the other hand, this more quotidian phase of data science can’t be overlooked; done well, it pays serious dividends later, in the form of being able to quickly add data to later phases of the pipeline and of more agility in responding to changes in the input. JSON can arrive at your cluster from lots of different places, and it’s becoming one of the most widely used ways of representing a semi-structured collection of data fields. Top sources include REST APIs, networks of sensors or devices, and data that originated in other formats. A recent Kaggle project involved web pages converted to JSON with the much-loved Beautiful Soup Python package.

And yet…I wanted to do more than complain in this post. A couple of weeks ago at Strata, we released OJAI™, which greatly simplifies application development on top of Hadoop if you have JSON data sets in your pipeline. You can now use MapR Database as a JSON document database, and it lives in your Hadoop cluster, thus benefiting from all of the other advantages of deploying it there: easy scale out, replication, snapshots (if you’re using MapR), multi-tenancy, and others. It’s the ideal place to start (or to grow) if you’re building a Hadoop or Spark application that needs to persist, manipulate, access or otherwise make sense of JSON documents.

If you’re developing an application like this or already have one, check out the video below, which shows how to build a dashboard application with the new document capability in MapR Database. This particular example uses Python to explain how to get started with the API, but language bindings are already available for multiple languages.

Hungry for more? Try it on your own data:

  • Head over to the docs for pointers on getting an instance of the database and using it with OJAI. It only takes a few minutes to get the DB up and running.
  • There’s a VM, a Docker image, and an AWS AMI available.
  • To talk to the database in Python, you can simply run ‘pip install maprdb’ and see the README on GitHub for instructions. The video above also has some code examples.
  • If you’re a Java user, check out this video for an introduction and examples.
  • Node.js bindings for JavaScript are also available on GitHub.

If you’re earlier in the process and are looking around for options, check out the datasheet on the web. Whether you’re using other databases with row/column transformations on JSON, or simply using pickle or lots of text files, you will find this to be a much better solution.
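For reference, the “lots of text files” pattern usually means newline-delimited JSON, one document per line. A minimal stdlib sketch of reading it back (the helper and sample data are my own illustration, not part of any MapR tooling):

```python
import io
import json

def read_json_lines(stream):
    """Parse newline-delimited JSON: one document per line,
    skipping blank lines."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

# Simulate one of those text files with an in-memory stream.
sample = io.StringIO('{"id": 1, "tag": "a"}\n{"id": 2, "tag": "b"}\n')
docs = list(read_json_lines(sample))
print(docs)
# [{'id': 1, 'tag': 'a'}, {'id': 2, 'tag': 'b'}]
```

It works, but you get no indexing, no secondary access paths, and no multi-writer story; that's the gap a document database in the cluster fills.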

If you have any questions about handling large JSON-based data sets in Hadoop or Spark, ask them in the comments section below.

This blog post was published November 03, 2015.
