First In-Hadoop Document Database: MapR Database Is a Big Win for Big Data

There’s good news in the world of NoSQL databases that should put a smile on developers' faces and please business leaders as well, because it means shorter time-to-value. You can now enjoy the ease and flexibility of a document-style database with the power of extreme scalability and performance.

At the recent Strata + Hadoop World big data conference in New York, MapR announced the first in-Hadoop document database, created by adding native JSON (JavaScript Object Notation) support to MapR Database, its top-ranked NoSQL database. That is a big win for anyone who needs to quickly build a wide range of big data applications, including continuous analytics on real-time data. The JSON access layer is known as OJAI™ (Open JSON Application Interface). This new capability for the MapR data platform is not a version of MongoDB, but the document-style approach will be familiar to those who like the ease of Mongo and yet are looking for the extreme scalability of HBase: NoSQL just got better.

Forrester Research ranked MapR Database as the strongest "Current Offering" when compared against 14 other leading NoSQL big data technologies.

Download the full report: The Forrester Wave™: Big Data NoSQL, Q3 2016


A big audience attended the Strata talk by MapR Chief Application Architect Ted Dunning to hear more about how to use this new JSON capability for MapR Database. The talk was titled “Real World NoSQL Schema Design.” Ted talked about three key characteristics that are highly desirable in an effective large-scale database:

  • Expressive – to be able to state the concepts we need
  • Efficient – runs fast on inexpensive hardware
  • Introspectable – enables users to inspect and understand data and schema

The new MapR Database document database with JSON support meets these criteria. With this new style, rows contain fields, which in turn contain primitive types or complex objects or lists, such as JSON lists. Structure is flexible and is not pre-defined. This approach means that as business needs change, developers can easily make additions and adjustments to the database without long delays or the need for administrative intervention. That spells flexibility and ease of development.
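To make the idea of rows holding primitives, nested objects, and lists concrete, here is a minimal sketch using plain Python dictionaries and the standard json module. It does not use OJAI or any MapR API; the record and every field name in it are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical artist document; all field names are illustrative assumptions.
artist = {
    "id": "artist-0001",
    "name": "Example Artist",
    "aliases": ["E. Artist", "The Example"],   # a list of primitive values
    "albums": [                                # a list of nested objects
        {
            "title": "First Album",
            "year": 1999,
            "tracks": [
                {"n": 1, "title": "Intro"},
                {"n": 2, "title": "Song"},
            ],
        },
    ],
}

# A new requirement arrives: attach label details to this one document.
# Nothing about other documents or a table-wide schema has to change.
artist["label"] = {"name": "Example Records", "country": "US"}

# The whole structure round-trips through JSON text unchanged.
encoded = json.dumps(artist)
decoded = json.loads(encoded)
```

The point of the sketch is only that structure lives in each document rather than in a predefined schema, which is what makes later additions cheap.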

Think about what happens with this type of complex nested data: Not only can tables contain complex data objects, the tables can themselves become complex objects in other tables. “Turtles all the way down” as Ted described it.

The document-style NoSQL database with nested JSON data model is highly flexible. Image from Ted Dunning, used with permission.

Ted talked about several use cases that benefit from this powerful, flexible database technology. One was time series data from IoT sensors, a widespread use case. Another showed how much simpler a database design can become with this style of database. Ted demonstrated this advantage using MusicBrainz data, which exhibits many important idioms found in real databases. These idioms include factoring relations into multiple tables to implement column families, linkage tables, and many-to-one relationships. That’s fine, but the question is, how many tables is too many?

For this data set, artists, albums, tracks, and labels are key objects. As Ted explained, a traditional relational data model for this example needs 236 tables just to describe 7 things. That’s definitely too many tables if you want convenience and speed in development and maintenance of the database.

What happens when you shift development to a nested JSON data model such as the one now supported by MapR Database? The number of tables required drops dramatically, to fewer than ten. Needing fewer tables can be a big benefit in many ways, from speed of development to easier administration. When you take into account the real-world messiness that often comes with a new data set (lack of documentation, removal of foreign key relations that could have eased data loading, and simply being unfamiliar with the data), working with fewer tables simplifies the overall process and can save a lot of time.

The expressivity, efficiency and opportunity for introspection offered by the JSON data model result in far fewer tables needed to represent the data for this MusicBrainz example. Image from Ted Dunning, used with permission.
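The way linkage tables disappear under a nested model can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the MusicBrainz schema and not Ted's actual design: the four tiny row sets, their column layouts, and the nest_artists helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical normalized rows standing in for four relational tables.
artist_rows = [(1, "Artist A"), (2, "Artist B")]        # (artist_id, name)
alias_rows = [(1, "A."), (1, "Artist Alpha"), (2, "B.")]  # (artist_id, alias)
album_rows = [(10, "First"), (11, "Second")]            # (album_id, title)
credit_rows = [(10, 1), (11, 1), (11, 2)]               # linkage: (album_id, artist_id)


def nest_artists(artists, aliases, albums, credits):
    """Fold alias rows and credit linkage rows into one document per artist."""
    alias_by_artist = defaultdict(list)
    for artist_id, alias in aliases:
        alias_by_artist[artist_id].append(alias)

    title_by_album = dict(albums)
    albums_by_artist = defaultdict(list)
    for album_id, artist_id in credits:
        albums_by_artist[artist_id].append(
            {"album_id": album_id, "title": title_by_album[album_id]}
        )

    return [
        {
            "id": artist_id,
            "name": name,
            "alias": alias_by_artist[artist_id],    # was a separate table
            "albums": albums_by_artist[artist_id],  # was a linkage table
        }
        for artist_id, name in artists
    ]


artist_docs = nest_artists(artist_rows, alias_rows, album_rows, credit_rows)
```

Here four row sets become a single collection of artist documents. Album titles are copied into each artist that is credited on them, which is the kind of in-lining trade-off discussed next: some duplication in exchange for locality.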

Ted pointed out that the JSON data model is at least as expressive as the original relational model, and many cases are much easier to describe with this approach. As for efficiency, in-lining may increase data size, but locality improves. Sessionizing, on the other hand, substantially decreases data size. In-lining back-references is more efficient than ordinary indexes. In the case of time-series data, for instance, in-lined columnar data results in as much as a 1000x increase in speed.
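For the time-series case, in-lining can be pictured as packing many samples into one document per sensor and time window, with the measurements stored as parallel arrays. The sketch below is only a model of that layout in plain Python. The sample data, the one-hour window, and the inline_by_window helper are assumptions made for illustration, and the code makes no attempt to reproduce the speedup figure quoted above.

```python
# Hypothetical raw samples, one tuple per reading: (sensor_id, epoch_seconds, value)
samples = [
    ("s1", 0, 20.0),
    ("s1", 1, 20.5),
    ("s1", 3600, 21.0),
    ("s2", 2, 7.0),
]


def inline_by_window(rows, window_seconds=3600):
    """Pack samples into one document per (sensor, window) with columnar arrays."""
    docs = {}
    for sensor, t, value in sorted(rows, key=lambda r: (r[0], r[1])):
        start = t - t % window_seconds  # start of the window containing t
        doc = docs.setdefault(
            (sensor, start),
            {"sensor": sensor, "window_start": start, "offsets": [], "values": []},
        )
        doc["offsets"].append(t - start)  # seconds since the window start
        doc["values"].append(value)
    return list(docs.values())


window_docs = inline_by_window(samples)
```

With real sensor data a single window document would hold thousands of samples, so a scan over one sensor and time range touches far fewer records than a row-per-sample layout.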

Once you’ve adopted the new JSON data model, with all its advantages, will that limit your ability to use traditional SQL query syntax and SQL-based BI tools? Not if you also add another powerful new tool to the mix: Apache Drill, an open source, highly scalable and flexible query engine that supports standard SQL. Ted illustrated this approach by showing an example Drill query for “finding Elvis” in the MusicBrainz data, as follows.

Find Discs where Elvis was credited:

select distinct album_id, name from (
  select id album_id, name, flatten(credit)
  from release
) albums
join (
  select distinct artist_id from (
    select id artist_id, flatten(alias) from artist
    where name like 'Elvis%Presley'
  )
) artists
using (artist_id)
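The flatten calls in the query are what turn nested lists back into ordinary rows that SQL can join. As a rough mental model only, and not Drill's implementation, flattening a list field emits one output row per list element and repeats the other columns. The Python sketch below imitates that behavior on hypothetical release records; the data and field contents are made up for illustration.

```python
def flatten(rows, field):
    """Yield one copy of each row per element of row[field].

    A simplified model of flattening a repeated field. In this sketch,
    rows whose list is empty or missing produce no output.
    """
    for row in rows:
        for element in row.get(field, []):
            out = dict(row)       # copy so the input row is not modified
            out[field] = element  # replace the list with a single element
            yield out


# Hypothetical release documents; "credit" holds artist ids.
releases = [
    {"id": 10, "name": "First", "credit": [1, 2]},
    {"id": 11, "name": "Second", "credit": [2]},
    {"id": 12, "name": "Uncredited", "credit": []},
]

flat = list(flatten(releases, "credit"))

# Once flattened, "which releases credit artist 2?" is a simple filter.
credited_to_2 = [row["name"] for row in flat if row["credit"] == 2]
```

After this step each row carries a single credit value, so matching releases against a list of artist ids is an ordinary equality join.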

Apache Drill builds a bridge between the new opportunities in emerging data models such as nested JSON and Parquet and the familiar and widespread expertise, BI tools and applications that are based on standard SQL.

In real-life business settings, the best solutions are rarely about a single tool or approach. Real success lies in recognizing powerful designs that combine appropriate technologies in highly efficient ways. Using the nested JSON data model alongside computational frameworks such as Spark and Drill provides an excellent foundation for modern development that meets competitive business goals.

Additional resources:

The MapR Database document database is now available as a developer preview:
Apache Drill on MapR sandbox plus free training:

This blog post was published October 30, 2015.