The MapR Docker Container for Developers


MapR has released a lightweight Docker container that enables developers to conveniently run a single-node MapR cluster on their laptop, so they can access data in the MapR platform directly from IDEs, database clients, and other software development tools. The container is free to use and includes the following components:

  • Analytical engines: Drill and Spark. You can use their respective APIs programmatically or work with command-line tools such as sqlline and spark-submit.
  • MapR Control System: the primary web interface for controlling a MapR cluster. You can also use the maprcli command to manage streams, tables, file system volumes, and more.
  • MapR-XD file system: distributed storage for all data. You can use the hadoop fs command to operate on files and directories.
  • MapR-DB: a NoSQL database for JSON and binary data. You can interact with MapR-DB using the Open JSON Application Interface (OJAI) API, the HBase API, and the command-line tools mapr dbshell and hbase shell.
  • MapR Streams: distributed storage for real-time data. You can publish messages to and consume messages from streams using the Kafka API or the Spark Streaming API, as shown in the sketch after this list.
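
To make the Kafka API point concrete, here is a minimal producer sketch in Java. It assumes a hypothetical stream created ahead of time with maprcli stream create -path /sample-stream (MapR Streams topics are addressed as <stream path>:<topic>) and the MapR Streams client libraries on the classpath; with MapR Streams the client locates the cluster through the MapR client configuration, so no bootstrap.servers setting is needed.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class StreamsProducerSketch {
    public static void main(String[] args) {
        // Serializers are required; cluster discovery comes from the MapR
        // client configuration rather than bootstrap.servers.
        Properties props = new Properties();
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // /sample-stream:events is a hypothetical <stream path>:<topic> name.
            producer.send(new ProducerRecord<>("/sample-stream:events", "hello from the dev container"));
            producer.flush();
        }
    }
}

A consumer looks just like a standard Kafka consumer, subscribed to the same /sample-stream:events topic name.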

Getting Started

Rather than going into detail about how to set up the container and access everything in it, I'll just point to the docs and then illustrate a common workflow that integrates this container into a software development environment.

The installation instructions are in MapR Docs: The MapR Container for Developers.

The following two repositories describe how to load data into files, tables, and streams, and how to process them with Drill and Spark. These tutorials were written specifically with the MapR Container for Developers in mind:

  • Getting Started with MapR-DB JSON
  • Getting Started with MapR Streams

Common Workflows

One of the most common use cases for this container involves running code from an Integrated Development Environment (IDE) that accesses data and invokes analytical engines like Spark or Drill on a MapR cluster. This process typically includes the following steps:

  1. Copy initial datasets to the container or create an application that originates datasets.
  2. Create a MapR-XD volume, MapR-DB table, or MapR Streams topic.
  3. Insert datasets into said volume, table, or stream.
  4. Develop and test application code within the IDE.

Let's look at an example project that shows this workflow from start to finish. We'll use the code examples in Getting Started with MapR-DB JSON, which analyze data in the Yelp Open Dataset. After downloading that dataset, we need to copy it to the MapR-XD file system on the container. Typically, files can be copied to MapR-XD through an NFS mount point, but NFS is not available in the MapR Container for Developers, so we need to use the hadoop fs -put command, like this:

sudo /opt/mapr/bin/hadoop fs -put ~/Downloads/dataset/business.json /tmp

It's easy to confuse the Unix and MapR-XD namespaces in hadoop fs commands, so let me clarify: the first parameter to hadoop fs -put references a path in the standard file system on your laptop, while the second references a volume (i.e., a directory) in the MapR-XD cluster file system. To list files in MapR-XD, use hadoop fs -ls <dir>.
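
Once the file lands in MapR-XD, the engines on the cluster can read it directly. As an illustration, here is a minimal Spark sketch (Java) that loads the file as a DataFrame; it assumes the job is submitted with spark-submit inside the container, where the default file system is MapR-XD, so /tmp/business.json resolves to the cluster path used above. The class name is illustrative.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadBusinessSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ReadBusinessSketch")
                .getOrCreate();

        // /tmp/business.json resolves against the cluster's default file
        // system (MapR-XD) when run on the container.
        Dataset<Row> business = spark.read().json("/tmp/business.json");
        business.printSchema();
        business.select("name", "stars").show(10);

        spark.stop();
    }
}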

After we put business.json in MapR-XD we can import that file into a MapR-DB table, like this:

sudo /opt/mapr/bin/mapr importJSON -idField business_id -src /tmp/business.json -dst /apps/business -mapreduce false

That command saves a MapR-DB table called business in the /apps volume. The default permissions for new tables are secure, but we can make the table easier to access from an IDE by granting public access, like this:

ssh root@localhost -p 2222 "maprcli table cf edit -path /apps/business -cfname default -readperm p -writeperm p"

So far, all the commands shown can run on the Docker host (i.e., in your laptop's Terminal app). However, maprcli commands are not available with the MapR client installed on your laptop, which is why this one is issued via ssh to the container.
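
With the table loaded and its permissions opened up, application code can read it through the OJAI API. Here is a minimal sketch, assuming the MapR-DB OJAI driver is on the classpath; the class name is illustrative, and this is not the exact code from the getting-started repository.

import org.ojai.Document;
import org.ojai.DocumentStream;
import org.ojai.store.Connection;
import org.ojai.store.DocumentStore;
import org.ojai.store.DriverManager;

public class ReadBusinessTableSketch {
    public static void main(String[] args) {
        // "ojai:mapr:" selects the MapR-DB OJAI driver.
        Connection connection = DriverManager.getConnection("ojai:mapr:");
        DocumentStore store = connection.getStore("/apps/business");

        // Print the name and star rating of the first ten businesses.
        int count = 0;
        try (DocumentStream stream = store.find()) {
            for (Document doc : stream) {
                System.out.println(doc.getString("name") + " -> " + doc.getValue("stars"));
                if (++count == 10) break;
            }
        }

        store.close();
        connection.close();
    }
}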

Once data has been loaded into the /apps/business table and its access permissions have been set to public, we can open an IDE and programmatically access that table. For example, we can run the DRILL_001_YelpSimpleQuery program with a Run configuration that looks like this in the IntelliJ IDE:

[Screenshot: IntelliJ Run configuration for DRILL_001_YelpSimpleQuery]

Running that example generates output similar to what's shown below:

[Screenshot: example program output in IntelliJ]

That example ran a SQL query programmatically with the Drill API. We can see and rerun that query from the Drill web console as shown below:

[Screenshot: the query in the Drill web console]
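
For reference, the programmatic side of that flow boils down to a JDBC query against Drill. Here is a minimal sketch, assuming the Drill JDBC driver is on the classpath and that the container maps the drillbit port (31010) to localhost; the query and class name are illustrative rather than the exact getting-started code.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillQuerySketch {
    public static void main(String[] args) throws Exception {
        // Connect directly to the drillbit exposed by the container.
        try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost:31010");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT name, stars FROM dfs.`/apps/business` LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + " -> " + rs.getDouble("stars"));
            }
        }
    }
}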

The following video demonstrates the workflow described above. It shows how an IDE can be used to run and debug an application that uses MapR-DB in the MapR Container for Developers.

[Video: MapR Developer Container walkthrough]

Conclusion

MapR has released a Docker container that allows developers to set up a single-node MapR cluster on their laptop. This makes it easier than ever to connect a development environment to a cluster for accessing analytical engines such as Spark and Drill, as well as the MapR core components for database, streaming, and file storage.

If you'd like to learn more about the MapR Container for Developers check out the following resources:

  1. MapR Docs: The MapR Container for Developers
  2. Getting Started with MapR-DB JSON
  3. Getting Started with MapR Streams

This blog post was published January 15, 2018.