Using StreamSets and MapR Together in Docker

In this post, I demonstrate how to integrate StreamSets with MapR in Docker. This is made possible by the MapR Persistent Application Client Container (PACC). The fact that any application can use MapR simply by mapping /opt/mapr through Docker volumes is really powerful! Installing the PACC is a piece of cake, too.

Introduction

I use StreamSets a lot for creating and visualizing data pipelines. I recently discovered that I've been installing StreamSets the hard way, by downloading the tar installer. Now I run it in Docker, and I like the isolation and reproducibility that provides.

To use StreamSets with MapR, the mapr-client package needs to be installed on the StreamSets host. Alternatively, and this is the important part, you can run a separate CentOS Docker container that has the mapr-client package installed, then share /opt/mapr as a Docker volume with the StreamSets container. I like this approach because the MapR installer (available as a download from MapR) can configure a mapr-client container for me! MapR calls this container the Persistent Application Client Container (PACC).

Here is the procedure I used to create and configure the PACC and StreamSets in Docker:

Start the MapR Client in Docker

Here's a short video showing how to create, configure, and run the PACC:

For more information about creating the PACC image, see https://mapr.com/docs/home/AdvancedInstallation/CreatingPACCImage.html.

Here are the steps I used for creating the PACC:

Start StreamSets in Docker

In another terminal session, start the StreamSets Docker container with the following command:
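A minimal sketch of such a command is below. It assumes the PACC container is named mapr-client (a hypothetical name; use whatever `docker ps` shows for yours), that the PACC exposes /opt/mapr as a Docker volume, and that you want the public streamsets/datacollector image with its web UI on port 18630. The script only prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical name of the running PACC container; check `docker ps` for yours.
PACC_CONTAINER="${PACC_CONTAINER:-mapr-client}"
# Assumed image and container name for StreamSets Data Collector.
SDC_IMAGE="${SDC_IMAGE:-streamsets/datacollector}"
SDC_NAME="${SDC_NAME:-sdc}"

# --volumes-from mounts every volume of the PACC container (including
# /opt/mapr, if the PACC exposes it as a volume) into the new container.
SDC_CMD="docker run -d --name ${SDC_NAME} -p 18630:18630 --volumes-from ${PACC_CONTAINER} ${SDC_IMAGE}"

if [ "${DRY_RUN:-1}" = "1" ]; then
  echo "$SDC_CMD"   # print only, so the command can be reviewed first
else
  $SDC_CMD          # intentionally unquoted so the string splits into arguments
fi
```

Once the container is up, `docker exec sdc ls /opt/mapr` is a quick way to confirm that the MapR client directory is visible from the StreamSets side.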

Normally, we would need to install the MapR client on the StreamSets host, but since we've mapped /opt/mapr from the PACC via Docker volumes, the StreamSets host already has it!

Now you need to go to the StreamSets Package Manager and install the MapR libraries:

You'll see several MapR packages in StreamSets.

  • MapR 6.0.0
  • MapR 6.0.0 MEP 4
  • MapR Spark 2.1.0 MEP 3

You'll want to install the first one, "MapR 6.0.0." That package lets you use the MapR Distributed File and Object Store, MapR Database, and MapR Event Store. If you want Hive and cluster mode execution, then install "MapR 6.0.0 MEP 4" as well as "MapR 6.0.0." If you want Spark, then also install "MapR Spark 2.1.0 MEP 3."

For more details on why the MapR package was split up like this, see this particular commit: https://github.com/streamsets/datacollector/commit/9452a03489ddf8ae2af81be9afaa904c7e766a55#diff-fd75725ca8cdddff01e7533e9b740e44

After you install the package, don't forget to run the setup-mapr script and the related steps described in the setup guide.

You'll be prompted to restart StreamSets. After it's restarted, run these commands to finish the MapR setup:
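As a rough sketch of what those commands might look like when StreamSets runs in a container: the setup-mapr command and the SDC_HOME, SDC_CONF, MAPR_HOME, MAPR_VERSION, and MAPR_MEP_VERSION variables are taken from my reading of the StreamSets MapR setup guide, while the container name (sdc) and the paths are assumptions that you should check against your own image. As written, the script prints the command; set DRY_RUN=0 to run it.

```shell
#!/bin/sh
SDC_NAME="${SDC_NAME:-sdc}"                            # hypothetical container name
SDC_DIST="${SDC_DIST:-/opt/streamsets-datacollector}"  # assumed install path (often versioned)
SDC_CONF="${SDC_CONF:-/etc/sdc}"                       # assumed configuration directory

# Run setup-mapr as root inside the container, passing its inputs as
# environment variables rather than answering interactive prompts.
SETUP_CMD="docker exec -u 0"
SETUP_CMD="$SETUP_CMD -e SDC_HOME=$SDC_DIST -e SDC_CONF=$SDC_CONF"
# MAPR_MEP_VERSION is only relevant if a MEP stage library was installed.
SETUP_CMD="$SETUP_CMD -e MAPR_HOME=/opt/mapr -e MAPR_VERSION=6.0.0 -e MAPR_MEP_VERSION=4"
SETUP_CMD="$SETUP_CMD $SDC_NAME $SDC_DIST/bin/streamsets setup-mapr"

if [ "${DRY_RUN:-1}" = "1" ]; then
  echo "$SETUP_CMD"   # print only, so the command can be reviewed first
else
  $SETUP_CMD          # intentionally unquoted so the string splits into arguments
fi
```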

Restart StreamSets again from the gear menu.

When it comes up, you will be able to use MapR in StreamSets data pipelines. Here's a basic pipeline example that saves the output of tailing a file to a file on the MapR Distributed File and Object Store:


This blog post was published August 01, 2018.