How To: Run the MapR Data Science Refinery from an Edge Node


Recently, MapR launched the MapR Data Science Refinery, a novel way to deliver data science functionality and connectivity for your MapR Data Platform.

One of the great advantages of this is the ability to deploy the workspace from wherever you choose to do your work: an edge node, a cloud instance, or even your personal laptop!

Below are the steps required to run it from an edge node. The node can be an on-premises server or a cloud/VM-deployed edge node; the only requirement is that a supported flavor of Linux be installed on the node you intend to use. The supported operating systems are:

  • CentOS 7.x
  • Ubuntu 14
  • Ubuntu 16

First, you need to install and start the Docker environment for your operating system. You'll be given a choice between Docker Community Edition (CE) and Docker Enterprise Edition (EE); either works for this purpose.
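Before moving on, it's worth confirming that the docker CLI actually made it onto your PATH. The snippet below is a convenience sketch of my own (not part of the product) that reports whether Docker is installed:

```shell
# check_docker reports whether the docker CLI is on PATH; if it is
# not, install Docker CE or EE for your distribution first.
check_docker() {
    if command -v docker >/dev/null 2>&1; then
        echo "docker found: $(command -v docker)"
    else
        echo "docker not found: install Docker CE or EE first"
    fi
}

check_docker
```

If Docker is installed but the daemon isn't running, remember to start it (for example, with `systemctl start docker` on systemd-based distributions) before pulling the image.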

Once you have this installed, you need to pull the image into your local Docker image repository. Our Docker Hub repository is located here, and the command to pull the most recent version is:

$ docker pull maprtech/data-science-refinery

After the pull completes, you can confirm that the image exists in your local registry by running:

$ docker images

For a secure cluster, the only other piece you need in place at this point is your MapR-SASL ticket, available somewhere on this host. For the steps to generate this ticket, please see this document:

Administrator's Reference for 'maprlogin'
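By default, maprlogin writes the ticket to /tmp/maprticket_<uid> for the user who ran it. The tiny helper below (a sketch of mine, assuming the default location hasn't been changed by your administrator) computes that path so you can hand it to the docker run command later:

```shell
# default_ticket_path prints the default MapR-SASL ticket location
# for the current user: /tmp/maprticket_<uid>. This assumes default
# maprlogin behavior; your administrator may configure it differently.
default_ticket_path() {
    printf '/tmp/maprticket_%s\n' "$(id -u)"
}

default_ticket_path
```
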

We recommend creating an environment variable file instead of passing each variable into the docker run command individually, as it makes problems easier to spot. Here is an example file, 'env.list', that we pass into the docker run command:

MAPR_HS_HOST=<needed if you're using Pig>
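The line above shows only the Pig-related variable. For context, a fuller env.list might look like the sketch below; treat both the variable names and the values as illustrative placeholders, and consult the Zeppelin Docker parameter documentation referenced in this post for the authoritative list:

```
# env.list -- illustrative values only; see the Zeppelin Docker
# parameter documentation for the authoritative variable list
MAPR_CLUSTER=my.cluster.com
MAPR_CLDB_HOSTS=<CLDB node IP or hostname>
MAPR_CONTAINER_USER=mapr
MAPR_TICKETFILE_LOCATION=/tmp/dsr_ticket
MAPR_HS_HOST=<needed if you're using Pig>
```
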

Next, you simply use the docker run command, passing in the env.list file. For more information on this command and its options, please visit this document:

Understanding Zeppelin Docker Parameters

$ docker run --rm -it --env-file ./env.list --cap-add SYS_ADMIN --cap-add SYS_RESOURCE --device /dev/fuse -p 9995:9995 -p 10000-10010:10000-10010 -v </path/to/ticket/file>:/tmp/dsr_ticket:ro -v /sys/fs/cgroup:/sys/fs/cgroup:ro maprtech/data-science-refinery
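That command has a lot of moving parts, so here is an optional convenience sketch (the function name, default paths, and eval step are my own additions, not part of the product) that assembles it in one place and lets you review it before executing:

```shell
# build_dsr_cmd assembles the docker run invocation for the Data
# Science Refinery image. TICKET_FILE and ENV_FILE are illustrative
# defaults -- point them at your actual MapR-SASL ticket and env.list.
build_dsr_cmd() {
    ticket_file="${TICKET_FILE:-/tmp/maprticket}"
    env_file="${ENV_FILE:-./env.list}"
    echo "docker run --rm -it --env-file $env_file" \
         "--cap-add SYS_ADMIN --cap-add SYS_RESOURCE --device /dev/fuse" \
         "-p 9995:9995 -p 10000-10010:10000-10010" \
         "-v $ticket_file:/tmp/dsr_ticket:ro" \
         "-v /sys/fs/cgroup:/sys/fs/cgroup:ro" \
         "maprtech/data-science-refinery"
}

# Print the command for review; run it with: eval "$(build_dsr_cmd)"
build_dsr_cmd
```
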

That's it! Now you can log into Zeppelin by visiting the UI at the following address:

https://<IP or hostname of host Docker is running on>:9995/

Log in using the credentials that you provided in the docker run command. The authorization for the jobs themselves (whether Spark, POSIX, or JDBC) is provided by your MapR-SASL ticket.

In addition, you can peruse the file system using POSIX or Hadoop syntax from the CLI or Zeppelin. This is made possible by the MapR POSIX Client For Containers, which allows MapR customers to mount their global namespace to their Docker container.

$ ls -la /mapr/
total 3
drwxr-xr-x 10 mapr mapr 9 Nov 27 08:55 .
dr-xr-xr-x 3 root root 1 Dec 16 17:43 ..
drwxr-xr-x 3 mapr mapr 1 Nov 27 08:51 apps
drwxr-xr-x 2 mapr mapr 0 Nov 27 08:48 hbase
$ hadoop fs -ls /
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/mapr/lib/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Found 8 items
drwxr-xr-x - mapr mapr 1 2017-11-27 08:51 /apps
drwxr-xr-x - mapr mapr 0 2017-11-27 08:48 /hbase

Common Problems:

After running the Docker Run command, you see the following error:

Started service mapr-posix-client-container                [FAILED]

This error can be safely ignored as it is a remnant of an issue with the MapR Persistent Application Client Container (PACC).

You're prompted to go to an unsafe site by your web browser when visiting the Apache Zeppelin UI:

This is expected behavior if you haven't installed a trusted SSL certificate for this instance; the browser is simply warning about the certificate.

More troubleshooting information can be found here:

Troubleshooting Data Science Refinery

This blog post was published December 16, 2017.