How To: Using R Studio with the MapR Data Science Refinery


MapR made portability and extensibility design goals for this release, to enable all types of data science teams. This means that, while we don't ship every possible tool that users will want, we have the right structure in place to allow them to install those tools and have them work seamlessly with direct data access to their MapR Data Platform.

For R Studio, this is accomplished through the sparklyr project, which provides the ability to:

  • Connect to Spark from R
  • Perform operations on data in Spark structures and then bring the results into R for analysis and plotting
  • Allow R to leverage Spark's distributed machine learning library (Spark ML)

Due to the design of the DSR container, R Studio will inherit the security configuration of the container, and jobs will be submitted as the user specified by the MapR-SASL ticket or in Docker Run.


In order to access the R Studio GUI from your web browser, you will need to pass a port mapping into Docker Run. By default, R Studio listens on port 8787, which can be passed in as such:

docker run ... -p 8787:8787 ...
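For context, a fuller invocation might look like the following sketch. The image name, cluster name, and user are illustrative assumptions; substitute the values for your own DSR deployment:

```shell
# Illustrative example only: image name, cluster, and user below are
# placeholders for your own DSR deployment values.
docker run -it \
  -p 8787:8787 \
  -e MAPR_CLUSTER=my.cluster.com \
  -e MAPR_CONTAINER_USER=mapruser \
  maprtech/data-science-refinery
```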

There are some considerations on container size when adding projects to DSR. If you aren't planning to use Apache Zeppelin, it might behoove you to remove it from the DSR image in order to keep memory requirements down. Here are the image sizes from my testing with DSR v1.1:

  • Size of DSR with Apache Zeppelin: 6.12 GB
  • Size of DSR with Apache Zeppelin + R Studio: 7.157 GB
  • Size of DSR with R Studio: 6.147 GB

If you want to remove Zeppelin before starting this install, you can do so with the following commands:

rm -rf zeppelin/
rm -rf /opt/mapr/zeppelin

If you plan to use R Studio regularly, my recommendation is to save the container with Docker Commit once your configuration is complete.
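Committing the container looks roughly like this; the container ID and image tag below are placeholders:

```shell
# Find the running container's ID, then commit it to a new image.
docker ps                                    # note the CONTAINER ID
docker commit <container-id> dsr-rstudio:v1  # <container-id> and the tag are placeholders
```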

Install R Studio

To begin with, grab the most recent open source build of R Studio Server for your OS - we're going to use CentOS 7 here:
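The download step can be done with wget from inside the container. The URL below follows R Studio's download pattern for the 1.1.442 build used later in this post; verify the current link on the R Studio download page before running:

```shell
# Fetch the R Studio Server RPM for CentOS/RHEL 7 (1.1.442 build).
# Check the R Studio download page for the current URL.
wget https://download2.rstudio.org/rstudio-server-rhel-1.1.442-x86_64.rpm
```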


Then, install the RPM along with the libcurl, openssl, and xml2 development packages:

sudo yum install rstudio-server-rhel-1.1.442-x86_64.rpm libcurl-devel openssl-devel libxml2-devel
rm rstudio-server-rhel-1.1.442-x86_64.rpm

Log into R Studio

As soon as the install is complete, you should be able to log into R Studio at http://[hostname]:8787, using the credentials that you specified in Docker Run.

Install sparklyr and dplyr

To install the latest versions of sparklyr and dplyr, we recommend doing so via R's devtools package, as this will allow you to pull the most recent builds:
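A sketch of the install commands, run from the R console, is below. The GitHub repository names are the upstream ones for each project at the time of writing; check each project's page if they have moved:

```r
# Install devtools from CRAN, then pull the latest sparklyr and dplyr
# builds directly from their GitHub repositories.
install.packages("devtools")
devtools::install_github("rstudio/sparklyr")
devtools::install_github("tidyverse/dplyr")
```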


These will each take a while to install.

Now you just have to load sparklyr, set SPARK_HOME, and create the Spark connection - here via the Livy server that DSR exposes on port 8998:

library(sparklyr)
Sys.setenv(SPARK_HOME = "/opt/mapr/spark/spark-<version>")  # adjust to your Spark install path
options("sparklyr.verbose" = TRUE)
sc <- spark_connect(master = "http://localhost:8998", method = "livy")

Test Spark Connection

To test that this is working, I recommend loading the built-in Iris dataset as a table into a Spark context:

iris_tbl <- copy_to(sc, iris)
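With the table registered, you can run dplyr verbs that execute inside Spark and then collect the results back into R. A quick sketch, assuming the `sc` connection created above (note that copy_to replaces the dots in the iris column names with underscores):

```r
library(dplyr)

# Group and aggregate inside Spark, then pull the result into a
# local R data frame with collect().
iris_summary <- iris_tbl %>%
  group_by(Species) %>%
  summarise(mean_petal_length = mean(Petal_Length)) %>%
  collect()
```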

Notice that the table now appears in the Connections pane in the upper right of R Studio.

This blog post was published March 23, 2018.
