MapR made portability and extensibility design goals for this release, in order to support all types of data science teams. This means that, while we don't ship every possible tool that users will want, we have the right structure in place to allow them to install those tools and have them work seamlessly, with direct data access to their MapR Data Platform.
For R Studio, this is accomplished through the sparklyr project, which provides an R interface to Apache Spark: dplyr-style data manipulation on Spark DataFrames, plus access to Spark's distributed machine learning libraries.
Due to the design of the DSR container, R Studio will inherit the security configuration of the container, and jobs will be submitted as the user specified by the MapR-SASL ticket or in the docker run command.
In order to access the R Studio GUI from your web browser, you will need to pass a port mapping to docker run. By default, R Studio Server listens on port 8787, and this can be passed in as such:
docker run ... -p 8787:8787 ...
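A fuller invocation might look like the sketch below. The image name and the environment variable are illustrative assumptions based on a typical DSR deployment, not exact values; the command is composed as a string so it can be inspected without a Docker daemon present.

```shell
# Sketch of a DSR launch with the RStudio port published.
# DSR_IMAGE and MAPR_CONTAINER_USER are assumptions -- adjust to your deployment.
DSR_IMAGE="maprtech/data-science-refinery"   # hypothetical image name

# Build the command as a string for review; run it with: eval "$DSR_CMD"
DSR_CMD="docker run -d -p 8787:8787 -e MAPR_CONTAINER_USER=$(id -un) $DSR_IMAGE"

echo "$DSR_CMD"
```

The -p 8787:8787 mapping is the piece that matters here; everything else should mirror however you already launch the DSR container.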
There are some considerations on container size when adding projects to DSR. It might behoove you to remove Apache Zeppelin from DSR, if you aren't planning to use it, in order to keep memory requirements down. Here are the specs from my testing with DSR v1.1:
If you want to remove Zeppelin before starting this install, you can do so with the following commands:
rm -rf zeppelin/
rm -rf /opt/mapr/zeppelin
My recommendation is that, if you plan to use R Studio regularly, you save the container with docker commit at the end, once your configuration is complete.
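That save can be sketched with docker commit; the container name and image tag below are hypothetical examples, so substitute your own:

```shell
# Commit the configured DSR container to a reusable image.
# "mapr-dsr" and "dsr-rstudio:v1" are example names, not defaults.
save_dsr() {
  local container="${1:-mapr-dsr}"        # assumed running container name
  local image_tag="${2:-dsr-rstudio:v1}"  # example target image tag
  docker commit "$container" "$image_tag"
}
```

After committing, subsequent docker run commands can start from the saved image instead of repeating the install steps below.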
To begin with, grab the most recent open source build of R Studio for your OS - we're going to use CentOS 7 here:
Then, install from this RPM including libcurl, openssl, and xml2:
sudo yum install rstudio-server-rhel-1.1.442-x86_64.rpm libcurl-devel openssl-devel libxml2-devel
rm rstudio-server-rhel-1.1.442-x86_64.rpm
As soon as the install is complete, you should be able to log into R Studio at http://[hostname]:8787, using the credentials that you specified in docker run.
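If the login page doesn't come up, a quick check from inside the container can confirm that R Studio Server is answering. The helper below is a sketch using curl; the host and port defaults are assumptions based on the mapping above.

```shell
# Return success if RStudio Server answers HTTP on the given host/port.
check_rstudio() {
  local host="${1:-localhost}"  # assumed host
  local port="${2:-8787}"       # RStudio Server's default port
  curl -sf "http://${host}:${port}/" >/dev/null
}
```

A failing check usually means either the service didn't start or the -p port mapping was omitted from docker run.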
To install the latest versions of sparklyr and dplyr, we recommend doing so via the devtools R package, as this will allow you to pull the most recent builds from GitHub:
install.packages("devtools")
devtools::install_github("rstudio/sparklyr")
devtools::install_github("tidyverse/dplyr")
These will each take a while to install, but you should see something like this when they're complete:
Now you just have to set SPARK_HOME, and create the Spark connection:
library(sparklyr)
options("sparklyr.verbose" = TRUE)
Sys.setenv(SPARK_HOME = "/opt/mapr/spark/spark-2.1.0")
sc <- spark_connect(master = "http://localhost:8998", method = "livy")
To test that this is working, I recommend loading the built-in Iris dataset as a table into a Spark context:
library(dplyr)
iris_tbl <- copy_to(sc, iris)
Notice that in the upper right-hand side, we can see this table in the Connections pane: