Scalable Machine Learning on the MapR Data Platform via SparkR and H2O


This is the second installment in our blog series about deep learning. In this series, we discuss deep learning technology, available frameworks and tools, and how to scale deep learning using big data architecture. Read the first blog post: "Deep Learning: What Are My Options?"

Introduction

With the release of Spark 2.0, R users can not only process large-scale datasets that a single machine cannot handle by leveraging cluster resources, but also apply the advanced algorithms available in Spark MLlib and H2O. Most importantly, all of this can be done from within RStudio, a very popular tool among analysts and data scientists.

This blog post is intended to provide a step-by-step guide on how to connect to the MapR Data Platform from R in yarn-client mode via sparklyr. In particular, it focuses on submitting Spark jobs from RStudio on a Mac, running machine learning models on the cluster, and collecting the output to the local machine for visualization.

Prerequisites:

  1. MapR Sandbox, with Apache Spark (2.0 or above) installed
  2. Install the MapR client on Mac OS X
  3. Install R and RStudio on Mac OS X
  4. Install Sparkling Water on Mac OS X
  5. Install all required R packages

Getting Started:

  1. Make sure you have the MapR client and Spark installed on your machine by testing the Hadoop command line and the Spark shell (e.g., Spark 2.0.1):

/opt/mapr/bin/hadoop fs -ls /
/opt/mapr/spark/spark-2.0.1/bin/spark-shell --master yarn

You are almost there if you can see the outputs below after executing these commands.

Testing MapR client by using Hadoop fs -ls:

Picture 1

Testing Spark by launching spark-shell:

Picture 2

  2. Make sure that the following R packages are installed on your machine (see the install sketch below if any are missing): sparklyr, dplyr, DBI, h2o, and rsparkling.
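
If any of these packages are missing, a minimal installation sketch (all are available from CRAN; no specific versions are pinned here) is:

# install the required R packages from CRAN
install.packages(c("sparklyr", "dplyr", "DBI", "h2o", "rsparkling"))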

In this blog post, we are going to use the German credit data, which is available in the UCI Machine Learning Repository. This dataset classifies people, described by a set of attributes, as good or bad credit risks.

Environment Setup and Data Loading

The following R script demonstrates how to set environment variables and create a Spark connection in RStudio. This setup enables users to make use of Spark MLlib from within the R environment.

Sys.setenv(LC_CTYPE = "en_US.UTF-8")
Sys.setenv(LC_ALL = "en_US.UTF-8")
Sys.setenv(SPARK_HOME = "/opt/mapr/spark/spark-2.0.1")
Sys.setenv(HADOOP_CONF_DIR = "/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop")

library(sparklyr)
library(dplyr)
library(DBI)

sc <- spark_connect("yarn-client")

Alternatively, if one would like to set up an H2O cloud inside the Spark cluster and explore the machine learning algorithms in Sparkling Water (H2O), the script below will do so.

Sys.setenv(LC_CTYPE = "en_US.UTF-8")
Sys.setenv(LC_ALL = "en_US.UTF-8")
Sys.setenv(SPARK_HOME = "/opt/mapr/spark/spark-2.0.1")
Sys.setenv(HADOOP_CONF_DIR = "/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop")

library(sparklyr)
library(h2o)
options(rsparkling.sparklingwater.version = "2.0.5")
library(rsparkling)
library(dplyr)

sc <- spark_connect("yarn-client")

The German credit data has no header row, so we need to specify the column names as well as the column types via the script below.

named_vct_colclasses <- rep("double", 21)   # all 21 columns are numeric

names(named_vct_colclasses) <- c("creditability", "balance", "duration", "history", "purpose",
                               "amount", "savings", "employment", "instPercent", "sexMarried",
                               "guarantors", "residenceDuration", "assets", "age",
                               "concCredit", "apartment", "credits", "occupation",
                               "dependents", "hasPhone", "foreign")

Next, the German credit data (csv format) will be loaded into a Spark DataFrame for further manipulation.

german_credit <- spark_read_csv(sc, "table_credit", "maprfs:///user/yic1/germancredit.csv",
                                header = FALSE,
                                columns = named_vct_colclasses,
                                infer_schema = FALSE)

It may take a while, depending on the data size. Once it's finished, one can get a rough idea of what the data looks like by viewing the Spark DataFrame german_credit. It has 1,000 records; the preview below displays only the first 10 rows and 10 columns.

Picture 3
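
If you prefer to confirm the record count from within RStudio rather than eyeballing the preview, a quick check such as the following should work (a sketch using standard dplyr verbs that sparklyr translates to Spark SQL):

# count the records and preview the first rows of the Spark DataFrame
german_credit %>% summarise(num_records = n())   # should report 1000 records
head(german_credit, 10)                          # preview the first 10 rows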

Data Exploration and Feature Engineering

In order to gain a good understanding about data, we can do a quick exploration with the help of dplyr functionality.

german_credit %>% select(creditability, amount) %>% group_by(creditability) %>%
                  summarise(num = n(), max_amt = max(amount), min_amt = min(amount))

For example, the above script gives the number of people in each credit risk group as well as the minimum and maximum credit amount. Seven hundred people are in the good credit risk group, and 300 people are in the bad credit risk group. People in the good credit risk group tend to have a lower credit amount than those in the bad credit risk group. Please keep in mind that all these operations happen on the MapR Sandbox, rather than on my Mac; the Mac only serves as the point from which tasks are submitted to the MapR cluster.

Picture 4

We intend to build a model to predict creditability. Before doing so, we need to prepare some features. For instance, "amount" is a continuous variable, and we want to convert it into distinct ranges (buckets) for model building.

df_credit <- ft_bucketizer(german_credit, input_col = 'amount',
                           output_col = 'amount_disc', splits = c(0, 1000, 5000, 10000, 20000))

Checking the Spark DataFrame df_credit shows that the new feature "amount_disc" has four distinct categories. The following query

df_credit %>% select(amount_disc) %>%
              group_by(amount_disc) %>%
              summarise(num = n()) %>%
              arrange(amount_disc)

gives the number of people in each category, as shown below.

Picture 5

Model Building and Evaluation

After extracting the features, the data are divided into a training set and a testing set, with a training-to-testing ratio of 3:1. Please note that the new feature "amount_disc" is used in the decision tree model instead of "amount."

partition_credit <- sdf_partition(df_credit, training=0.75, testing=0.25, seed = 1099)

response_variable <- "creditability"
feature_list <- c("balance", "duration", "history", "purpose", "amount_disc", "savings",
                  "employment", "instPercent", "sexMarried", "guarantors", "residenceDuration",
                  "assets", "age", "concCredit", "apartment", "credits", "occupation",
                  "dependents", "hasPhone", "foreign")

model_dt <- partition_credit$training %>%
            ml_decision_tree(response_variable, feature_list, type = "classification")

Please note that this model-building workload leverages the entire cluster's resources and is highly scalable for large datasets. Next, we will use the testing dataset to evaluate the performance of the model (e.g., by plotting a ROC curve).

pred_credit <- sdf_predict(model_dt, partition_credit$testing) %>%
               select(creditability, prediction) %>%
               collect()

The actual and predicted values of creditability are collected into the memory of my Mac, so that we can make full use of other CRAN libraries to plot graphs or compute additional metrics.

## Use ROCR to compute the metrics and ggplot2 to plot the ROC curve
library(ggplot2)
library(ROCR)

pred <- prediction(pred_credit$prediction, pred_credit$creditability)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
auc <- performance(pred, measure = "auc")
auc <- auc@y.values[[1]]

roc.data <- data.frame(FPR = unlist(perf@x.values),
                       TPR = unlist(perf@y.values),
                       model = "Decision Tree")

ggplot(roc.data, aes(x=FPR, ymin=0, ymax=TPR)) +
 geom_ribbon(alpha=0.2) +
 geom_line(aes(y=TPR)) +
 ggtitle(paste0("ROC Curve w/ AUC=", auc))

One of the important metrics, AUC (area under the curve), is 0.697 in our case. One may refer to the ROCR package for more metrics.

Picture 6
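
As a quick illustration of what else ROCR offers, a sketch like the one below computes accuracy and precision across cutoffs (the measure names "acc" and "prec" come from the ROCR documentation):

## a few additional metrics from ROCR
acc  <- performance(pred, measure = "acc")    # accuracy at each cutoff
prec <- performance(pred, measure = "prec")   # precision at each cutoff
max(unlist(acc@y.values), na.rm = TRUE)       # best accuracy over all cutoffs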

If one believes that the AUC for the decision tree is not good enough and would like to apply a Gradient Boosting Machine (GBM), a popular classification algorithm in H2O, the first thing to do is convert the Spark DataFrame to an H2OFrame.

df_credit_hf <- as_h2o_frame(sc, df_credit)
df_credit_hf[,response_variable] <- as.factor(df_credit_hf[,response_variable])

Next, we will partition the data within H2O instead of Spark and train a GBM model using an H2OFrame.

# Split the H2OFrame into training and testing sets; the default ratio is 3:1
splits <- h2o.splitFrame(df_credit_hf, seed = 1)

# Train an H2O GBM using the training H2OFrame
fit_gbm <- h2o.gbm(x = feature_list,
                   y = response_variable,
                   training_frame = splits[[1]],
                   learn_rate = 0.01,
                   seed = 1)

print(fit_gbm)

While one can certainly fine-tune some parameters in h2o.gbm via a "grid search" approach to achieve better results, that is not the focus of this post. We are going to evaluate the model using the testing dataset, and the AUC increases to 0.79.

##evaluate the performance of the GBM using a test set.
perf_h2o <- h2o.performance(fit_gbm, newdata = splits[[2]])
print(perf_h2o)
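
For readers who do want to experiment with tuning, a minimal grid-search sketch might look like the following (the hyperparameter values are illustrative assumptions, not tuned settings):

## a hypothetical hyperparameter grid for the GBM
hyper_params <- list(learn_rate = c(0.01, 0.05, 0.1),
                     max_depth  = c(3, 5, 7))

gbm_grid <- h2o.grid("gbm",
                     x = feature_list,
                     y = response_variable,
                     training_frame = splits[[1]],
                     hyper_params = hyper_params,
                     seed = 1)

## list the grid's models, sorted by AUC
h2o.getGrid(gbm_grid@grid_id, sort_by = "auc", decreasing = TRUE)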

Finally, we make predictions on the testing dataset and convert those predictions from an H2OFrame to a Spark DataFrame.

##prediction
pred_hf <- h2o.predict(fit_gbm, newdata = splits[[2]])
head(pred_hf)

Picture 7

## convert an H2OFrame into a Spark DataFrame
pred_sdf <- as_spark_dataframe(sc, pred_hf)
head(pred_sdf)

Picture 8
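
If you would like to persist these predictions back to the cluster, sparklyr's spark_write_csv() is one option; the output path below is hypothetical:

## write the predictions back to MapR-FS (hypothetical output path)
spark_write_csv(pred_sdf, "maprfs:///user/yic1/germancredit_predictions")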

Conclusion

This blog post demonstrates how to connect to the MapR Data Platform from RStudio via sparklyr and leverage the platform's resources to offload data manipulation workloads and build machine learning models. Spark MLlib and H2O have a rich set of machine learning algorithms for large-scale datasets. The sparklyr package brings tremendous value to data practitioners familiar with R, giving them an opportunity to embrace Apache Spark in the big data era while continuing to use their preferred tools.

As an enterprise-grade data platform, MapR enables data scientists to perform analytics on consistent data in both development and production environments. Its unique features, such as snapshots and mirroring, make it possible to trace the source data used to build a model. Compliance in model building often means having traceability and verification of the source data, and it is required in many organizations (e.g., banking, insurance). MapR also provides multi-tenancy, which allows different departments within an organization to use a single, secure platform for the development of reliable, scaled, and converged applications.

Additional Resources

Read Part 1 blog "Deep Learning: What Are My Options?"

Read blog "TensorFlow on MapR Tutorial: A Perfect Place to Start"

Watch video "Keeping Big Data Containers Lightweight" by Ted Dunning


This blog post was published May 03, 2017.