Demo: Apache Spark on MapR with MLlib

Contributed by

12 min read

Editor's Note: In this demo we are using Spark and PySpark to process and analyze the data set, calculate aggregate statistics about the user base in a PySpark script, persist all of that back into MapR Database for use in Spark and Tableau, and finally use MLlib to build logistic regression models.

Here's the transcription:

Hi, I'm Nick Amato with MapR. I'm going to show you an example of how to use Apache Spark with the MapR Hadoop Distribution to build a simple real-time dashboard, and I hope you can use this example for thinking about using Spark on MapR with your own data. The scenario is a service where are offering music streaming to customers. This will give us a couple of example datasets for Spark. On the left side of this diagram, we have customers connecting to our service, maybe on their iPhone or Android device. This leads us to our first dataset, which is the individual events of everything that's happening, like what user listened to what track, where were they when it happened, and things like that.

This could be a typical kind of freemium type of service where we have both free and paying customers at different levels. I will use Spark and PySpark to process and analyze the dataset. In a PySpark script, I will calculate some aggregate statistics about the user base that we would like to know, persist all of that back to MapR Database so we can use what we computed in Spark in Tableau, and finally use MLlib to build a couple of logistic regression models. Once we have the models, let's say it's the end of the quarter and we want more paying customers. We can use the models to pick the right customers to play back and offer for an upgrade to the paid service.

Ultimately, we want more paying customers, but we want to minimize the cost of ads. Of course we don't want to drive any customers away, so this example will allow us to do that. All of the code and datasets I talk about here can be downloaded in this get help repo if you would like to try it out yourself or use any of the code. Here we go.

You can see I have a three-node Hadoop cluster with one node running the Spark master service, and two running as Spark slaves. Since this is a MapR enterprise edition cluster, it has the integrated NoSQL database, MapR Database. I have set up three tables, each with one column family. That's where we are going to be writing the output of Spark to build our dashboard. One thing to notice here is that these tables are paths into the MapR Distributed File and Object Store, and they are integrated with Hadoop, so with MapR Database I'm going to get the best of both worlds. I'm going to get all of the benefits of the MapR random redrive file system, things like consistent snapshots, volume placement, mirroring to other clusters, and things like that.

We get around 2-7 times faster operations in Hbase, and I can use all of the same APIs, so I don't need to change any of my application code. The other thing I have here is Apache Drill configured, and that allows me to use SQL from Tableau to query the data, both in MapR Database and in the file system. It's also going to make it very easy to bring in other data sources without doing a lot of up-front work or knowing the schema.

Okay, let's look at the datasets. We have a couple of them. The input into our dashboard is the tracks file. This is about a million lines of data and represents what we might see from individual events from listeners. Every time a customer listens to a track, we align this file. The fields in this file are the event ID, the customer ID, which is a foreign key into a customer table, the track ID, that's another foreign table into a music table, the local time and date when they listened to the track. This field is zero or one, indicating whether or not they were listening on a mobile device, and finally a ZIP code indicating where they were physically at the time they listened.

The next dataset is the customer database stored in MapR Database, and this contains things like the ID, when the customer first enrolled, name, address, things like that. No surprises there. Finally, we have a clicks file that shows which customers clicked on previous offers containing which offer it was and when. We will use that to train our model.

You can see the general flow of this example is like this. We first make what's called a pair RDD out of the input data, calling our function to do that here, and then reduce by key to consolidate all of those individual lines back into another RDD. This gives us sort of a key value form where the user is the key, and the values are all of the track events for that user. For each user, we will call Map values and compute the statistics by user, and that turns into another RDD, which is also keyed by user. Now we will use a part of MLlib in Spark called Call Stats. This is part of the statistics package to compute some aggregate data for the entire user base.

Moving on to clicks on previous offers, we will sort this by key, which again is the user ID. We can do a simple yes or no on whether the user previously clicked on the offer. Finally, this script will write a summary for each user back to MapR Database, and a row of training data, which we can use as input into MLlib. This will be in Lib SBM format, which makes things a lot easier when reading the data back in to make classifications.

Let's start the Spark job. You can see that while it's running, if we tail this file, we can see all of the features being generated. Okay, the job is done. Let's build the dashboard. Now looking at the drill explorer, I can see my data in the various tables here, and just to make things easier, I have made some views with SQL that I will reference in the Tableau worksheet so you will see in a second. I'm in Tableau. If I bring in the live table and the customer table, Tableau will automatically join them on customer ID.

I saved the individual worksheets looking at all of the charts that I want. I have a map of customers by service level, what campaigns are driving people to sign up for this service, new things here like listening habits and whether customers are primarily a morning, afternoon, evening, or night listener, what share of tracks are mobile, and how many unique tracks. Now I can put all of this together in a dashboard, and it looks like this, with all of these metrics in one place, and you can see all of them showing up here. This is all computed very quickly based on the entire dataset, so you always have the latest information. I can take this a step further and also look at the individual profiles and see how they compare to average, say if a customer is talking to our support team, I can look at what their habits are individually, maybe compare to the average.

Now say it's nearing the end of our quarter, and what we really want to do is get more paying subscribers. There's a cost associated with playing some offer to our customers, so we don't want to play it to everyone, but when that customer connects to the service to listen to music, we want to determine quickly whether or not we should play the offer to them. This might be like a display ad or an audio offer as part of their listening stream or something like that.

Let's run the Spark job with regression model to get that, and look at what it does. I'll start the job here, and while it's running, I'll show you what it's doing. This is another PySpark script that pulls the features file from the job that we just ran and loads it in. You can see that there's a nice easy library function to do that here. The usual way to do this is to split the data into a training and a test set so we can train the model on the training set and then evaluate it on some rows that it hasn't seen before. This is so we can get an estimate of how good the classifications are.

In this file, we trained two regression models, then measured the number of times it was correct, and then print that to the output. It's possible to do a lot more in terms of accuracy, but to keep things simple for this video, I will just get a simple measurement of the error to see just broadly how well this is working. Okay, the job finished, and you can see the two models differed slightly on their error rate, and the training and test sizes look right since we split our 5,000-row customer data into 70/30. This is how as customers connect, I ought to be able to make a quick decision as to whether or not I'd play the offer.

Back in Tableau, we can see this is what our classifications would look like with the model output being maybe a zero or one or true or false, classifying each customer in real-time as they connect, and improving the model with the latest data as it arrives. This was an example of how MapR and Spark work together end-to-end as a platform for your data. Thanks for watching. Visit us at

This blog post was published May 27, 2015.

50,000+ of the smartest have already joined!

Stay ahead of the bleeding edge...get the best of Big Data in your inbox.

Get our latest posts in your inbox

Subscribe Now