Real-Time Twitter Analytics with MapR Data Science Refinery in the Clouds

Introduction

MapR Orbit Cloud Suite is the perfect foundation for organizations choosing to build their own cloud platforms. MapR is committed to an excellent user experience when deploying on the cloud. We have cloud marketplace offerings on AWS and Azure to accelerate the deployment process so our customers can benefit from all MapR and the Cloud have to offer.

With the MapR Data Science Refinery, MapR provides businesses with a suite of data science tools to distill insights from their data and turn those insights into operational next-gen applications.

MapR Data Science Refinery

In this blog demo, we walk you through deploying a MapR sandbox in the cloud (AWS and Azure) to analyze real-time tweets with the Data Science Refinery. We use Apache Zeppelin to analyze the tweets with tools like Apache Drill, Apache Spark, and Apache Hive, which are all integrated with the MapR Data Platform. All three major components of the MapR Data Platform - MapR XD, MapR Event Store, and MapR Database - actively participate in the process for an optimal user experience. Customers no longer need to deploy third-party streaming software such as Apache Kafka or a NoSQL database such as Apache HBase. With MapR, configuring your environment is easy and eliminates the complex installation and configuration headaches that occur with third-party software.

In this scenario, tweets are being streamed into MapR Event Store by a producer container while a consumer container subscribes to the tweet topic, sanitizes the tweets, then writes to MapR Database. A sample Zeppelin notebook is also provided to demonstrate how Data Science Refinery can utilize the underlying tools (Drill, Spark, and Hive) to analyze the tweets and visualize the results in charts. The data flow is described in the architectural diagram below.

Data Flow Architectural Diagram
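To make the consumer's "sanitize" step concrete, here is a minimal Python sketch of what such a step might look like. It is not the demo's actual consumer code: the set of fields kept, the URL-stripping rule, and the function name `sanitize_tweet` are all assumptions made for illustration. The only platform detail it relies on is that MapR Database JSON documents are keyed by an `_id` field.

```python
import json
import re

# Hypothetical choice of fields to keep; the demo's consumer may keep others.
KEEP_FIELDS = ("id_str", "created_at", "text", "lang")

# Matches http/https links so they can be dropped from the tweet text.
URL_RE = re.compile(r"https?://\S+")


def sanitize_tweet(raw_json):
    """Parse one raw tweet message and return a flat, cleaned record.

    Returns None for messages that are malformed or lack an id or text.
    """
    try:
        tweet = json.loads(raw_json)
    except json.JSONDecodeError:
        return None  # skip messages that are not valid JSON
    if not isinstance(tweet, dict):
        return None
    if not isinstance(tweet.get("text"), str) or "id_str" not in tweet:
        return None

    record = {key: tweet.get(key) for key in KEEP_FIELDS}
    # Remove links, then collapse runs of whitespace into single spaces.
    record["text"] = " ".join(URL_RE.sub("", tweet["text"]).split())
    # Flatten the nested user object down to the screen name only.
    record["user"] = (tweet.get("user") or {}).get("screen_name")
    # Use the tweet id as the document key.
    record["_id"] = record.pop("id_str")
    return record
```

In a pipeline like the one described above, a function of this shape would sit between reading a message from the tweet topic and writing the resulting document to the database table.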

Editor's note: The custom code to reproduce this example, including the AWS Cloud Formation Script and Azure QuickStart Template can be found here: https://github.com/jsunmapr/tweets-dsr-demo

Prerequisites

A Cloud Account (AWS or Azure)

First, you need a cloud account on either AWS or Azure. If you don't have one, sign up on the cloud provider's website. Once you have an account, make sure you have enough credits, adequate quota for VMs, and privileges to create resources in the cloud. If you do not already have the proper IAM privileges, this MapR documentation shows how to create them: https://community.mapr.com/docs/DOC-2301-manual-aws-deployment#jive_content_id_Create_Policies_and_Roles

A Twitter Account

Second, you need a Twitter account to capture the tweets for analysis. You can create a new account or use an existing one. After you have signed up, go to https://apps.twitter.com to create an application. Once the application is created, you will be given four credentials: a consumer key, a consumer secret, an access token, and an access token secret. Save these credentials for later use. See the example screenshot below.

Four Credentials
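If you keep the four credentials outside the template form, for example to reuse them in your own scripts, a small check that none of them is missing can save a failed deployment. The sketch below is only an illustration: the environment variable names and the function `load_twitter_credentials` are hypothetical, and the templates in this demo take the values as form parameters instead.

```python
import os

# Hypothetical environment variable names, chosen for this example only.
REQUIRED = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)


def load_twitter_credentials(env=None):
    """Return the four credentials as a dict, or raise if any is missing."""
    env = os.environ if env is None else env
    # Treat unset and blank values the same way.
    missing = [name for name in REQUIRED if not env.get(name, "").strip()]
    if missing:
        raise ValueError("missing Twitter credentials: " + ", ".join(missing))
    return {name: env[name].strip() for name in REQUIRED}
```

Calling `load_twitter_credentials()` with no argument reads from the process environment; passing a plain dict makes the function easy to try out without touching real secrets.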

Deploy the MapR Cloud Sandbox

The easiest way to quickly deploy VMs in the cloud is to use the cloud provider's solutions template. In AWS, it is called the CloudFormation template. In Azure, it is called the Azure Resource Manager (ARM) solutions template.

In case you want to install MapR manually, please refer to these MapR community blogs to get you started:

To make the deployment easy, I have prepared automated scripts along with the cloud templates to streamline the installation.

For AWS users, click one of the following links depending on where you are physically located (four regions are supported):

In the CloudFormation template, at a minimum you should type in your AWS key name and the Twitter credentials. Leave everything else as default. See the example screenshot below.

CloudFormation template

For Azure users, click the following link for all regions: https://tinyurl.com/y9dhzyb2

In the ARM template, at a minimum you should choose a valid subscription, type in a resource group name and the Twitter credentials, and leave everything else as default before hitting the "Purchase" button. See the example screenshot below.

Custom Deployment

In about 30 minutes, you will be notified that the deployment is complete. Next, locate the URL for Zeppelin UI access. In AWS, go to the CloudFormation service, highlight the CloudFormation stack named "MapR600TwitterDSRDemo," then select the "Outputs" tab below to find the username, password, and URL for the Zeppelin UI. See the screenshot below.

CloudFormation Stack
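The same outputs can also be read programmatically instead of through the console. The following is a sketch, assuming the third-party `boto3` package is installed and AWS credentials are configured; the helper names are made up for this example, and the actual output key names depend on the template, so print the returned dict rather than relying on specific keys.

```python
def outputs_to_dict(stack):
    """Turn a CloudFormation stack description into {OutputKey: OutputValue}."""
    return {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}


def fetch_stack_outputs(stack_name="MapR600TwitterDSRDemo", region=None):
    """Fetch the outputs of a deployed stack (requires boto3 and AWS access)."""
    import boto3  # imported lazily so outputs_to_dict works without boto3

    client = boto3.client("cloudformation", region_name=region)
    stacks = client.describe_stacks(StackName=stack_name)["Stacks"]
    return outputs_to_dict(stacks[0])
```

For example, `print(fetch_stack_outputs(region="us-east-1"))` would list whatever outputs the stack defines; the region here is only a placeholder for the one you deployed to.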

For Azure, click the "Deployment Succeed" message, then "Microsoft.Template." The URL and login information are listed under "Outputs." See the example screenshots below.

Microsoft Template

Deployments

Microsoft Template

Once you have the Zeppelin URL, point your browser at it and log in. You should see a sample notebook named "tweets." See the example screenshot below.

Zeppelin URL

Click the tweets notebook to see several queries based on Drill, Hive, and Spark. Click the play buttons one paragraph at a time to see the chart generated by your query.

Generated Chart
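The notebook's own queries are not reproduced in this post. As a rough idea of the kind of aggregation that could feed such a chart, here is a plain Python sketch that counts hashtags across tweet records. It assumes each record is a dict with a `text` field, which is a guess at the stored schema, and it runs locally rather than through Drill, Hive, or Spark.

```python
import re
from collections import Counter

# A hashtag is taken to be '#' followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")


def top_hashtags(records, n=10):
    """Return the n most common hashtags (lowercased) with their counts."""
    counts = Counter()
    for record in records:
        # Tolerate records whose text is missing or None.
        tags = HASHTAG_RE.findall(record.get("text") or "")
        counts.update(tag.lower() for tag in tags)
    return counts.most_common(n)
```

The result is a list of `(hashtag, count)` pairs, which maps naturally onto a bar chart with hashtags on one axis and counts on the other.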

Summary

In this blog, we demonstrated how to use a cloud provider's solutions template to quickly deploy a MapR cloud sandbox. The sandbox also has container services performing the tasks of publishing raw tweets to MapR Event Store, consuming these tweets, sanitizing them, and writing them to MapR Database.


This blog post was published January 18, 2018.