Twitter is a powerful social media platform that can yield very useful information about products, brands and customer experience. This blog will explain how to:
To configure the environment (on AWS), we will go through the following steps:
In this section you will create a Twitter development account and register a Twitter Application that will allow you to establish a Twitter feed. It also explains how to get the Twitter credentials that Flume requires to establish Twitter as a source.
This section describes how to provision a MapR node on AWS that comes preconfigured with Flume and Drill, as well as the specific elements to support data streaming from Twitter and Drill query views.
The AMI will have a 6GB root drive and 100GB data drive. Please note that it is a small node, and very large volumes of data will slow the response time significantly for Twitter data queries.
Make sure that the instance has an external IP assigned; an Elastic IP is preferred, but not essential. Also verify that a security group is used with open TCP and UDP ports on the node. At this time, all ports are left open on the node.
Once the instance has been provisioned and booted up, you have to reboot the node in the AWS EC2 management interface to finalize the configuration.
The node should now have Flume and Drill installed and configured; all that remains is to update the Flume configuration files with your credentials and keywords.
See the sample files located at: https://github.com/mapr/mapr-demos/tree/master/drill-twitter-MSTR/flume
For the flume.conf file, enter the Twitter app credentials from the first section, and also the desired keywords, separated by commas. A keyword can consist of multiple words separated by spaces. Additionally, Tweets can be filtered for specific languages by entering the ISO 639-1 language codes separated by commas. If no language filtering is required, simply leave the parameter blank. For language codes, see: http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
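As a rough illustration of the format described above, the following Python sketch assembles the comma-separated keyword and language values before you paste them into flume.conf. The helper names join_keywords and join_languages are hypothetical, and the language check only verifies the two-letter shape of an ISO 639-1 code, not membership in the official list. The property names that receive these values are defined in the sample flume.conf and are not reproduced here.

```python
import re

# ISO 639-1 codes are two lowercase ASCII letters (for example "en" or "es").
# This only checks the shape, not whether the code is actually assigned.
_ISO_639_1_SHAPE = re.compile(r"^[a-z]{2}$")


def join_keywords(keywords):
    """Return the comma-separated keyword value; a keyword may contain spaces."""
    # Collapse repeated whitespace and drop empty entries.
    cleaned = [" ".join(k.split()) for k in keywords if k.strip()]
    for keyword in cleaned:
        if "," in keyword:
            # A comma inside a keyword would be read as a separator.
            raise ValueError(f"keyword must not contain a comma: {keyword!r}")
    return ",".join(cleaned)


def join_languages(codes):
    """Return the comma-separated language value, or "" for no filtering."""
    cleaned = [c.strip().lower() for c in codes if c.strip()]
    for code in cleaned:
        if not _ISO_639_1_SHAPE.match(code):
            raise ValueError(f"not a two-letter ISO 639-1 code: {code!r}")
    return ",".join(cleaned)


if __name__ == "__main__":
    print(join_keywords(["big data", "hadoop", "apache drill"]))
    print(join_languages(["en", "es"]))
```

Passing an empty list to join_languages returns an empty string, which corresponds to leaving the language parameter blank.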
To start Flume and the data stream, simply go to the
First go into a Linux screen terminal by simply typing “screen” in the command line.
./bin/flume-ng agent --conf ./conf/ -f ./conf/flume.conf -Dflume.root.logger=INFO,console -n TwitterAgent
You can detach from screen by pressing Ctrl+a and then d. To go back to the screen terminal, simply enter screen -r to reattach.
Twitter data will now be streaming into the system. You can verify volumes by executing du -h /mapr/drill_demo/twitter/feed.
Please note that it takes a while to build up a volume of data in the feed directory. You should allow at least 20-30 minutes to start noticing data in the feed directory.
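If you would rather check the feed from a script than run du by hand, a minimal sketch along these lines could work. It assumes the feed directory path given above is readable from where the script runs; the function names are hypothetical. Note that it sums file sizes, whereas du reports disk usage, so the two numbers may differ slightly.

```python
import os


def feed_stats(feed_dir):
    """Return (file_count, total_bytes) for all files under feed_dir."""
    file_count = 0
    total_bytes = 0
    # os.walk yields nothing for a missing directory, so the result is (0, 0).
    for root, _dirs, files in os.walk(feed_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                total_bytes += os.path.getsize(path)
                file_count += 1
            except OSError:
                # A file may disappear while the stream is being written; skip it.
                continue
    return file_count, total_bytes


def human_size(num_bytes):
    """Format a byte count with binary units, similar in spirit to du -h."""
    size = float(num_bytes)
    for unit in ("B", "K", "M", "G"):
        if size < 1024 or unit == "G":
            return f"{size:.1f}{unit}"
        size /= 1024


if __name__ == "__main__":
    count, total = feed_stats("/mapr/drill_demo/twitter/feed")
    print(f"{count} files, {human_size(total)}")
```

Running it a few times over the first half hour should show the totals growing as data arrives.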
Drill is already configured and ready, but the data needs to be present in the feed directory before any of the queries will function.
MicroStrategy provides AWS instances in various sizes. The MicroStrategy instance comes with a free 30-day trial, but note that AWS charges still apply for the platform and OS.
This section covers the steps to provision the MicroStrategy node in AWS.
To start, go to the MicroStrategy website: http://www.microstrategy.com/us/analytics/analytics-on-aws
It is very important to make sure that the MicroStrategy instance has a public IP; an Elastic IP is preferred but not essential.
The instance is now accessible with RDP and is using the relevant AWS credentials and security.
For more information, see: http://www.microstrategy.com/Strategy/media/downloads/products/cloud/cloud_aws-user-guide.pdf
In this section, we will go through the steps to configure MicroStrategy to integrate with Drill using the ODBC driver. In addition, we’ll cover how to install a MicroStrategy package with a number of useful prebuilt reports for working with Twitter data. These reports can be modified as needed, or used as a template to create new and more interesting reports and analysis models.
NOTE: For Quick Start, the v0.08.1.0618 version of the ODBC driver can be used, which is located here: https://package.mapr.com/tools/MapR-ODBC/MapR_Drill/MapRDrill_odbc_v0.08.1.0618/MapRDrillODBC32.msi
The Quick Start package requires that a System DSN named ‘Twitter’ is configured with the ODBC administrator.
The Drill object is part of the package and doesn’t need to be configured.
Make sure that you use the AWS Private IP if both the MapR node and the MicroStrategy instance are located in the same region (which is recommended).
Download the configuration package for MicroStrategy on the Windows system here: https://github.com/mapr/mapr-demos/blob/master/drill-twitter-MSTR/MSTR/DrillTwitterProjectPackage.mmp
(You can either use Git for Windows or the full GitHub for Windows).
First, create a new project with MicroStrategy Developer:
Click on “Create Project” and type a name for the new project.
No further steps are required after the initial project creation step. Simply click OK.
The Project should now be visible in MicroStrategy Developer.
Open MicroStrategy Object Manager.
Connect to the required Project Source and log in as Administrator.
Select the project that the package should be loaded into.
Then, go to the Tools menu and select Import Configuration Package.
Open the configuration package file and click “proceed.”
The package with the reports will now be available in MicroStrategy.
The reports can be tested and modified in MicroStrategy Developer, where permissions can also be configured as needed.
First, update the schema by clicking on the Schema menu and selecting “Update Schema.”
Select all check boxes and click “Update.”
To create a user and set the Administrator password, expand Administration, then User Manager and click on “Everyone.”
Right-click to create a new user, or click on Administrator to edit the password.
The package contains reports in three main categories:
These reports can be copied and modified as needed, and serve as a template on how to query the Twitter data using Drill. There are 18 reports in the package, and most include prompts to allow the user to specify date ranges, output limits where relevant, and enter specific terms as needed.
The reports can be accessed through MicroStrategy Developer or the web interface. The web interface provides easy access to work with the reports and make them available to other users. MicroStrategy Developer provides a more powerful interface to modify reports or add new reports, but requires RDP access to the node.
Using a web browser, enter the URL for the web interface:
Log in with the User (created previously) or Administrator
NOTE: This requires the credentials created initially with Developer.
Once logged in, choose the project that was used to load the analysis package.
Then select “Shared Reports” and the folders with the three main categories of the reports will be visible.
Some reports will require prompts before executing.
Enter the parameters and click on “Run Report” to execute.
Report formatting and various other functions are available in the web interface.
To refresh the data or re-enter prompt values, click on the Data Menu and then select Refresh or Re-prompt.
The reports will be located in the Public Objects folder of the project that was chosen to install the package in.
Many of the reports will require user input in the form of prompts to select the desired data. In this example we will select the Top Hashtags report in the right-hand column.
This report requires a Start Date and End Date to specify the date range for data of interest; the default values of the prompts are to select data for the last two months, ending with the current date.
In addition you can specify the limit for the number of Top Hashtags to be returned; the default is to return the top 10 hashtags.
The final result is then displayed as a bar chart with the hashtag and number of times it appeared in the specified date range.
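To make the logic behind this report concrete, here is a small Python sketch of the same kind of aggregation: a default date range covering the last two months, and a count of the top hashtags within a range. This is not the query shipped in the package, which runs through Drill; it is only an illustration. It assumes tweets in the standard Twitter JSON layout with a created_at timestamp and hashtags under entities.hashtags, and it treats hashtags as case-insensitive, which is an assumption of this sketch. The function names are hypothetical.

```python
import calendar
from collections import Counter
from datetime import date, datetime

# Timestamp layout used by the created_at field in Twitter's JSON.
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def two_months_before(day):
    """Return the date two calendar months before day, clamped to month end."""
    month_index = day.year * 12 + (day.month - 1) - 2
    year, month = divmod(month_index, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def top_hashtags(tweets, start, end, limit=10):
    """Return up to limit (hashtag, count) pairs for tweets dated start..end."""
    counts = Counter()
    for tweet in tweets:
        tweeted_on = datetime.strptime(
            tweet["created_at"], TWITTER_TIME_FORMAT
        ).date()
        if not (start <= tweeted_on <= end):
            continue
        for tag in tweet.get("entities", {}).get("hashtags", []):
            # Assumption of this sketch: "Drill" and "drill" count as one tag.
            counts[tag["text"].lower()] += 1
    return counts.most_common(limit)


if __name__ == "__main__":
    end = date.today()
    start = two_months_before(end)
    print(f"Default prompt range in this sketch: {start} to {end}")
```

The limit argument plays the role of the report's Top Hashtags prompt, with 10 as the default.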
Below are a couple of samples of other reports available in the bundle.
Total volume of tweets by hour
Top Retweets for a date range with original Tweet date and count in the date range.
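For readers curious about what a report such as the hourly tweet volume computes, the following Python sketch groups tweets by hour. It is only an illustration and not the report's actual Drill query. It assumes that "by hour" means hour of day in UTC and that timestamps use the standard Twitter created_at layout; the function name is hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone

# Timestamp layout used by the created_at field in Twitter's JSON.
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def tweets_per_hour(created_at_values):
    """Return a list of (hour, count) pairs for hours 0 through 23, in UTC."""
    counts = Counter()
    for value in created_at_values:
        moment = datetime.strptime(value, TWITTER_TIME_FORMAT)
        # Normalize to UTC so tweets with different offsets are comparable.
        counts[moment.astimezone(timezone.utc).hour] += 1
    # Include hours with no tweets so a chart has a continuous axis.
    return [(hour, counts.get(hour, 0)) for hour in range(24)]
```

Returning all 24 hours, including empty ones, keeps the output easy to chart.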
In this tutorial, you learned how to configure an environment to stream Twitter data using Apache Flume. You then learned how to analyze the data in native JSON format with SQL using Apache Drill, and how to run interactive reports and analysis using MicroStrategy. Let us know if you have any feedback on the tutorial, or if you are running into any issues.
Here are some links you will find useful for getting started with Apache Drill: