Twitter Feed Fuels Real-time Hadoop with Storm and MapR at the Strata Conference

Hadoop users were excited to see the real-time Hadoop analytics demonstration at the Strata Conference in Santa Clara. By streaming the #strataconf Twitter hashtag directly into a cluster during the conference, MapR displayed two real-time tag clouds: a word cloud of the most frequently used words in conference tweets, and a user-name cloud of the top tweeters. Watching the information change proved mesmerizing for some.
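To make the two tag clouds concrete, here is a minimal Python sketch of the counting behind them. It is not the code used in the demo: the class name `TagCloudCounts`, the stop-word list, and the tokenizing rules are all assumptions chosen for illustration. It keeps one running count of words and one of user names, which is the data a tag cloud needs to size its entries.

```python
import re
from collections import Counter

# Hypothetical tokenizer and stop-word list; the demo's real rules are not described.
WORD_RE = re.compile(r"[a-z']+")
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "at", "is", "rt"}


class TagCloudCounts:
    """Running word and user counts for a stream of tweets."""

    def __init__(self):
        self.words = Counter()
        self.users = Counter()

    def add_tweet(self, user, text):
        """Update both counters with one tweet."""
        self.users[user] += 1
        for token in text.lower().split():
            # Skip hashtags, mentions, and links so they do not dominate the cloud.
            if token.startswith(("#", "@", "http")):
                continue
            for word in WORD_RE.findall(token):
                if len(word) > 1 and word not in STOP_WORDS:
                    self.words[word] += 1

    def top_words(self, n=10):
        return self.words.most_common(n)

    def top_users(self, n=10):
        return self.users.most_common(n)


if __name__ == "__main__":
    counts = TagCloudCounts()
    counts.add_tweet("alice", "Storm and MapR at #strataconf")
    counts.add_tweet("bob", "Storm demo is mesmerizing http://example.com")
    counts.add_tweet("alice", "RT @bob: Storm demo")
    print(counts.top_words(2))
    print(counts.top_users(1))
```

In a streaming setting, `add_tweet` would be called once per incoming tweet and the display would re-read `top_words` and `top_users` on a timer.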

How did we do this? By bringing MapR and Storm together to capitalize on their strengths.

Real-time analytics are becoming commonplace in businesses today. Data sources include social media, stock tick data, network sensors, payments, and ad impressions. Rarely does one tool fit all of an enterprise's use cases, data feeds, and analysis needs. Hadoop's venerable MapReduce framework has proven its worth at scale, but at a price: higher latency. For interactive querying, and even more so for real-time stream computation, traditional Hadoop distributions haven't played much of a role and need to be augmented with other solutions. However, the characteristics of MapR's data platform allow us to interact with these lower-latency systems.

Take Storm, for example. Storm was written by Nathan Marz at BackType/Twitter and is used as a continuous, distributed stream computation engine for the massive volume of tweets they need to process. Much like Hadoop, Storm hides the complexity of distributed processing and lets you focus on your business problem, not the underlying system. Usually, Storm gets its data from a queuing system like Kafka or Kestrel. One of the most common things to do at the end of the real-time workflow is to write the raw data to Hadoop for batch analysis in a less time-sensitive setting.

Delivering Streaming Writes for Realtime Applications

Writing directly to HDFS would be too slow for the real-time computation workflow, but writing to MapRFS opens up some interesting possibilities. Because of its true read/write nature, MapRFS allows us to get rid of the queuing system and implement publish-subscribe models within the data platform itself. Storm can then 'tail' a file to which it wishes to subscribe, and as soon as new data hits the file system, it is injected into the Storm topology. This allows for strong Storm/Hadoop interoperability, and it unifies and simplifies the technologies on one platform.
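The 'tail' idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the demo's actual code. It assumes the subscribed file is reachable through an ordinary file path and that writers append newline-terminated records. `FileTailer`, `follow`, and their parameters are hypothetical names. In a real Storm topology this polling logic would sit inside a spout, which is not shown here.

```python
import time


class FileTailer:
    """Remembers a byte offset into a growing file, similar to `tail -f`."""

    def __init__(self, path, from_start=False):
        self.path = path
        self.offset = 0
        self._partial = b""  # bytes of a line that has not been terminated yet
        if not from_start:
            # Subscribe from "now": skip whatever is already in the file.
            with open(path, "rb") as f:
                f.seek(0, 2)
                self.offset = f.tell()

    def poll(self):
        """Return the complete lines appended since the previous call."""
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            chunk = f.read()
            self.offset = f.tell()
        pieces = (self._partial + chunk).split(b"\n")
        # The last piece is either empty or an unfinished line; keep it for later.
        self._partial = pieces.pop()
        return [piece.decode("utf-8") for piece in pieces]


def follow(tailer, handle, poll_interval=0.1, max_idle_polls=None):
    """Pass each new line to `handle`; stop after `max_idle_polls` empty polls."""
    idle = 0
    while max_idle_polls is None or idle < max_idle_polls:
        lines = tailer.poll()
        if lines:
            idle = 0
            for line in lines:
                handle(line)
        else:
            idle += 1
            time.sleep(poll_interval)
```

A subscriber would call `follow(FileTailer(path), handle)` with its own `handle` callback, leaving `max_idle_polls` unset to run indefinitely. Holding back the unfinished trailing line means a record is only delivered once the writer has completed it, which matters when the reader polls in the middle of a write.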

This blog post was published March 06, 2013.
