Apache Storm on Hadoop

Industrial sensors, data center devices, consumer wearables, and social media networks are typical sources of data streams that can potentially be harnessed to gain real-time insights about a business and its consumers. Stream processing falls squarely in the bucket of big data problems, and it can be handled with purpose-built stream processing engines such as Apache Storm on Hadoop.

Apache Storm is an open source, scalable, fault-tolerant, distributed real-time computation system. Storm's advantages include:

  1. Fault tolerance: if worker threads die or a node goes down, the workers are automatically restarted;
  2. Scalability: throughput rates as high as one million 100-byte messages per second per node can be achieved; and
  3. Ease of use in deploying and operating the system.
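
To put the per-node throughput figure quoted above in perspective, the following back-of-the-envelope calculation converts it into bandwidth. It is only arithmetic on the two numbers from the list (message rate and message size) and uses decimal units; it says nothing about what a particular cluster will actually sustain.

```python
# Back-of-the-envelope conversion of the per-node figure quoted in the text.
MESSAGES_PER_SECOND = 1_000_000   # from the text
BYTES_PER_MESSAGE = 100           # from the text

bytes_per_second = MESSAGES_PER_SECOND * BYTES_PER_MESSAGE
megabytes_per_second = bytes_per_second / 1_000_000        # decimal MB
megabits_per_second = bytes_per_second * 8 / 1_000_000     # decimal Mbit

print(f"{megabytes_per_second:.0f} MB/s, {megabits_per_second:.0f} Mbit/s per node")
```

In other words, the quoted rate corresponds to roughly 100 MB/s of payload per node, before any framing or replication overhead.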

The Apache Storm project graduated from incubation to an Apache Top-Level Project in September 2014.

Storm on MapR

MapR delivers an enterprise-grade platform for building high-performance streaming applications. Beyond the usual Storm connectors, MapR uniquely supports an NFS-based spout for Storm, allowing users to deploy a simplified streaming solution directly on the Hadoop platform without having to deploy a separate queueing cluster.

Here are some Storm solutions deployed by MapR customers:

Clickstream Analysis
Users can augment the online customer experience with targeted product placements based on both historical customer behavior and real-time clickstream data from online customers. The Storm bolts are designed so that aggregated counters on streaming clicks trigger the necessary product placement in the web application that serves the end user.
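
The counter-and-trigger idea described above can be sketched in a few lines. This is a plain-Python toy, not the Storm API and not the customer's implementation: the class name `ClickCounterBolt`, the `threshold` parameter, and the click fields `user` and `category` are all hypothetical, chosen only to illustrate how a running counter can fire a placement decision.

```python
from collections import Counter


class ClickCounterBolt:
    """Toy stand-in for a counting bolt (hypothetical; not the Storm API)."""

    def __init__(self, threshold):
        self.threshold = threshold   # hypothetical trigger level
        self.counts = Counter()      # (user, category) -> clicks seen so far

    def process(self, click):
        """Count one click; return a placement decision the first time
        the counter reaches the threshold, otherwise None."""
        key = (click["user"], click["category"])
        self.counts[key] += 1
        if self.counts[key] == self.threshold:
            return {"user": click["user"], "promote": click["category"]}
        return None


# Hypothetical click stream for illustration only.
clicks = [
    {"user": "u1", "category": "shoes"},
    {"user": "u1", "category": "hats"},
    {"user": "u1", "category": "shoes"},
    {"user": "u2", "category": "shoes"},
    {"user": "u1", "category": "shoes"},
]

bolt = ClickCounterBolt(threshold=3)
placements = [p for p in (bolt.process(c) for c in clicks) if p is not None]
print(placements)
```

Here the third "shoes" click from user `u1` is the one that produces a placement; a real deployment would presumably also blend in the historical behavior mentioned above.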

Real-time Threat Detection
A large managed security services provider augments its cloud security service with real-time threat detection using Storm on a MapR cluster. As sensor data and logs from intrusion protection systems are collected centrally, the very first action taken on these streams is in-memory processing that checks whether the incoming streams contain any known threat footprints that need to be flagged immediately.
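
As a rough illustration of that first in-memory check, the sketch below filters a stream of events against a set of known footprints. It is plain Python rather than Storm code, and everything in it is assumed for the example: the event fields `src_ip` and `action`, the function name `flag_threats`, and the footprint set, which uses addresses from the reserved documentation ranges.

```python
# Hypothetical footprint list; the addresses come from documentation-only ranges.
KNOWN_THREAT_FOOTPRINTS = {"198.51.100.7", "203.0.113.42"}


def flag_threats(events, footprints):
    """Yield a flagged copy of each event whose source address
    matches a known footprint; other events are dropped."""
    for event in events:
        if event["src_ip"] in footprints:   # in-memory set lookup
            yield {**event, "flagged": True}


# Hypothetical log events for illustration only.
events = [
    {"src_ip": "192.0.2.10", "action": "login"},
    {"src_ip": "203.0.113.42", "action": "port_scan"},
    {"src_ip": "192.0.2.11", "action": "login"},
]

flagged = list(flag_threats(events, KNOWN_THREAT_FOOTPRINTS))
print(flagged)
```

Keeping the footprints in an in-memory set makes each lookup constant-time on average, which is the property that makes this kind of check suitable as the first step on a stream.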

Enhanced Network Service Quality
A large telecom company is using Storm running on their MapR cluster to make real-time service quality decisions. Using Storm and MapR together allows real-time systems to integrate with batch systems to analyze long-term trends.

Here is a blog post that showcases real-time Twitter firehose analysis using Apache Storm on Hadoop.

Storm Basics

Storm is built on abstractions that are easy to understand and implement. The three main abstractions are spouts, bolts, and topologies.

A spout represents a streaming source. Spouts typically read from a queueing system such as Kafka or, on the MapR platform, from an NFS-based spout source. They can also work directly with external sources such as the Twitter streaming API.

A bolt is where the computation logic sits, and it can process any number of input streams while producing any number of new output streams. Functions, filters, streaming joins, and streaming aggregations can all be implemented here.

A topology is a network of these spouts and bolts, with each edge in the network representing a bolt subscribing to the output stream of some other spout or bolt.
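
To make the three abstractions concrete, here is a minimal single-process sketch in plain Python. It does not use the Storm API and ignores distribution, parallelism, and fault tolerance; the class names (`SentenceSpout`, `SplitBolt`, `CountBolt`, `Topology`) and the `subscribes_to` parameter are hypothetical and exist only to show a spout feeding bolts along subscription edges.

```python
from collections import Counter, defaultdict, deque


class SentenceSpout:
    """Streaming source; a fixed list stands in for a queue or external API."""

    def __init__(self, sentences):
        self.sentences = sentences

    def emit(self):
        yield from self.sentences


class SplitBolt:
    """Function-style bolt: one sentence in, one output tuple per word."""

    def process(self, sentence):
        return sentence.lower().split()


class CountBolt:
    """Streaming aggregation: keeps a running count per word."""

    def __init__(self):
        self.counts = Counter()

    def process(self, word):
        self.counts[word] += 1
        return []   # terminal bolt: emits nothing downstream


class Topology:
    """Network of components; each edge is a subscription to an output stream."""

    def __init__(self):
        self.components = {}
        self.subscribers = defaultdict(list)   # source name -> subscriber names

    def add(self, name, component, subscribes_to=()):
        self.components[name] = component
        for source in subscribes_to:
            self.subscribers[source].append(name)

    def run(self, spout_name):
        # Each pending entry is (name of emitting component, emitted tuple).
        pending = deque((spout_name, t) for t in self.components[spout_name].emit())
        while pending:
            source, tup = pending.popleft()
            for name in self.subscribers[source]:
                for out in self.components[name].process(tup):
                    pending.append((name, out))


topology = Topology()
topology.add("sentences", SentenceSpout(["Storm processes streams", "Storm scales"]))
topology.add("split", SplitBolt(), subscribes_to=["sentences"])
counter = CountBolt()
topology.add("count", counter, subscribes_to=["split"])
topology.run("sentences")
print(dict(counter.counts))
```

The wiring mirrors the description above: `split` subscribes to the spout's output stream and `count` subscribes to the output stream of `split`. In an actual Storm application the components would run in parallel across the cluster, which this sketch deliberately leaves out.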

Please refer to the Storm documentation pages for more information on building applications using the Apache Storm architecture.

