The well-known open source project Storm is in the process of moving into the Apache Software Foundation's family of open source projects. This is a big step for Storm and for the community developing this already well-respected software.
You may know of Storm if you have an interest in real-time applications. Storm makes it easy to reliably process very large streams of data, so you may find it particularly useful if you have lots of data passing by quickly or when you need to instantly update a dashboard. For example, Storm is a good technology choice for analysis of sensor data or for call detail records (CDR) in telecom. Storm is scalable and fault tolerant, with benchmarks reported at more than a million tuples per second per node. Another aspect that makes Storm appealing is that it integrates with queuing, database and big data technologies you may already use.
Storm was originally started by Nathan Marz when he worked at BackType, before that company was acquired by Twitter. Since the acquisition, Twitter and other companies have used Storm for a variety of internal processes. Given that Storm was already an active and fairly widely adopted open source project, you may be surprised to learn that becoming an Apache project involves more than a single decision. In fact, the move involves lots of people and a fair bit of work.
“Joining Apache is a multi-step process,” says Ted Dunning, MapR’s Chief Application Architect and one of five people nominated as mentors for Apache Storm. The project champion for Storm at Apache is Doug Cutting. Together, the mentors and champion will facilitate Storm’s transition to Apache. “I’m pleased to have the opportunity to contribute to this outstanding project, even though I won’t be able to help much by actually coding. Storm fills an important gap in the Apache ecosystem.”
“Now that Storm’s proposal to join Apache has been approved, several things are happening to establish the incubator project,” Ted explained. These include
“Once established as an Apache incubator project, one of the main requirements to move toward graduation is to demonstrate the ability to build community. That’s one of the places that I hope to be able to help. This requirement for a vibrant community is one of the major requirements for Apache projects.”
Ted went on to describe another requirement for an incubator project to graduate to top-level status: more than one significant release. Storm has already had four significant code releases, but these were outside of Apache. “Apache sets a high bar on careful review of licensing, and getting this exactly right is one of the things that new incubator projects often struggle with,” he explained.
What is the advantage of joining Apache? Reputation is one benefit: companies know that projects in Apache must meet rigorous standards with respect to process, and that provides confidence in the software. Another advantage is that the requirement for a strong community means that Storm is likely to last longer than the attention span of any single developer. Both of these factors are crucial for an open source project to get widespread adoption, especially in commercial settings.
In some cases raw data is not what you want to store, so you might use Storm to aggregate, compress or reformat input data before you actually store anything. Storm can be used to convert input data to space-conserving formats like Parquet or ORC files. Storm could also be used to create summaries from streaming data, such as unique counts per minute or totals every minute, with output stored in HBase or MapR M7 tables, or directly in the MapR Distributed File and Object Store (MapR XD). Storm’s focus on real-time is a good match with the real-time nature of MapR XD, which works somewhat differently than the HDFS of other Hadoop distributions.
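To make the idea of per-minute summaries concrete, here is a minimal sketch in plain Python. It does not use Storm's API; in a real topology, logic like this would live inside a bolt. The function name `summarize_per_minute` and the `(timestamp, key, value)` event shape are assumptions made purely for illustration.

```python
from collections import defaultdict


def summarize_per_minute(events):
    """Group events into one-minute buckets and summarize each bucket.

    `events` is an iterable of (timestamp_seconds, key, value) tuples.
    This event shape is hypothetical, chosen only for this sketch.
    """
    unique_keys = defaultdict(set)   # minute start -> distinct keys seen
    totals = defaultdict(float)      # minute start -> running sum of values

    for timestamp, key, value in events:
        # Round the timestamp down to the start of its minute.
        minute_start = int(timestamp // 60) * 60
        unique_keys[minute_start].add(key)
        totals[minute_start] += value

    # One compact summary record per minute, instead of every raw event.
    return {
        minute: {"unique_keys": len(unique_keys[minute]), "total": totals[minute]}
        for minute in sorted(unique_keys)
    }
```

Only the small summary records would then need to be written to a table or file, rather than the full raw stream.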
One of the challenges with Storm is to use it at scale with Apache Hadoop applications. Hadoop’s heritage is batch processing and this is reflected in the way that HDFS doesn’t support real-time reading of data as it is being written. This can make it difficult to efficiently marry the real-time streaming properties of Storm with Hadoop.
MapR’s distribution for Hadoop has been re-engineered to work differently, with some special capabilities in the file system layer. MapR XD is fully read/write and real-time, which makes it possible to use MapR XD directly as a real-time queue for incoming data. Data can therefore be streamed directly to the cluster to be processed by Storm, without an intermediate step on a separate server.
For more information about Apache Storm, see the incubator proposal at http://wiki.apache.org/incubator/StormProposal;
Consider the O’Reilly book Getting Started with Storm by Jonathan Leibiusky, Gabriel Eisbruch, and Dario Simonassi;
See slides for a talk on “Real-time Storm + Hadoop” delivered to the Bay Area Storm user group by Ted Dunning in June 2013.