Can we agree at the outset that modern businesses rely heavily on data to make critical decisions, and the ability to make decisions in real time is very valuable? Good.
So what keeps us from always making decisions in real time? Typically, some up-front data processing is necessary, which delays real-time decisions. For example, extract/transform/load (ETL) is a necessary step in getting all relevant data to the analysts, and it is mostly done using batch processing methods. Operational data is extracted from sources like relational databases and text files, transformed into a common format that analytics tools can handle, and then loaded into the analytical database or warehouse for decision support analysis.
This process can take hours. In fact, because of the need for human involvement, it often takes days. By the time managers get reports, the information is out of date. That has business consequences. For example, if the analytics indicate that a clothing store is about to have a big surge in demand for parkas because of a sudden drop in temperature tomorrow, managers who rely on batch processing may not learn about it in time to stock inventory and optimize floor space, and they risk losing sales.
That’s where event streaming comes in. Event streaming is revolutionizing how we analyze data.
Big data is created one event at a time, and it should be analyzed as such. Event stream processing was built on this approach to real-time data analysis, and we’ve only scratched the surface of its potential.
Event streaming is real-time analytics using streaming data. Typical sources of that data might include application logs, system logs, sensors, security cameras, Twitter conversations and other inputs that constantly generate information. Running real-time analytics against streaming data can reveal all sorts of interesting opportunities.
For example, data streams from sensors on the factory floor can detect hot spots that may indicate that a piece of equipment is about to fail. Repair crews can be dispatched to fix the problem before that failure brings down the assembly line.
Or take computer security. Real-time analytics engines can read data streaming in from system logs in real time to identify abnormal traffic patterns, unknown IP addresses, or bandwidth surges that could indicate that an attack is under way. Batch processing isn’t fast enough to give you those insights in time to fend off an attack.
Event streaming doesn’t replace batch analytics but provides different value. It’s used in scenarios in which rapid response makes a difference and in which batch analytics is impractical. It complements batch processing by providing the ability to make rapid decisions, while batch processing can continue to be utilized for deeper analysis at a lower frequency.
Streaming requires a different approach to capturing and synthesizing data. Batch processing methodologies used for ETL are too slow, so most streaming systems use a publish-and-subscribe approach like MapR Event Store and Apache Kafka. Publish and subscribe is exactly what it says: The data source is the publisher and the analytics engine is the subscriber. Every new event the source publishes is immediately sent to the subscriber for processing. Tools like MapR Event Store and Kafka can handle extremely large amounts of data from thousands of sources. MapR Event Store provides real-time, guaranteed event streaming that many enterprises find invaluable.
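To make the pattern concrete, here is a minimal in-process sketch of publish and subscribe in Python. This is an illustration of the pattern only, not the MapR Event Store or Kafka APIs (real brokers add persistence, partitioning, and delivery guarantees); the topic name and event fields are invented for the example.

```python
from collections import defaultdict

class Broker:
    """Toy in-process broker: routes each published event to every
    subscriber of its topic, immediately, one event at a time."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        # The analytics engine registers interest in a topic.
        self.subscribers[topic].append(callback)

    def publish(self, topic, event):
        # The data source publishes; every subscriber sees the event at once.
        for callback in self.subscribers[topic]:
            callback(event)

# The data source is the publisher; the analytics engine is the subscriber.
broker = Broker()
received = []
broker.subscribe("sensor-temps", received.append)
broker.publish("sensor-temps", {"machine": "press-4", "temp_c": 97.5})
```

The key property the sketch captures is push-based delivery: there is no polling step between the source producing an event and the subscriber processing it.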
Event streaming also requires a purpose-built processing engine, and there are some good ones from which to choose. You’ve probably been hearing a lot about Apache Spark recently. It’s an analytics engine that does all its processing in memory rather than reading and writing to disk. Spark Streaming is the component of Spark that’s optimized for handling real-time data. Two other options for streaming analytics are Apache Flink and Apache Storm. Each has its nuances and specialty areas, but they all basically do the same thing.
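What these engines have in common is windowed, in-memory computation over an unbounded stream. The sketch below shows the core idea in plain Python rather than any engine's actual API: a running average over the last few readings, recomputed as each event arrives. The reading values and window size are made up for illustration.

```python
from collections import deque

def sliding_window_average(events, window_size=3):
    """Yield a running average over the last `window_size` readings --
    the kind of windowed aggregate a streaming engine keeps in memory
    and updates incrementally as each event arrives."""
    window = deque(maxlen=window_size)  # old readings fall off automatically
    for value in events:
        window.append(value)
        yield sum(window) / len(window)

# Hypothetical temperature stream: a sudden jump shows up in the average
# within a couple of events, not hours later in a batch report.
readings = [70, 72, 71, 95, 98]
averages = list(sliding_window_average(readings))
```

A batch job would compute one average over the whole dataset after the fact; the streaming version produces an updated answer per event, which is what makes rapid response possible.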
You don’t want to use event streaming in every situation. Memory and bandwidth are expensive, and Hadoop works just fine for many analytics applications. Typically, you’d apply streaming to tasks with relatively urgent decision-making needs where speed is an important factor.
To use another retail example, you could use streaming to monitor point-of-sale data to tip you off that a lot of people are buying pizza today so that you can place an additional order with the bakery before supplies run out. But you would use batch analytics to pore over three years of sales data to discover that pizza sales are highest on Thursdays and so Wednesday orders should be a little higher. You’ll sell more pizza in both cases, but for different reasons.
Insights from event streams can also be linked to programmatic responses that make response times even faster. Think of it like a circuit breaker on steroids. When the engine detects an anomaly like an overheating machine on the factory floor, it can dispatch a work order with time-stamped information about the location of the machine and the details of the temperature fluctuation. This kind of detail can help repair crews better diagnose the problem and plan a response. That means fewer delays and errors.
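A programmatic response like the overheating example above can be sketched as a simple rule wired to the stream processor's output. The threshold, event fields, and work-order shape here are all hypothetical, chosen just to show the shape of the hand-off from detection to action.

```python
from datetime import datetime, timezone

THRESHOLD_C = 90.0  # hypothetical overheating threshold

def check_event(event):
    """Return a time-stamped work order when a reading crosses the
    threshold, or None for normal readings. In production this would
    feed a ticketing or dispatch system instead of returning a dict."""
    if event["temp_c"] > THRESHOLD_C:
        return {
            "machine": event["machine"],
            "location": event["location"],
            "temp_c": event["temp_c"],
            "issued_at": datetime.now(timezone.utc).isoformat(),
        }
    return None

order = check_event({"machine": "press-4", "location": "line 2", "temp_c": 97.5})
```

Because the work order carries the machine's location and the exact temperature reading, the repair crew arrives with a diagnosis already half done, which is where the "fewer delays and errors" payoff comes from.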
Utilities provide another example of how programmatic responses to streaming analytics can pay off. Many power companies are now installing remotely controlled thermostats in their customers’ homes and businesses. When electricity demand reaches certain peak levels, the thermostats can be instructed to raise temperatures a couple of degrees to reduce stress on the system. The utility avoids building new infrastructure and customers get a rate break for participating.
You can even create new business differentiators. A MapR customer in ad-tech is using streaming to collect real-time data from all of its global data centers and bring it to headquarters. This means their customers can monitor the performance of their ads in real time. Giving their customers dashboards to monitor global ad performance and the ability to react in real time sets this ad-tech company apart from its competition.
If time is of the essence in any aspect of your business, and it should be, then you should take a look at how event streaming can transform the way you run your business.