Mid-year Updates for Big Data Trends: Apache Kafka, Spark, Flink, Drill, and More

In January, I made predictions about six big data trends for 2016 (“What Will You Do in 2016?”). Now that we are a bit past the midpoint of the year, it’s a good time to revisit them: how well do the predictions match what has happened so far in 2016, what is surprising about that, and what is likely to come in the second half of the year?

As a spoiler, I actually predicted Pokémon GO….

Prediction 1

“You will come up with some innovative way to put big data to use that has not yet occurred to me.”

Yes, it’s true that I had no idea when I wrote this prediction that people of all ages would be walking around outside with their smartphones, “catching” virtual beasties. I am writing this post less than two weeks after the release of Pokémon GO in Australia, New Zealand, and the U.S. As of 5 days a(Pokémon)go, this product is reported to be the most active mobile game in the U.S. ever, and daily uniques have surpassed those of Twitter. So while I did not predict the game itself—or its fantastic initial impact—I did predict that you would surprise me. And you did.


I also predicted a huge upsurge in people putting streaming data to use in new ways and a big presence for telecommunications in the big data arena. Pokémon GO bears out those predictions as well. Looking good so far…

Now, let’s revisit each of the predictions in a more serious way.

Prediction 2

“There will be explosive interest in streaming data and streaming analytics.”

Yes, yes, yes. As predicted, there’s a lot of excitement around the topic of streaming data, for both transport and processing. The popular Apache Spark project provides Spark Streaming to handle processing in near real time through a mostly in-memory, micro-batching approach. And as I suggested, there is increasing interest in the Apache Flink project, including outside of Europe where it originated. Flink is a streaming data engine that makes it possible to process data in real time or in batch mode, with high throughput and fault tolerance guarantees.
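To make the micro-batching idea concrete, here is a minimal sketch in plain Python. It is not Spark Streaming code and does not use the Spark API; the function name `micro_batches` and the sample events are made up for illustration. It only shows the general technique: incoming events are grouped into fixed-length time windows, and each window is then handed off as one small batch.

```python
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, payload)


def micro_batches(events: Iterable[Event], interval: float) -> Iterator[List[Event]]:
    """Group time-ordered events into consecutive windows of `interval` seconds.

    Hypothetical helper for illustration only; it is not part of any Spark API.
    """
    batch: List[Event] = []
    window_end = None
    for ts, payload in events:
        if window_end is None:
            # The first event fixes where the first window ends.
            window_end = ts + interval
        # Close every window that ended before this event arrived.
        # Quiet periods therefore produce empty batches.
        while ts >= window_end:
            yield batch
            batch = []
            window_end += interval
        batch.append((ts, payload))
    if batch:
        yield batch


if __name__ == "__main__":
    sample = [(0.0, "a"), (0.4, "b"), (1.1, "c"), (3.2, "d")]
    for i, b in enumerate(micro_batches(sample, interval=1.0)):
        print(i, [payload for _, payload in b])
```

A true streaming engine processes each event as it arrives instead of waiting for a window to close, which is the main design difference between the two approaches described above.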

I also predicted a rise in awareness of messaging tools with capabilities that support efficient streaming architectures. That shift shows up in the growing popularity of the Apache Kafka message transport and of the new messaging system called MapR Event Store for Apache Kafka, which supports the Kafka 0.9 API but is integrated into the MapR Data Platform. Both have happened, to an even greater degree than I would have thought.

Streaming data analytics

Message streams, shown here as horizontal cylinders, are the heart of a streaming architecture. Multiple applications (consumers) can share the streaming data without danger of cross-interference. Here we remind you of four popular data processors, although you would not likely be using them all at once. (Image © E.Friedman 2016)
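One way to see why multiple consumers can share a stream without interfering with each other is that the stream behaves like an append-only log, and each consumer keeps its own read position. The sketch below is a toy in-memory model of that idea, assuming nothing about the real Kafka or MapR client libraries; the class `MessageLog` and the consumer names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


class MessageLog:
    """Toy append-only log with one read offset per consumer (illustrative only)."""

    def __init__(self) -> None:
        self._messages: List[str] = []
        # Each consumer name maps to the index of the next message it will read.
        self._offsets: Dict[str, int] = defaultdict(int)

    def publish(self, message: str) -> int:
        """Append a message and return its offset in the log."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, consumer: str, max_messages: int = 10) -> List[str]:
        """Return the next unread messages for this consumer and advance its offset."""
        start = self._offsets[consumer]
        chunk = self._messages[start:start + max_messages]
        self._offsets[consumer] = start + len(chunk)
        return chunk


if __name__ == "__main__":
    log = MessageLog()
    for msg in ["login", "purchase", "logout"]:
        log.publish(msg)
    print(log.poll("fraud-check", max_messages=2))  # reads its first two messages
    print(log.poll("dashboard"))                    # independently reads all three
    print(log.poll("fraud-check"))                  # resumes where it left off
```

Because reading never removes a message, a slow or newly added consumer does not affect what the others see.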

Based on topics discussed at international big data conferences in the spring and early summer, such as the Strata conferences in San Jose and London, the Hadoop Summit conferences in Dublin and San Jose, Spark Summit in San Francisco, and the Berlin Buzzwords conference, streaming data is very much the rage. With co-author Ted Dunning, I’ve done a lot of book signings for the O’Reilly publication titled Streaming Architecture, and the people who show up are enthusiastically seeking information about how to design streaming projects and about the technologies that best support them.

The Berlin Buzzwords conference in June, for instance, particularly demonstrated people’s enthusiasm for streaming data, with 17 presentations on stream-related topics, including a keynote on streaming with Kafka-style message transport, and 9 technical talks on Apache Flink.

Prediction 3

“Businesses want practical ways to get to value faster… you are likely to try out Apache Drill some time in 2016 if your business has any need for SQL.”

Apache Drill has had a good year so far, with substantial advances in the April version 1.6 release and additional improvements in the latest June version 1.7 release. Some excellent Drill user stories have come to my attention. In particular, some businesses are initially adopting Drill as a way to greatly simplify their data preparation, especially where diverse data sources are involved.

It will be interesting to see what happens as more people discover Drill’s optimization and enhanced performance in the recent releases.

Prediction 4

“Many organizations will want a system with secure and reliable ways to maintain multiple data centers that can be quickly synchronized.”

This prediction was certainly on target. People are looking for better ways to securely share data, including among different data centers. Businesses in telecommunications, utilities, the financial sector, IoT, and ad tech are all showing a strong interest in reliable and affordable data communication between multiple data centers, not only as insurance for disaster recovery, but also as a fundamental aspect of their business architecture. There’s a crossover with the prediction about streaming data as well: people are surprised and pleased to learn that the message stream transport technology known as MapR Event Store (formerly MapR Streams), which is integrated into the MapR Data Platform, is unusual in its ability to carry out geo-distributed replication of streaming data across data centers. This ability is in addition to MapR’s basic features for low-cost synchronized mirroring and direct table replication across clusters and across data centers.

Prediction 5

“Use of big data in the health care industry is poised for rapid expansion.”

I think this one was a partial miss for me. There are advances in healthcare that are making use of big data resources, but the “rapid expansion” I predicted was overstated. There’s a wealth of benefits in applying these techniques to the healthcare industry, in research, in quality of care delivery, and in the business of healthcare and insurance, but the industry also faces challenges, including regulatory and privacy issues. Data scientist Joe Blue talks about these challenges in a short interview.

The potential for growth in this area is still as great as I suggested, but so far it is still moving slowly.

Prediction 6

“Another area that will increasingly stand out in the big data space in 2016 is telecommunications.”

Once again, yes, yes, and yes. The enormous customer base and the wide-ranging crossovers between telecommunications and other industries using big data, such as real-time updates and discounts for shoppers, traffic and navigation applications, and virtual reality games, all make telecommunications a major force in big data use cases. Telecoms face massive amounts of streaming data and the challenges of complex billing, and they need in-the-moment insights and sophisticated machine learning models to be ready for usage surges or outages.

Some of these approaches involve anomaly detection, and the high level of interest in this topic was apparent at a Strata + Hadoop World London talk in June by MapR Chief Application Architect Ted Dunning. The talk was titled “Anomaly Detection in Telecom with Spark,” and the room was so heavily packed, including the aisles, that a guard at the door was turning people away. Ted talked about “the practical architecture as well as design patterns and some detailed algorithms for detecting anomalies in event streams.”
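For readers who have not seen anomaly detection on an event stream before, here is a deliberately simple sketch in plain Python. It is not the method from Ted’s talk and does not use Spark; it is a basic rolling z-score check, with a hypothetical function name, window size, and threshold chosen only for illustration. Each new value is compared with the mean and standard deviation of the few values just before it.

```python
from collections import deque
from math import sqrt
from typing import Deque, Iterable, Iterator, Tuple


def detect_anomalies(
    values: Iterable[float], window: int = 5, threshold: float = 3.0
) -> Iterator[Tuple[int, float]]:
    """Yield (index, value) for values far from the mean of the preceding window.

    Illustrative rolling z-score check; the defaults are assumptions, not tuned values.
    """
    history: Deque[float] = deque(maxlen=window)
    for i, x in enumerate(values):
        # Only score a value once a full window of earlier values exists.
        if len(history) == window:
            mean = sum(history) / window
            variance = sum((h - mean) ** 2 for h in history) / window
            std = sqrt(variance)
            # A perfectly flat window has std == 0; this sketch skips scoring then.
            if std > 0 and abs(x - mean) > threshold * std:
                yield i, x
        history.append(x)


if __name__ == "__main__":
    calls_per_minute = [10, 11, 9, 10, 11, 10, 50, 10, 9]
    print(list(detect_anomalies(calls_per_minute)))
```

Note that a flagged value still enters the window in this sketch, so it inflates the baseline for the next few readings. A production detector would likely handle that more carefully.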

Missed Topic: Containers

In my 2016 predictions, I did not talk about containers, but I should have. Containers provide a wrapper around a software application to provide a known, stable environment such that the application will always run the same way. Docker is one of the leading providers of containers for big data applications, both for cloud and data centers. By using a containerization tool such as Docker, you can run multiple containers on the same server, all sharing one operating system. By comparison, virtual machines running on a server each carry their own operating system. This means that you can run more containers than VMs on a server, but because containers share that operating system, a container is not quite as isolated, and so not quite as secure, as a VM.

As businesses are increasingly focused on how to build and maintain effective big data applications in production in order to harvest the value in working with new data scales and sources, there’s naturally an increased interest in what containerization has to offer.

Looking Forward

We started this wrap-up with the prediction that you innovators in big data would surprise me with how you’d put these new technologies and techniques together to do something unexpected, and you did.

Now the question is, what will you surprise me with next?

This blog post was published July 19, 2016.
