A couple of weeks ago I had the fun of meeting several hundred of the several thousand people who gathered in the San Jose Convention Center for Strata + Hadoop World 2017. Luck was with me from the start – I not only found a parking place each day, I was also able to find it again at the end of each day. That, combined with good conversations, interesting presentations and the general buzz of big data, made for a good event. To give you a taste of what people were curious to know, here are some of my impressions. If you want more details on any of the topics, feel free to contact me so I can introduce you to the speakers I mention (email@example.com | Twitter @Ellen_Friedman).
My experience at the conference started with a new tutorial track on Tuesday called Data 101, which was well attended and well received. The first speaker, Edd Wilder-James, VP of Strategy at Silicon Valley Data Science, focused on a variety of technologies including attractive tools for deep learning and ended with an engaging demonstration of notebooks for data science, using as an example the challenge of how to recognize a moving train in video.
I spoke next, to address the question “Why Stream?” and explore the surprising advantages of a stream-first architecture, which include but also go beyond real-time or low-latency applications. At the heart of a successful streaming architecture is a message transport technology with the right capabilities: Apache Kafka and MapR Streams are two innovative technologies that meet the requirements well.
With the right messaging technology at the heart of a stream-first architecture, you can support several classes of use cases. (A) Low-latency insights, such as updating a real-time dashboard. (B) The current status of a particular subject of interest, such as patients’ electronic medical records, stored in a database or search document. (C) If the messages are durable, a long-term history of the events, in this case perhaps an event-by-event record useful for insurance inquiries. Figure from Chapter 1 of Streaming Architecture by Ted Dunning & Ellen Friedman © 2016, used with permission.
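The three classes of use cases (A), (B) and (C) can be sketched with a toy in-memory log. This is only an illustration under stated assumptions: `MessageLog`, its methods and the sample patient events are hypothetical names made up here, and the class stands in for a durable transport such as Apache Kafka or MapR Streams without using either product's API.

```python
from collections import defaultdict


class MessageLog:
    """Hypothetical append-only log standing in for a durable message stream."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Store a message and return its offset in the log."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read_from(self, offset):
        """Return every message at or after the given offset."""
        return self._messages[offset:]


log = MessageLog()

# Made-up sample events; in the article's example these would be medical test results.
for event in [
    {"patient": "p1", "test": "pulse", "value": 72},
    {"patient": "p2", "test": "pulse", "value": 95},
    {"patient": "p1", "test": "pulse", "value": 80},
]:
    log.append(event)

dashboard = defaultdict(int)  # (A) running counts for a real-time dashboard
current_status = {}           # (B) latest value per patient, as a database would hold

for message in log.read_from(0):
    dashboard[message["patient"]] += 1
    current_status[message["patient"]] = message["value"]

# (C) Because the log keeps its messages, the full event-by-event history
# can still be replayed later from offset 0.
history = log.read_from(0)

print(dict(dashboard))
print(current_status)
print(len(history))
```

The point of the sketch is that all three views are derived from the same stream of messages, so adding a new view does not require changing the producer.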
The Data 101 speaker who followed me was Jim Scott, Director of Strategy at MapR Technologies. Jim talked about the growing interest in cloud computing and how to decide which use cases are best run in cloud vs on-premise or when a hybrid cloud-on-premise architecture is advantageous, especially when you leverage cloud as an infrastructure.
Jim Scott speaking in the Data 101 tutorial track at Strata San Jose.
A recurring secondary theme shared by these three initial Data 101 presenters was the role of Hadoop in modern big data systems. We three speakers used different analogies but all made the point that while Hadoop has been a pioneer in big data and Hadoop use cases are still of interest, it is just one of many current approaches. Hadoop is one of many workloads on a modern big data platform.
Cloud architecture was also the topic of a presentation on Wednesday by Dale Kim, Senior Director of Industry Solutions at MapR. Dale talked about the requirements for efficient global cloud processing, including the advantages of being able to run the same application code everywhere, with reliability and security across the global system. Having a global single namespace is important in this regard as well as continuous coordinated data flow, location awareness, strong consistency and omni-directional replication.
Dale had a busy schedule at the conference. He also presented a 10-minute Solutions Showcase talk in the O’Reilly booth on streaming and microservices and how they were used in a financial services solution developed by MapR to handle the fast data involved in stock trades.
A financial services solution that leverages streaming plus a microservices approach. Image from slide deck of Dale Kim, used with permission.
Dale Kim talking to the crowd at the MapR booth about native JSON support in MapR-DB, the NoSQL database that is engineered into the MapR Converged Data Platform, and describing how the platform enables operational and analytic applications.
Geo-distributed data and computation was another topic that drew attention. I talked with many people about this, in short presentations at the MapR booth and while signing copies of a new O’Reilly data report “Data Where You Want It: Geo-Distribution of Big Data and Analytics” written with my co-author Ted Dunning, Chief Application Architect at MapR. This report discusses the importance of multi-master table and stream replication as part of the capabilities that enable an effective global data fabric, whether on premise, in the cloud or with a hybrid architecture. We also describe how to run stateful applications in essentially stateless containers and provide example use cases of globally distributed systems.
You can download a free copy of this O’Reilly report via the MapR website at this link: https://mapr.com/geo-distribution-big-data-and-analytics/
Streaming data – how to collect it, deliver it, and analyze it – was a topic that attracted a lot of interest at Strata. One key idea that many people asked about was how streaming data transport can serve as the lightweight connection between microservices. This idea was explored in a Wednesday session titled “Machine Learning and Microservices” presented by Nitin Bandugula, Director of Professional Services at MapR. He talked about how pervasive machine learning at scale has become, in categories of use cases that include marketing optimization and targeting, risk detection and prevention and operational intelligence.
Nitin also explained how a message transport that decouples producers from consumers, as MapR Streams does, can act as the connector between microservices and how this isolation approach makes implementation of new machine learning models more flexible and more efficient.
In this diagram, horizontal cylinders represent message streams while the gear icons stand for isolated microservices. From presentation slide deck of Nitin Bandugula, used with permission.
You can see how the insertion of a message stream in this architecture for a credit card fraud detection project provides isolation and flexibility. By making the data available to multiple independent consumers via a stream, instead of directly updating the last card use database, you provide the option for multiple projects to take advantage of this data without interfering with each other. Furthermore, this streaming-based microservices design also makes it easier to deploy and test new fraud detection models for this same project without interrupting the original model being used in production.
In this microservices style approach, a message stream (shown as a horizontal cylinder in this diagram) handles data related to credit card activity. This card activity data not only can be used to update the database of last card use that is needed for the fraud detection model to function, it is also available to other consumers. Diagram is from Figure 6-3 in the O’Reilly book “Streaming Architecture” by Ted Dunning and Ellen Friedman © 2016, used here with permission.
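The isolation described here can be sketched in a few lines. This is a minimal illustration, not the design from the slides or the book: `Stream`, `Consumer`, the handler functions and the card events are hypothetical names invented for this sketch, and a real deployment would use a durable transport such as MapR Streams or Apache Kafka rather than a Python list.

```python
class Stream:
    """Hypothetical in-memory stand-in for a message stream of card activity."""

    def __init__(self):
        self._log = []

    def publish(self, event):
        self._log.append(event)

    def poll(self, offset):
        """Return the events at or after the given offset."""
        return self._log[offset:]


class Consumer:
    """Each consumer tracks its own offset, so consumers never interfere."""

    def __init__(self, stream, handler):
        self.stream = stream
        self.handler = handler
        self.offset = 0

    def run_once(self):
        events = self.stream.poll(self.offset)
        for event in events:
            self.handler(event)
        self.offset += len(events)
        return len(events)


stream = Stream()
last_card_use = {}  # plays the role of the "last card use" database


def update_last_use(event):
    last_card_use[event["card"]] = event["location"]


db_updater = Consumer(stream, update_last_use)

stream.publish({"card": "c1", "location": "San Jose"})
stream.publish({"card": "c2", "location": "Berlin"})
db_updater.run_once()

# Later, a candidate fraud model is attached as a second, independent consumer.
# It replays the stream from the start without touching the production consumer.
candidate_seen = []
candidate = Consumer(stream, candidate_seen.append)

stream.publish({"card": "c1", "location": "Paris"})
processed_by_updater = db_updater.run_once()
processed_by_candidate = candidate.run_once()

print(last_card_use)
print(processed_by_updater, processed_by_candidate)
```

Because the producer only writes to the stream, the new consumer can be added, tested or removed without any change to the producer or to the consumer that maintains the database.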
Stream processing is also a topic of great interest. I got a number of inquiries about Apache Flink, a stream-processing engine that handles real-time or batch processing efficiently at scale. This open source project is better known in Europe, but it’s growing in popularity in North America as well. Jamie Grier, Director of Applications Engineering at data Artisans, presented a Flink session on Thursday at Strata, and he and his colleagues staffed a data Artisans booth to answer questions at the conference. I was also pleased to see their CEO, Kostas Tzoumas, who co-authored a short book on Flink with me last year. The Flink community is gearing up for a Flink Forward conference in San Francisco on 10 – 11 April 2017, the first time it has been held in North America.
Apache Flink is an open source engine for stream processing at scale. It can also be used for batch processing. You can read this book online at https://mapr.com/ebooks/intro-to-apache-flink/
Machine learning and deep learning are topics of widespread and growing interest. A big audience showed up for the playfully titled session “Tensor Abuse in the Workplace” presented by MapR’s Ted Dunning on Wednesday afternoon. His talk expanded on the idea expressed in his abstract that while “tensors are the latest fad in machine learning… there is real content beyond the buzzword.” Ted gave a simple but elegant explanation of how tensors work, why they are especially well suited for computation on GPUs and why they work in other situations as well. Finally, Ted gave an example of “Tensor Abuse” when he showed how the strengths of TensorFlow (tensors, automatic differentiation and good optimizers) could be used to solve a problem that had nothing to do with machine learning. The example made it clear that tensors and the associated techniques have much broader applicability than just machine learning.
Tensors at work in an example presented by Ted Dunning during his machine learning session at Strata San Jose.
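To give a feel for how automatic differentiation plus an optimizer can solve a problem unrelated to machine learning, here is a small sketch. It is not the example from Ted's talk and it does not use TensorFlow: the `Dual` class is a hypothetical pure-Python forward-mode differentiator (TensorFlow itself uses reverse-mode differentiation on tensors), and the problem, finding the square root of 2 by minimizing (x² − 2)², is chosen here purely for illustration.

```python
class Dual:
    """Hypothetical dual number: carries a value and its derivative together."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def _lift(other):
        # Treat plain numbers as constants (derivative zero).
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = Dual._lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __sub__(self, other):
        other = Dual._lift(other)
        return Dual(self.value - other.value, self.deriv - other.deriv)

    def __mul__(self, other):
        other = Dual._lift(other)
        # Product rule: (uv)' = u v' + u' v
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)

    __rmul__ = __mul__


def loss(x):
    """Zero exactly when x * x == 2, positive otherwise."""
    residual = x * x - 2.0
    return residual * residual


def gradient(f, x):
    """Derivative of f at x, obtained by seeding the input derivative with 1."""
    return f(Dual(x, 1.0)).deriv


# Plain gradient descent; the step size and iteration count are arbitrary choices.
x = 1.0
for _ in range(200):
    x = x - 0.05 * gradient(loss, x)

print(x)
```

Nothing in the loop knows how to differentiate `loss` by hand; the derivative falls out of the arithmetic, which is the property that lets the same machinery be pointed at problems outside machine learning.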
And that brings me to the last sample of the conference, ending at a beginning with a note on the keynote Ted presented Wednesday morning. He challenged the audience to think about how the Internet has been turned upside down in the Internet of Things, as shown in this diagram (used with permission).
He went on to describe a new way to deal with the massive volume of data coming from IoT sensors: MapR Edge. This small-footprint cluster of 3 to 5 nodes is designed to sit right next to data-producing devices. This extends processing with security to the IoT edge, an important option especially for situations where data volume and rate are massive but latency must be kept low.
Ted closed the short presentation with a surprise live demo: He had a small MapR Edge cluster with him. The stage had been set up with motion sensors. So Ted quickly jumped up and down while huge monitors displayed the data coming in from these IoT motion sensors, in real time.