January 11, 2016 | BY Dr. Kirk Borne
Are people in your data analytics organization contemplating the impending data avalanche from the internet of things and thus asking this question: “Spark or Hadoop?” That’s the wrong question!
The internet of things (IOT) will generate massive quantities of data. In most cases, these will be streaming data from ubiquitous sensors and devices. Often, we will need to make real-time (or near real-time) decisions based on this tsunami of data inputs. How will we efficiently manage all of this, make effective use of it, and become lord over it before it becomes lord over us?
The answer is already rising in our midst.
1. The Fellowship of Things
Organizations have already been collecting large quantities of heterogeneous data from numerous sources (web, social, mobile, call centers, customer contacts, databases, news sources, networks, etc.). This “fellowship” of data collections is already being tapped (or it should be) to derive competitive intelligence and business insights. Our data infrastructures and analytics ecosystems are evolving through the acquisition of data scientists and big data science capabilities, to allow us to explore and exploit this rich fellowship of data sources. We can start now to use our existing data analytic assets (or to build them up) in order to become lord of the things before the IOT overruns Middle Earth (I mean… our middleware environments).
2. The Two Powers
Hadoop and Spark are not opposed to one another. In fact, they are complementary in ways that are essential for dealing with IOT’s big data and fast analytics requirements. Specifically,
Hadoop is a distributed data infrastructure (for clustering the data), while Spark is a data processing package (for cluster computing).
Clustering the data – Apache Hadoop distributes massive data collections across many nodes within a cluster of commodity servers, which is absolutely critical for today's huge datasets since otherwise we would need to buy and maintain hugely expensive custom hardware. Hadoop indexes and keeps track of where every chunk of data resides, thus enabling big data operations (processing and analytics) far more effectively than any prior data management infrastructure. The Hadoop cluster is easily extensible by adding more commodity servers to the cluster.
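To make the "keeps track of where every chunk of data resides" idea concrete, here is a minimal sketch in plain Python. It is a toy model, not Hadoop's actual code or API: the names distribute and read_back, the tiny block size, the round-robin placement, and the two-copy replication are all assumptions chosen for illustration.

```python
def distribute(data, nodes, block_size=8, replication=2):
    """Split data into blocks, copy each block to several nodes, and index the placement.

    Toy model only: the round-robin placement and tiny block size are illustrative
    assumptions, not how a real Hadoop cluster chooses block locations.
    """
    if replication > len(nodes):
        raise ValueError("replication cannot exceed the number of nodes")
    storage = {node: {} for node in nodes}  # node name -> {block id: bytes}
    index = {}                              # block id -> nodes holding a copy
    for block_id, start in enumerate(range(0, len(data), block_size)):
        chunk = data[start:start + block_size]
        holders = [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for node in holders:
            storage[node][block_id] = chunk
        index[block_id] = holders
    return index, storage


def read_back(index, storage, failed=()):
    """Reassemble the data, reading each block from the first node that is still alive."""
    parts = []
    for block_id in sorted(index):
        live = [node for node in index[block_id] if node not in failed]
        if not live:
            raise RuntimeError(f"block {block_id} has no surviving copy")
        parts.append(storage[live[0]][block_id])
    return b"".join(parts)


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c"]  # hypothetical commodity servers
    index, storage = distribute(b"sensor readings from many IOT devices", nodes)
    print(index)                                         # which nodes hold which block
    print(read_back(index, storage, failed={"node-b"}))  # still readable with one node down
```

Adding a server in this toy model is just one more name in the nodes list, which mirrors the point about extending the cluster with more commodity hardware.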
Cluster computing on the data – Apache Spark is a fast data processing package that operates on the distributed data collections that reside on the Hadoop cluster. Spark can hold the data in memory and carry out analytics much more quickly than MapReduce, which is the processing tool that traditionally came with Hadoop. The big difference is this: MapReduce operates in atomic processing steps, while Spark operates on a dataset en masse. The MapReduce workflow looks like this: read data from the cluster, perform an operation, write results (updated data) to the cluster, read updated data from the cluster, perform next operation, write next results to the cluster, etc. This is fine if your data operations and reporting requirements are mostly static, and you can wait for data processing to execute in batch mode (which is the antithesis of IOT workflows). By contrast, Spark completes the full set of data analytics operations in-memory and in near real-time (interactively, with an arbitrary number of additional analyst-generated queries and operations). The Spark workflow looks like this: read data from the cluster, perform all of the requisite analytic operations, write results to the cluster, done!
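The two workflows described above can be sketched side by side. This is a toy simulation in plain Python, not MapReduce or Spark code: ToyCluster, run_step_by_step, run_in_memory, and the three sample steps are hypothetical names invented here, and the "cluster" is just a dictionary that counts how often it is read and written.

```python
class ToyCluster:
    """Stand-in for distributed storage that counts every read and write."""

    def __init__(self, records):
        self._data = {"input": list(records)}
        self.reads = 0
        self.writes = 0

    def read(self, name):
        self.reads += 1
        return list(self._data[name])

    def write(self, name, records):
        self.writes += 1
        self._data[name] = list(records)


def run_step_by_step(cluster, steps):
    """MapReduce-style: every step reads its input from storage and writes its output back."""
    name, result = "input", None
    for i, step in enumerate(steps):
        records = cluster.read(name)
        result = step(records)
        name = f"stage-{i}"
        cluster.write(name, result)
    return result


def run_in_memory(cluster, steps):
    """Spark-style: read once, chain every step in memory, write once."""
    records = cluster.read("input")
    for step in steps:
        records = step(records)
    cluster.write("output", records)
    return records


# Hypothetical three-step analysis of temperature readings (assumed sample data).
STEPS = [
    lambda rows: [r for r in rows if r > -100.0],  # drop sentinel "no reading" values
    lambda rows: [r * 1.8 + 32.0 for r in rows],   # convert Celsius to Fahrenheit
    lambda rows: [max(rows)] if rows else [],      # keep only the peak reading
]
READINGS = [21.5, -999.0, 23.0, 22.25]

if __name__ == "__main__":
    a, b = ToyCluster(READINGS), ToyCluster(READINGS)
    run_step_by_step(a, STEPS)
    run_in_memory(b, STEPS)
    print("step by step:", a.reads, "reads,", a.writes, "writes")
    print("in memory:   ", b.reads, "reads,", b.writes, "writes")
```

Both functions return the same answer. What differs is the count of round trips to storage, which grows with the number of steps in the first case and stays at one read and one write in the second.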
The Spark workflow is excellent for streaming data analytics and for applications that require multiple operations. This is very important for data science applications since most machine learning algorithms require multiple operations: train the model, test and validate the model, refine and update the model, test and validate, refine and update, etc. Similarly, Spark is very adaptable in allowing you to do repeated ad hoc "what if" queries of the same data in memory, or to perform the same analytic operations on streaming data as they flow through memory. All of those operations require fast full access to the data. Consequently, if the data had to be re-read from the distributed data cluster at every step, the process would be very time-consuming (and a complete non-starter for IOT applications). Spark skips all of those intermediate time-consuming read-process-write-index operations -- Spark performs all of the analytic operations at the minimal cost of one read operation.
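The train, validate, and refine loop is where holding data in memory pays off, because every pass touches the full dataset. The sketch below is plain Python under stated assumptions, not Spark's API: CachedDataset, load_sensor_pairs, and fit_slope are hypothetical names, and the "model" is a one-parameter line fitted by ordinary batch gradient descent.

```python
class CachedDataset:
    """Loads rows from a (slow) source on first use, then serves them from memory."""

    def __init__(self, loader):
        self._loader = loader
        self._cache = None
        self.loads = 0  # how many times the underlying source was actually read

    def rows(self):
        if self._cache is None:
            self._cache = list(self._loader())
            self.loads += 1
        return self._cache


def load_sensor_pairs():
    """Hypothetical source: (x, y) pairs that follow y = 3x exactly."""
    return [(float(x), 3.0 * x) for x in range(1, 6)]


def fit_slope(dataset, passes=200, learning_rate=0.01):
    """Fit y = w * x by batch gradient descent; every pass scans the whole dataset."""
    w = 0.0
    for _ in range(passes):
        rows = dataset.rows()
        # Gradient of the mean squared error with respect to w.
        gradient = sum(2.0 * (w * x - y) * x for x, y in rows) / len(rows)
        w -= learning_rate * gradient
    return w


if __name__ == "__main__":
    dataset = CachedDataset(load_sensor_pairs)
    print("fitted slope:", fit_slope(dataset))
    print("reads from source:", dataset.loads)
```

Two hundred refinement passes cost a single read of the source here. Without the cache, the same loop would go back to storage two hundred times.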
Applications of Spark include real-time marketing campaigns, real-time ad selection and placement, online customer product recommendations, cybersecurity analytics, video surveillance analytics, machine log monitoring, and more (including IOT applications that we are only beginning to anticipate and envision). Some experiments have shown that Spark can be 10 times faster for batch processing and up to 100 times faster for in-memory analytics operations, compared to traditional MapReduce operations.
Because Hadoop is a data infrastructure and Spark is an in-memory multistep data processing package, the two are separable -- one does not require the other. Since Spark does not come with its own file management system, it needs to be integrated with a file system. The file system can be HDFS (the Hadoop Distributed File System) or another data platform, such as a cloud-based one. However, since Spark was designed as a fast analytic layer on top of the Hadoop data layer, it is more efficient and effective to use them together. That is a great value proposition since Spark comes with a rich set of algorithms (and more are on the way – it is currently the most active open source project within the Apache ecosystem). Apache Spark comes with its own machine learning library (MLlib), as well as libraries for graph processing (GraphX), SQL queries (Spark SQL), and stream processing (Spark Streaming).
One final consideration is failure recovery -- Hadoop is naturally resilient to system faults or failures since data are written to disk after every operation. Fortunately, Spark has similar built-in resiliency because its data objects are stored in Resilient Distributed Datasets (RDDs). An RDD is distributed across the data cluster -- its data objects can be stored in memory or on disk, and partitions lost to a fault or failure can be rebuilt, so RDDs provide full recovery.
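How can an in-memory dataset recover from a failure? Spark's RDDs do it by remembering the chain of transformations (the lineage) that produced each partition, so a lost partition can be recomputed from the original input. The sketch below illustrates that idea in plain Python. It is a toy, not Spark's implementation: ToyRDD and its lose method are hypothetical names invented for this example.

```python
class ToyRDD:
    """Toy partitioned dataset that remembers how its partitions were derived."""

    def __init__(self, source_partitions, transforms=()):
        self._source = source_partitions      # durable input, e.g. blocks on the cluster
        self._transforms = tuple(transforms)  # lineage: functions applied in order
        self._cache = {}                      # partition id -> computed rows held in memory

    def map(self, fn):
        """Return a new dataset whose lineage has one more step; nothing is computed yet."""
        return ToyRDD(self._source, self._transforms + (fn,))

    def partition(self, i):
        """Return partition i, recomputing it from the source if it is not in memory."""
        if i not in self._cache:
            rows = self._source[i]
            for fn in self._transforms:
                rows = [fn(row) for row in rows]
            self._cache[i] = rows
        return self._cache[i]

    def lose(self, i):
        """Simulate a node failure that wipes partition i from memory."""
        self._cache.pop(i, None)

    def collect(self):
        return [row for i in range(len(self._source)) for row in self.partition(i)]


if __name__ == "__main__":
    readings = ToyRDD([[1, 2], [3, 4]]).map(lambda x: x * 10).map(lambda x: x + 1)
    before = readings.collect()
    readings.lose(1)                     # partition 1 disappears from memory
    print(readings.collect() == before)  # the lost partition is rebuilt from its lineage
```

Only the lost partition is recomputed in this sketch. The surviving partition is served straight from memory.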
3. Return to the Things
So, how do the two powers (Hadoop and Spark) help us to become the lord of the things? A single converged data system for fast streaming data (such as the recently announced MapR Streams) gives us the one platform to rule them all. The platform’s ultimate strength against the IOT comes from the alliance of the two powers.
In summary, Spark completes Hadoop in a way that MapReduce only started. Spark uses fast memory (RAM) for analytic operations on Hadoop-provided data, while MapReduce uses slow bandwidth-limited network and disk I/O for its operations on Hadoop data. MapReduce was a groundbreaking data analytics technology in its time. Spark is the groundbreaking data analytics technology of our time.
The MapR Converged Data Platform brings all of the data allies together into one analytics army to master and rule the massive data science requirements of the internet of things. So, ask yourself the right question: can I afford to enter the IOT era without the twin powers of Hadoop and Spark?