April 22, 2014 | BY Dr. Kirk Borne
A while back, I presented a Big Data Glossary: A to ZZ. In separate articles, I discussed some of the different entries in the glossary:
- K (K-anything in Data Mining) in “The K’s of Data Mining – Great Things Come in Pairs”
- P (Profiling) in “Data Profiling – Four Steps to Knowing Your Big Data”
- R (Recommender Engines) in “Design Patterns for Recommendation Systems – Everyone Wants a Pony” and “Personalization – It’s Not Just for Hamburgers Anymore”
- S (Support Vector Machines) in “The Importance of Location in Real Estate, Weather, and Machine Learning”
- ZZ (Zero bias, Zero variance) in “Statistical Truisms in the Age of Big Data”
Here, I focus on H (Hadoop), which is the evolving but increasingly standardized big data computing platform.
Hadoop is more than a compute engine, and it is more than a computing paradigm (such as MapReduce). Hadoop is now a full ecosystem for storing, accessing, querying, analyzing, mining, and processing big data. The Hadoop ecosystem includes the storage and processing platform, scripting language, database, analytics tools, query language, workflow manager, and more. Many of those components of the Hadoop stack are Apache open source projects. The MapR Data Platform includes more than 20 of these components for batch, stream, graph, and real-time data processing.
The pace of development in the Hadoop ecosystem is faster than most of us can keep up with. So, to help understand why we need so many components and what their particular roles are, I summarize here some of the most significant pieces:
MapReduce – this is the original programming model for Hadoop, used for parallel processing of massive data on clusters of commodity servers (Hadoop began as an open-source implementation of MapReduce). As its name implies, MapReduce carries out two basic compute functions: Map and Reduce. First, the input is divided into pieces and the Map function is applied to each piece in parallel across the data nodes. This is a “divide and conquer” approach that works very nicely on many big data problems. The additional advantage of the Map step is that it satisfies the mantra in big data computing: “move the computation to the data” (which is much more efficient than, and contrary to, traditional computing methods where the data are brought to the computation). Second, the Reduce function collects the responses to all of the sub-problems that were executed on the worker nodes and then combines them in appropriate ways to compute the desired final answer to the problem.
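To make the two phases concrete, here is a minimal single-machine sketch of the classic word-count problem in plain Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample input are assumptions for illustration, and the grouping step that a real framework performs between Map and Reduce is simulated with a dictionary.

```python
from collections import defaultdict

def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]

def shuffle(pairs):
    """Group emitted values by key (done by the framework between Map and Reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(splits):
    # On a cluster, each split would be mapped on the node that stores it.
    mapped = [pair for split in splits for pair in map_phase(split)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Hypothetical input: each string stands in for one block of a large file.
splits = ["big data is big", "data moves to compute", "compute moves to data"]
counts = word_count(splits)
print(counts)
```

Each call to `map_phase` depends only on its own split, which is what allows the work to run where the data already lives.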
HDFS – this is the Hadoop Distributed File System, which was specifically designed to run on large clusters of commodity servers, originally to support crawling the web. Because HDFS wasn’t designed to play a major role in an enterprise data platform, it has many innate weaknesses and shortfalls. To learn more about HDFS and the innovative MapR Data Platform that solves these problems, go here.
HBase – this is a NoSQL database that runs on Hadoop. HBase is scalable: it allows distributed storage (across many cluster nodes), provides low-latency access, and has the advantage of using a familiar table-like data structure. It is well suited for sparse data sets, which are common in many big data use cases. In simpler terms, HBase provides random, real-time read/write access to big data!
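The idea of a sparse, table-like structure with random reads and writes can be pictured with a toy in-memory model. This is only a conceptual sketch, not the HBase API: the `SparseTable` class, its methods, and the `family:qualifier` column names are hypothetical, chosen to show that a row stores only the cells it actually has.

```python
from collections import defaultdict

class SparseTable:
    """Toy stand-in for a wide-column table: row key -> {column: value}."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        # Random write: only the addressed cell is touched.
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        # Random read: a missing cell costs nothing and returns the default.
        return self._rows.get(row_key, {}).get(column, default)

    def stored_cells(self):
        return sum(len(columns) for columns in self._rows.values())

table = SparseTable()
table.put("user#1001", "profile:name", "Ada")
table.put("user#1001", "profile:email", "ada@example.com")
table.put("user#1002", "profile:name", "Grace")
table.put("user#1002", "activity:last_login", "2014-04-22")

# Two rows and three distinct columns, but only four cells are stored.
print(table.stored_cells())
print(table.get("user#1002", "profile:email"))
```

A relational table with the same three columns would reserve six cells, two of them empty; the sparse layout simply omits them.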
Hive – this is a data warehousing solution for Hadoop, providing ETL-like capabilities (for Extract, Transform, and Load tasks) for large datasets residing in distributed storage. Tables are stored as flat files on Hadoop. Hive is queried through HiveQL (a variant of SQL), and Hive translates those queries into MapReduce jobs.
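As a rough illustration of what translating a query into a MapReduce job means, the sketch below shows how a simple aggregate could be expressed as a map step and a reduce step. It is written in plain Python and does not reflect how Hive actually compiles queries; the `page_views` table, its columns, and the function names are assumptions made for the example.

```python
from collections import defaultdict

# The query being imitated (hypothetical table and columns):
#   SELECT country, COUNT(*) FROM page_views GROUP BY country;

# Hypothetical rows of page_views(user_id, country).
page_views = [
    ("u1", "US"), ("u2", "DE"), ("u3", "US"),
    ("u4", "FR"), ("u5", "DE"), ("u6", "US"),
]

def map_row(row):
    """Map: the GROUP BY column becomes the key, each row contributes a 1."""
    _user_id, country = row
    return country, 1

def run_group_by_count(rows):
    # Shuffle: collect the mapped values for each key.
    grouped = defaultdict(list)
    for row in rows:
        key, value = map_row(row)
        grouped[key].append(value)
    # Reduce: COUNT(*) is the sum of the 1s collected for each key.
    return {key: sum(values) for key, values in grouped.items()}

result = run_group_by_count(page_views)
print(result)
```

The appeal of HiveQL is that the analyst writes only the one-line query and never has to think about the map and reduce steps underneath.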
Pig – this is a data processing and data workflow language for Hadoop. It allows higher-level function calls than native Hadoop, which makes large-scale processing possible with less attention to the intricacies of writing Java-based Hadoop code. Pig is useful for building data processing pipelines or manipulating the raw data stream (not stored in the “data warehouse”).
YARN (Yet Another Resource Negotiator) – this is the resource management layer introduced with Hadoop 2, and it is sometimes described as MapReduce 2.0. YARN manages Hadoop jobs much better than the original MapReduce framework did, because it separates the two major tasks of resource management and job scheduling/monitoring.
Spark – this is a new addition to the Hadoop family. Spark is a fast, general-purpose engine for large-scale data processing – putting the “spark” in such tasks by allowing parallel, complex, interactive, in-memory calculations on big data. Spark includes several sub-projects for interactive querying, machine learning, graph processing, and stream processing.
Drill – this is a tool for web-scale data analytics, that is, interactive ad hoc analysis at scale! Drill is a distributed execution engine for interactive queries to “drill” into your big data, enabling queries across billions of data records. Drill provides SQL query capabilities on big data and works with both schema-driven and schema-less datasets.
These and many more components and capabilities of the Hadoop ecosystem are included in the MapR Distribution, the leader in Big Data Hadoop Solutions, as shown in the diagram of the MapR Data Platform. More information can also be found in the free ebook, The Executive's Guide to Big Data & Apache Hadoop.