Dataware for data-driven transformation

Apache Spark

Unified Analytics Engine for Large-Scale Distributed Data Processing and Machine Learning


Getting Started with Apache Spark 2.x from Inception to Production

Apache Spark logo


Apache Spark is a powerful unified analytics engine for large-scale distributed data processing and machine learning. On top of the Spark core data processing engine are libraries for SQL, machine learning, graph computation, and stream processing. These libraries can be used together in many stages of modern data pipelines and allow for code reuse across batch, interactive, and streaming applications. Spark is useful for ETL processing, analytics and machine learning workloads, batch and interactive processing of SQL queries, machine learning inference, and artificial intelligence applications.

The Power of Data Pipelines

Much of Spark's power lies in its ability to combine very different techniques and processes into a single, coherent whole. Outside Spark, the discrete tasks of selecting data, transforming that data in various ways, and analyzing the transformed results might easily require a series of separate processing frameworks, such as Apache Oozie. Spark, on the other hand, offers the ability to combine these, crossing boundaries between batch, streaming, and interactive workflows in ways that make the user more productive.

Spark jobs perform multiple operations consecutively, in memory, only spilling to disk when required by memory limitations. Spark simplifies the management of these disparate processes, offering an integrated whole – a data pipeline that is easier to configure, run, and maintain. In use cases such as ETL, these pipelines can become extremely rich and complex, combining large numbers of inputs and a wide range of processing steps into a unified whole that consistently delivers the desired result.

Spark pipelines and API diagram

Predicting Flight Delays with Apache Spark Machine Learning

Learn more about Apache Spark's MLlib, which makes machine learning scalable and easier with ML pipelines built on top of DataFrames.

In this video, you will see an example from the eBook Getting Started with Apache Spark 2.x.


Spark cons icon


  • Before Spark, there was MapReduce, a scalable, resilient distributed processing framework that enabled Google to index the exploding volume of content on the web across large clusters of commodity servers.
  • With MapReduce, iterative algorithms require chaining multiple MapReduce jobs together. This causes a lot of reading and writing to disk. For each MapReduce job, data is read from a distributed file block into a map process, written to and read from a file in between, and then written to an output file from a reducer process.
  • Spark jobs diagram
  • The MapReduce Java API is not easy to program with, although Pig and Hive make this somewhat easier.
  • MapReduce, Pig, and Hive are only for batch ETL, and data sources are limited to Hadoop.
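To make the map/shuffle/reduce pattern described above concrete, here is a minimal in-process sketch of MapReduce-style word count in plain Python (purely illustrative, with no distribution). In real MapReduce, each phase boundary is a round trip through the distributed file system, which is exactly what makes chained iterative jobs slow.

```python
# Minimal in-process sketch of the MapReduce word-count pattern:
# a map phase, a shuffle (group-by-key), and a reduce phase.
from collections import defaultdict

docs = ["spark makes pipelines", "pipelines chain spark jobs"]

# Map: emit (word, 1) pairs
mapped = [(w, 1) for line in docs for w in line.split()]

# Shuffle: group values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: sum the counts for each key
counts = {word: sum(vals) for word, vals in grouped.items()}
print(counts)
```

In a real cluster, the mapped pairs and reduced output would each be written to and read from disk; chaining several such jobs multiplies that I/O cost.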
Spark pros icon


  • Apache Spark began life in 2009 as a project within the AMPLab at the University of California, Berkeley. The goal of the Spark project was to keep the benefits of MapReduce's scalable, distributed, fault-tolerant processing framework while making it more efficient and easier to use. Spark is designed for speed:
    • Spark runs multi-threaded lightweight tasks inside of JVM processes, providing fast job startup and parallel multi-core CPU utilization.
    • Spark caches data in memory across multiple parallel operations, making it especially fast for parallel processing of distributed data with iterative algorithms.
  • Spark driver program on MapR nodes example diagram
  • Spark provides a rich functional programming model and comes packaged with higher level libraries for SQL, machine learning, streaming, and graphs.
  • Spark's Structured API provides the same API for batch and real-time streaming. Spark's architecture supports tight integration with a number of leading storage solutions in the Hadoop ecosystem and beyond, including Apache HDFS, MapR XD Distributed File and Object Store, Apache HBase, MapR Database JSON, Apache Kafka, and Apache Hive.



From log files to sensor data, application developers increasingly have to cope with streams of data. This data arrives in a steady stream, often from multiple sources simultaneously. While it is possible to store these data streams on disk and analyze them retrospectively, it is sometimes necessary to process and act upon the data as it arrives. Streams of data related to financial transactions, for example, can be processed in real time to identify – and refuse – potentially fraudulent transactions.


As data volumes grow, machine learning approaches become more feasible and increasingly accurate. Software can be trained to identify and act upon triggers within well-understood datasets before applying the same solutions to new and unknown data. Spark’s ability to store data in memory and rapidly run repeated queries makes it a good choice for training machine learning algorithms. Running broadly similar queries again and again, at scale, significantly reduces the time required to go through a set of possible solutions in order to find the most efficient algorithms.


Rather than running pre-defined queries to create static dashboards of sales or production line productivity or stock prices, business analysts and data scientists want to explore their data by asking a question, viewing the result, and then either altering the initial question slightly or drilling deeper into results. This interactive query process requires systems like Spark that are able to respond and adapt quickly.


Data produced by different systems across a business is rarely clean or consistent enough to be simply and easily combined for reporting or analysis. ETL processes are often used to pull data from different systems, clean and standardize it, and then load it into a separate system for analysis. Spark (and Hadoop) are increasingly being used to reduce the cost and time required for this ETL process.


Spark enables developers, data engineers, and data scientists to collaborate and combine SQL, streaming data, machine learning, and graph processing into modern data pipelines to rapidly access, transform, and analyze big data at scale.

Dev and engineers icon


  • Easier, faster data pipelines:
  • Develop and deploy applications that run 10-100x faster in production environments with in-memory processing of data
  • Build complex ETL pipelines that can speed up data ingestion and deliver superior performance
  • Spark SQL's Structured Data API simplifies the complexity of data access, transformation, and storage across distributed file systems, different file formats, streaming data, and NoSQL data stores
  • Combine event streams with machine learning to handle the logistics of machine learning in a flexible way by:
    • Making input and output data available to independent consumers
    • Managing and evaluating multiple models and easily deploying new models
Data scientists icon


  • Easier, faster time to insight:
  • Provides a uniform set of high-level machine learning pipeline APIs built on top of DataFrames to make machine learning scalable with the ease of SQL for data manipulation
  • Integrated distributed machine learning algorithms for classification, regression, collaborative filtering, clustering, dimensionality reduction, and frequent pattern mining
  • Leverage Spark and deep learning with external libraries including BigDL, Spark Deep Learning Pipelines, TensorFlowOnSpark, dist-keras, H2O Sparkling Water, PyTorch, Caffe, and MXNet

Free Hadoop Training: Spark Essentials

Get a glimpse of what free Hadoop on-demand training is like in this preview of the course "DEV 360 - Introduction to Apache Spark (Spark v2.1)."

If you're interested in this free on-demand course, learn more about it here.


Everything on One Cluster

Spark advantage on MapR diagram

Accessing Data In-Place

Diagram showing all data in one place on MapR Platform

A confluence of several technology shifts has dramatically changed machine learning applications. The combination of distributed computing, streaming analytics, and machine learning is accelerating the development of next-generation intelligent applications, which take advantage of modern computational paradigms powered by modern computational infrastructure. The MapR Data Platform integrates global event streaming, real-time database capabilities, and scalable enterprise storage with Hadoop, Spark, Apache Drill, and other ML libraries to power this new generation of data processing pipelines and intelligent applications. Diverse and open APIs allow all types of analytics workflows to run on the data in place.

The MapR XD Distributed File and Object Store is designed to store data at exabyte scale, support trillions of files, and combine analytics and operations into a single platform. MapR XD supports industry-standard protocols and APIs, including POSIX, NFS, S3, and HDFS. Unlike Apache HDFS, which is a write once, append-only paradigm, the MapR Data Platform delivers a true read-write, POSIX-compliant file system. Support for the HDFS API enables Spark and Hadoop ecosystem tools, for both batch and streaming, to interact with MapR XD. Support for POSIX enables Spark and all non-Hadoop libraries to read and write to the distributed data store as if the data were mounted locally, which greatly expands the possible use cases for next-generation applications. Support for an S3-compatible API means MapR XD can also serve as the foundation for Spark applications that leverage object storage.

The MapR Event Store for Apache Kafka is the first big-data-scale streaming system built into a unified data platform and the only big data streaming system to support global event replication reliably at IoT scale. Support for the Kafka API enables Spark streaming applications to interact with data in real time in a unified data platform, which minimizes maintenance and data copying.

MapR Database is a high-performance NoSQL database built into the MapR Data Platform. MapR Database is multi-model: wide-column, key-value with the HBase API, or JSON (document) with the OJAI API. Spark connectors are integrated for both HBase and OJAI APIs, enabling real-time and batch pipelines with MapR Database:

  • The MapR Database Connector for Apache Spark enables you to use MapR Database as a sink for Spark Structured streaming or Spark Streaming.
  • The Spark MapR Database Connector enables users to perform complex SQL queries and updates on top of MapR Database, while applying critical techniques such as projection and filter pushdown, custom partitioning, and data locality.

MapR puts the key technologies essential to high scale and high reliability into a fully distributed architecture that spans on-premises, cloud, and multi-cloud deployments, including edge-first IoT, while dramatically lowering both the hardware and operational costs of your most important applications and data.

"We are very excited about the new features [in MapR]. Spark Structured Streaming allows us to use advanced analytics on real-time oil well data, while Drill allows us to explore the same data using SQL. This helps us make operational decisions faster."

Eric Keister, advanced analytics and emerging technologies manager at Anadarko