
Getting Started with Apache Spark 2.x: From Inception to Production

Apache Spark is a powerful execution engine for large-scale parallel data processing across a cluster of machines, enabling rapid application development and high performance. Spark 2.0 and later versions introduced major improvements that make Spark easier to program and faster to execute:

  • Spark SQL and the Dataset/DataFrame APIs provide ease of use, space efficiency, and performance gains with Spark SQL's optimized execution engine.
  • Spark ML provides a uniform set of high-level APIs built on top of DataFrames, combining the scalability of partitioned data processing with the ease of SQL for data manipulation.
  • GraphFrames extends Spark GraphX with a DataFrame-based API, making graph-parallel and data-parallel computations easier to program and more efficient.

Through in-depth use cases and code examples, this 2.x update to the Getting Started with Spark ebook covers:

  • Spark 101: What It Is, What It Does, and Why It Matters
  • Datasets, DataFrames, and Spark SQL
  • How Spark Runs Your Applications
  • Demystifying AI, Machine Learning, and Deep Learning
  • Predicting Flight Delays Using Apache Spark Machine Learning
  • Cluster Analysis on Uber Event Data to Detect and Visualize Popular Uber Locations
  • Real-Time Analysis of Popular Uber Locations Using Apache APIs: Spark Structured Streaming, Machine Learning, Kafka, and MapR-DB
  • Predicting Forest Fire Locations with K-Means in Spark
  • Using Apache Spark GraphFrames to Analyze Flight Delays and Distances

