Apache Spark is a powerful execution engine for large-scale parallel data processing across a cluster of machines. It enables rapid application development and high performance. Spark 2.0 and later versions introduced major improvements that make Spark easier to program and faster to execute:
Spark SQL and the Dataset/DataFrame APIs provide ease of use, space efficiency, and performance gains with Spark SQL's optimized execution engine.
Spark ML provides a uniform set of high-level APIs built on top of DataFrames, combining the scalability of partitioned data processing with the ease of SQL for data manipulation.
GraphFrames extends Spark GraphX with a DataFrame-based API, making graph-parallel and data-parallel computations easier to program and more efficient.