Spark 2.0 Is Now in Developer Preview Mode on the MapR Platform

There’s been a lot of buzz and high expectations in the big data community around Apache Spark 2.0 and how it will impact the development of data pipelines, streaming applications, machine learning algorithms, and the many other use cases that Apache Spark enables.

Good news: the wait is now over! You can now get your hands dirty with Spark 2.0 on the MapR Data Platform. Whether you’re a data engineer, developer, or data scientist, Spark 2.0 offers a broad range of capabilities that you can take advantage of. This release is currently in developer preview mode, which means that it’s not recommended for production use and is not supported beyond the community forum.

To get the latest and greatest documentation on installing, upgrading, configuring, and using Spark 2.0 Developer Preview with MapR, please check out our Apache Spark documentation. If you’re new to Spark, the MapR Sandbox provides the easiest way to get started with Spark.

What are the new capabilities in Spark 2.0?

Structured Streaming with Spark SQL Streaming

Spark SQL Streaming introduces the concept of repeated queries, wherein a particular query is executed repeatedly against every incoming micro-batch of data (DStreams of RDDs). Specific benefits that Spark SQL Streaming provides include:

  • The ability to perform interactive queries against live streaming data.
  • Output can now be aggregated in a stream for continuous applications.

So how does this benefit the development of analytical applications? With Spark on the MapR Platform, you can now build analytic applications that serve external customers directly. With Spark SQL Streaming, analytics can be pre-computed continuously as the data is generated, then served up to web applications by persisting the output in MapR Database.
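To make the repeated-query model concrete, here is a minimal sketch of a continuous aggregation using the Spark 2.0 streaming DataFrame API. The input path, schema, and column names are illustrative assumptions, not part of any specific MapR deployment:

```scala
// Sketch only: paths, schema, and columns are hypothetical examples.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder
  .appName("StreamingAggregation")
  .getOrCreate()

// File sources require an explicit schema.
val eventSchema = new StructType()
  .add("userId", StringType)
  .add("action", StringType)

// Treat a directory of incoming JSON files as an unbounded table.
val events = spark.readStream
  .schema(eventSchema)
  .json("/data/incoming")

// The same query is applied repeatedly to each new micro-batch,
// and the aggregate is maintained incrementally across batches.
val counts = events.groupBy("userId").count()

val query = counts.writeStream
  .outputMode("complete") // emit the full updated aggregate each trigger
  .format("console")
  .start()

query.awaitTermination()
```

In a customer-facing application, the console sink above would be replaced by a writer that persists each updated aggregate to a serving store such as MapR Database.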

Spark as a Compiler

Spark now acts as a compiler with whole-stage code generation: once it parses a query, it understands the operations the user wants to perform and generates specialized code for those functions, rather than interpreting a chain of generic, pre-compiled operator functions. The two most important things to highlight here are:

  • Whole-stage code generation is provided by the second-generation Tungsten engine.
  • This eliminates the overhead of many virtual function calls by flattening a SQL query into a single function that is compiled to JVM bytecode at runtime.
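You can observe whole-stage code generation directly in a query's physical plan. The sketch below, assuming a local Spark 2.0 session, runs a simple aggregation; operators that the Tungsten engine fuses into a single generated function are prefixed with an asterisk in the output of `explain()`:

```scala
// Sketch: inspecting whole-stage code generation in Spark 2.0.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("WholeStageCodegen")
  .getOrCreate()

// A simple aggregation over a large generated range.
val df = spark.range(1000L * 1000 * 1000)
  .selectExpr("id % 10 AS key")
  .groupBy("key")
  .count()

// explain() prints the physical plan; stages fused by whole-stage
// code generation appear with a "*" prefix.
df.explain()
```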

DataFrame APIs

  • Runs on the same engine as Spark SQL.
  • Allows access to data from a variety of different data sources.
  • Can run database-like operations or allow for passing in custom code.
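The three points above can be sketched in a few lines of Spark 2.0 code. The file path and column names here are hypothetical; the same `spark.read` API works across sources such as JSON, Parquet, Hive tables, and JDBC:

```scala
// Sketch only: path and columns are illustrative assumptions.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("DataFrameExample")
  .getOrCreate()
import spark.implicits._

// Access data from an external source (JSON here; other formats
// and sources use the same reader API).
val people = spark.read.json("/data/people.json")

// Database-like operations: filter, group, aggregate...
val adults = people.filter($"age" >= 18).groupBy("city").count()

// ...or custom code: converted to a Dataset, rows can be processed
// with arbitrary user functions on the same engine.
val upperNames = people.select("name").as[String].map(_.toUpperCase)

adults.show()
```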

We recently announced the MapR Platform including Apache Spark, which makes it easier for customers to adopt Spark as their primary compute engine. This gives our customers a converged compute and storage engine for batch, analytics, and real-time processing that helps them build and deploy applications rapidly. Customers such as Terbium Labs have built cutting-edge applications with Spark on MapR. MapR also offers Quick Start Solutions for use cases such as security log analytics, time series analytics, and stream processing, which let you get up and running with Spark quickly and combine it with other components of the MapR Platform.

We would like to congratulate the Spark community on the 2.0 release and look forward to Spark 2.0 going GA on the MapR Platform in the coming weeks. In the meantime, we encourage you to start testing and experimenting with Spark 2.0 on the MapR Data Platform.

This blog post was published June 16, 2016.
