Apache Spark, a powerful general-purpose engine for processing large amounts of data, has seen rapid adoption since its release. Recognizing its impact very early on, MapR has supported and invested in Spark as part of our Hadoop distribution so that enterprises can build applications with Spark and deploy them reliably in production. On June 6, 2016, we announced a separate distribution for the complete Spark stack: the MapR Platform including Apache Spark. Customers can now use this Spark-focused distribution to accelerate their big data mission.
Enterprises across different industries have started using Spark as a unified computing engine for many of their critical use cases. The MapR Platform including Spark addresses big data challenges and requirements with the following components:
Data Management Platform: A reliable, enterprise-grade converged data management platform that eliminates silos and minimizes data movement, thereby accelerating the insight-to-action cycle. The MapR Data Platform includes a Distributed File and Object Store, scalable NoSQL, and event streaming, and at the same time provides the following features:
Workflow Management: A scheduler system to manage a large number of Spark jobs as directed acyclic graphs of actions triggered by time and data availability.
Quick Start Solutions (QSS): A set of purpose-built solutions that allows you to jumpstart your most valuable and critical use cases for Spark.
Notebook*: An intuitive web-based notebook that data analysts and scientists can use to perform interactive analytics and visualization.
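To make the Workflow Management idea concrete, here is a minimal sketch of how a scheduler might decide which jobs in a directed acyclic graph are ready to run, given the current time and the datasets that have arrived. It is not MapR's scheduler or any product's API: the job names, the `triggers` table, and the `runnable_jobs` function are all hypothetical, and only Python's standard-library `graphlib` is used.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "ingest_logs": set(),
    "clean_logs": {"ingest_logs"},
    "train_model": {"clean_logs"},
    "daily_report": {"clean_logs"},
}

# Hypothetical triggers: earliest hour a job may start, and datasets it needs.
triggers = {
    "ingest_logs": {"not_before_hour": 1, "needs": {"raw_logs"}},
    "daily_report": {"not_before_hour": 6},
}


def runnable_jobs(workflow, triggers, hour, datasets, done):
    """Return jobs, in dependency order, that may start now.

    A job may start when all of its upstream jobs are done, the current
    hour has reached its time trigger, and every dataset it needs exists.
    """
    ready = []
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    for job in TopologicalSorter(workflow).static_order():
        if job in done:
            continue
        trigger = triggers.get(job, {})
        deps_done = workflow[job] <= done
        time_ok = hour >= trigger.get("not_before_hour", 0)
        data_ok = trigger.get("needs", set()) <= datasets
        if deps_done and time_ok and data_ok:
            ready.append(job)
    return ready
```

In this sketch, `daily_report` stays blocked until hour 6 even when its upstream job has finished, while `ingest_logs` waits for both its start hour and the arrival of `raw_logs`. A real scheduler would also handle retries, failures, and concurrency, which are left out here.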
Spark has received a lot of attention lately because of its easy programming paradigm and faster performance as compared to MapReduce in Hadoop. It also has a growing ecosystem of projects, which lets it handle a wider range of big data workloads.
With the emergence of Spark as a unified computing engine, developers can perform ETL and advanced analytics in both continuous (streaming) and batch mode, either programmatically (using Scala, Java, Python, or R) or declaratively with SQL (using Spark SQL or HiveQL).
With MapR converging the data management platform, you can now take a Spark-first approach. This differs from the traditional approach of starting with extended Hadoop tools and then adding Spark to your big data technology stack. As a unified computing engine, Spark can be used for faster batch ETL and analytics (with Spark core instead of MapReduce and Hive), machine learning (with Spark MLlib instead of Mahout), and streaming ETL and analytics (with Spark Streaming instead of Storm).
Enterprises that want to use Hadoop tools and Spark can get both. With this new distribution, we simply give customers a choice to start with Spark. You can add Hadoop tools on top of Spark for a very attractive incremental price. The end result (and price) is actually no different than starting with Hadoop and adding Spark. So you can enjoy the benefits of the MapR Platform including Spark while also running Hadoop tools like MapReduce, Hive, Pig, and Mahout.
One thing to note: Spark SQL depends on the Hive metastore to retrieve table schema information and to access tables registered there. Support for this Hive metastore is included as part of the MapR Platform including Spark, without the need for the full Hadoop module. A similar level of support is also included for enterprises that choose to run Spark on YARN.
The MapR Platform including Spark comes with web-scale storage (via MapR XD) that exposes both the HDFS and NFS interfaces, but what if you need key-value, wide-column, or document NoSQL databases? And what if you need a global publish-subscribe event streaming engine? With this Spark distribution, you always have the option to add modules such as MapR Database (NoSQL) and MapR Event Store (event streaming). These provide real-time and operational analytics capabilities. You can also add Apache Drill for BI, ad hoc, and exploratory analytics.
All Hadoop use cases can be addressed including:
If you’d like to learn more about developing applications with Spark, MapR provides free Apache Spark on-demand training courses:
If you’d like to take Spark for a test drive and experience all the powerful features that are part of the complete distribution, try out the MapR Sandbox with Apache Spark.
Have you completed the MapR Spark certification courses and the test drive successfully? Interested in jump-starting your Spark application and moving into high gear? Check out our Quick Start Solutions (QSS) for Spark, which include:
Looking for a one-stop destination for all things related to Spark? Explore our Apache Spark Resources & Product Information and free Apache Spark resources pages (ebook, videos, whitepapers, and more).
*Notebook for the MapR Platform including Spark will be added in later releases.