WHY IS SPARK ON MAPR BETTER?
Everything on One Cluster
Accessing Data In-Place
A confluence of several technology shifts has dramatically changed machine learning applications. The combination of distributed computing, streaming analytics, and machine learning is accelerating the development of next-generation intelligent applications, which take advantage of modern computational paradigms and the infrastructure that powers them. The MapR Data Platform integrates global event streaming, real-time database capabilities, and scalable enterprise storage with Hadoop, Spark, Apache Drill, and ML libraries to power this new generation of data processing pipelines and intelligent applications. Diverse and open APIs allow all types of analytics workflows to run on the data in place.
The MapR XD Distributed File and Object Store is designed to store data at exabyte scale, support trillions of files, and combine analytics and operations into a single platform. MapR XD supports industry-standard protocols and APIs, including POSIX, NFS, S3, and HDFS. Unlike Apache HDFS, which follows a write-once, append-only paradigm, the MapR Data Platform delivers a true read-write, POSIX-compliant file system. Support for the HDFS API enables Spark and Hadoop ecosystem tools, for both batch and streaming, to interact with MapR XD. Support for POSIX enables Spark and all non-Hadoop libraries to read from and write to the distributed data store as if the data were mounted locally, which greatly expands the possible use cases for next-generation applications. Support for an S3-compatible API means MapR XD can also serve as the foundation for Spark applications that leverage object storage.
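To make the read-write POSIX point concrete, here is a minimal sketch of an in-place update using nothing but ordinary Python file I/O. It assumes the cluster file system is mounted locally; the temporary directory below is only a stand-in for such a mount point (the path, file name, and `update_in_place` helper are hypothetical and not taken from MapR documentation).

```python
import os
import tempfile


def update_in_place(path: str, offset: int, data: bytes) -> None:
    """Overwrite bytes at a given offset with ordinary POSIX-style file I/O.

    This kind of random write is what a read-write file system permits and
    an append-only store does not.
    """
    with open(path, "r+b") as f:  # open for reading and writing, no truncation
        f.seek(offset)
        f.write(data)


# Assumption: in a real deployment this would be a directory under the
# locally mounted cluster file system. A temp directory stands in for it here.
mount_root = tempfile.mkdtemp()
path = os.path.join(mount_root, "orders.csv")

# Write a small file the usual way.
header = b"id,status\n"
with open(path, "wb") as f:
    f.write(header + b"1,pending\n")

# Change "pending" to "shipped" in the middle of the file, without rewriting it.
update_in_place(path, len(header) + len(b"1,"), b"shipped")

with open(path, "rb") as f:
    content = f.read()

print(content.decode())
```

Because the code uses only standard file calls, the same logic would apply to any library that reads and writes local paths, which is the practical benefit the paragraph above describes.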
The MapR Event Store for Apache Kafka is the first big-data-scale streaming system built into a unified data platform and the only big data streaming system to support global event replication reliably at IoT scale. Support for the Kafka API enables Spark streaming applications to interact with data in real time in a unified data platform, which minimizes maintenance and data copying.
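As an illustration of what Kafka API support can look like from Spark, the sketch below builds the options for Spark Structured Streaming's standard `kafka` source. Several details are assumptions rather than statements from the text above: the `<stream path>:<topic>` naming convention, the `/apps/sensors` stream path, the `readings` topic, and the placeholder broker address. The helper names are hypothetical, and `read_events` expects an existing `SparkSession` to be passed in.

```python
from typing import Any, Dict


def event_store_topic(stream_path: str, topic: str) -> str:
    """Build a fully qualified topic name.

    Assumption: topics are addressed as '<stream path>:<topic>'.
    """
    return f"{stream_path}:{topic}"


def kafka_source_options(
    stream_path: str,
    topic: str,
    brokers: str = "placeholder:9092",  # assumed placeholder value
) -> Dict[str, str]:
    """Options for Spark's 'kafka' Structured Streaming source."""
    return {
        "kafka.bootstrap.servers": brokers,
        "subscribe": event_store_topic(stream_path, topic),
        "startingOffsets": "earliest",
    }


def read_events(spark: Any, options: Dict[str, str]) -> Any:
    """Return a streaming DataFrame; `spark` is an existing SparkSession."""
    return spark.readStream.format("kafka").options(**options).load()


# Hypothetical stream path and topic, for illustration only.
options = kafka_source_options("/apps/sensors", "readings")
print(options["subscribe"])
```

Keeping the option-building separate from the Spark call makes the configuration easy to inspect and unit test without a running cluster.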
MapR Database is a high-performance NoSQL database built into the MapR Data Platform. MapR Database is multi-model: wide-column, key-value with the HBase API, or JSON (document) with the OJAI API. Spark connectors are integrated for both HBase and OJAI APIs, enabling real-time and batch pipelines with MapR Database:
- The MapR Database Connector for Apache Spark enables you to use MapR Database as a sink for Spark Structured Streaming or Spark Streaming.
- The Spark MapR Database Connector enables users to perform complex SQL queries and updates on top of MapR Database, while applying critical techniques such as projection and filter pushdown, custom partitioning, and data locality.
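Projection and filter pushdown, mentioned in the list above, can be illustrated with a small conceptual sketch. This is not the connector's API: `ToyTable` and its `scan` method are hypothetical stand-ins for a data source that applies the filter and the column selection itself, so that only the needed rows and fields are handed back to the query engine.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Sequence

Row = Dict[str, object]


class ToyTable:
    """A toy data source that evaluates filters and projections at the source."""

    def __init__(self, rows: Iterable[Row]) -> None:
        self._rows: List[Row] = list(rows)
        self.rows_shipped = 0  # how many rows left the "storage layer"

    def scan(
        self,
        columns: Optional[Sequence[str]] = None,
        predicate: Optional[Callable[[Row], bool]] = None,
    ) -> Iterator[Row]:
        for row in self._rows:
            # Filter pushdown: rows that fail the predicate are never shipped.
            if predicate is not None and not predicate(row):
                continue
            self.rows_shipped += 1
            # Projection pushdown: only the requested columns are shipped.
            if columns is None:
                yield dict(row)
            else:
                yield {name: row[name] for name in columns}


table = ToyTable(
    [
        {"id": 1, "temp": 21.5, "site": "north"},
        {"id": 2, "temp": 35.0, "site": "south"},
        {"id": 3, "temp": 40.2, "site": "north"},
    ]
)

hot = list(table.scan(columns=["id", "temp"], predicate=lambda r: r["temp"] > 30))
print(hot, table.rows_shipped)
```

Without pushdown, all three full rows would cross the boundary and be filtered afterward; with it, only two narrowed rows do, which is where the savings come from at scale.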
MapR puts the key technologies essential to achieving high scale and high reliability into a fully distributed architecture that spans on-premises, cloud, and multi-cloud deployments, including edge-first IoT, while dramatically lowering both the hardware and operational costs of your most important applications and data.