MapR Platform Including Spark

The MapR Platform including Spark incorporates the latest innovations from the Spark open source community.

The MapR Platform not only provides advanced high availability (HA) and data protection features such as resilience to multiple node failures, snapshots, and mirroring for disaster recovery (DR), but also enables seamless Spark access and data management capabilities through industry-standard interfaces such as NFS and ODBC.

MapR Breakthrough Innovations

  • Performance-optimized architecture for faster data processing and analytics
  • Architecture designed specifically for high availability across all cluster operations
  • Automatic disaster recovery through mirroring to synchronize data across clusters
  • Direct Access NFS™ for real-time data access to Spark data
  • Distributed metadata to support trillions of files in a single cluster
  • Comprehensive security controls to protect sensitive data
  • Consistent snapshots for accurate point-in-time recovery
  • MapR Heatmap™ for instant cluster insights
  • MapR volumes for easier policy management around security, placement, retention, and quotas
  • Integrated NoSQL and event streaming for advanced real-time capabilities

You can try the MapR Sandbox, install MapR on an on-premises cluster, or deploy it in the cloud.

Apache Spark is a general-purpose engine for large-scale data processing. It delivers in-memory processing for big data, supports rapid application development, and allows for code reuse across batch, interactive, and streaming applications. The most popular use cases for Apache Spark include building data pipelines and developing machine learning models. MapR is the choice for production Spark applications.

The MapR Platform including Spark consists of the complete Spark stack engineered to support advanced analytic applications, along with patented innovations in the MapR Platform, plus key open source projects that complement Spark. This enables advanced analytics including batch processing, machine learning, SQL, and graph computation. Because Spark runs seamlessly on MapR, it benefits from the platform’s patented enterprise-grade features such as web-scale storage, high availability, mirroring, snapshots, NFS, integrated security, and a global namespace.

Support for the Complete Spark Stack

MapR was the first in the industry and remains the only one to support the entire Spark stack. This includes Spark SQL, Spark Streaming, MLlib, GraphX, and SparkR. These components enable customers to develop and deploy the widest range of use cases with Spark including building data pipelines and developing advanced analytical applications leveraging machine learning.

Analytics on Consistent Data. The MapR Platform including Spark enables data scientists to perform analytics on consistent data in both development and production environments through features such as mirroring and consistent snapshots.

Secure Multi-Tenant Applications. The MapR Platform including Spark enables development of reliable and secure multi-tenant applications.

Run Streaming & NoSQL Workloads Together. The MapR Platform including Spark enables the development of streaming and NoSQL applications on a single cluster. By using Spark Streaming, MapR Event Store, and MapR Database together, you can develop real-time operational applications that require data ingestion at high speeds.

Use Cases

Faster Batch Applications. You can now develop and deploy batch applications that run 10-100X faster in production environments with in-memory processing of data. Quantium uses Spark with MapR to decrease processing time by 92%, which represents a 12.5X increase in performance.

Complex ETL Data Pipelines. You can leverage the Spark stack to build complex ETL pipelines that can speed up data ingestion and deliver superior performance. Razorsight leverages Spark with MapR to build a more efficient and cost-effective data pipeline which enables them to deliver cloud-based predictive analytics faster to their mobile and telco operators.

Advanced Analytics. You can leverage MLlib and GraphX to develop applications that combine the power of machine learning with graph technology. This can enable faster application development, and can enable data scientists to test new hypotheses faster. Novartis uses Spark with MapR to integrate and analyze a variety of data to accelerate drug research.
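
The Quantium figures above (a 92% decrease in processing time, stated as a 12.5X performance increase) are related by simple arithmetic: if a job takes a fraction (1 - r) of its original time, it runs 1 / (1 - r) times faster. A minimal sketch of that conversion follows; the function name is illustrative only and is not part of any MapR or Spark API.

```python
def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional reduction in processing time into a speedup factor.

    A reduction of 0.92 means the job now takes 8% of its original time.
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - reduction)


# The 92% reduction quoted in the text corresponds to a 12.5X speedup.
print(f"{speedup_from_reduction(0.92):.1f}X")  # prints "12.5X"
```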

MapR has added significant innovations to improve Spark performance, reliability, and flexibility.


Standards-based file access. Unlike other distributions, MapR provides true Network File System (NFS) capabilities. MapR Direct Access NFS™ lets you access Spark like a standard file system, to copy data in and out easily at high rates, or to access Spark data using common command line tools and desktop applications. The optional add-on MapR POSIX Client provides authenticated NFS access from remote nodes, along with over-the-wire compression and parallel access to boost throughput.
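
Because an NFS mount appears to client programs as an ordinary directory, standard file APIs are enough to move data in and out. The sketch below uses only the Python standard library. The helper names and the mount path mentioned in the note afterwards are assumptions for illustration, not values taken from this document; the mount root is passed in as a parameter so the code can be tried against any directory.

```python
import shutil
from pathlib import Path


def copy_into_cluster(local_file: Path, mount_root: Path, dest_dir: str) -> Path:
    """Copy a local file into a directory beneath an NFS-mounted cluster path.

    mount_root is wherever the cluster is mounted on this machine (an assumption
    supplied by the caller); dest_dir is a path relative to that mount.
    """
    target_dir = mount_root / dest_dir
    target_dir.mkdir(parents=True, exist_ok=True)
    # shutil.copy2 returns the path of the newly created file.
    return Path(shutil.copy2(local_file, target_dir))


def list_files(mount_root: Path, pattern: str = "*") -> list:
    """Return the sorted names of files under the mount that match a glob pattern."""
    return sorted(p.name for p in mount_root.rglob(pattern) if p.is_file())
```

For example, with the cluster mounted at a hypothetical path such as /mapr/my.cluster.com, calling copy_into_cluster(Path("events.csv"), Path("/mapr/my.cluster.com"), "landing/raw") would place the file where Spark jobs can read it.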

Industry standards. MapR fully supports additional industry-standard APIs, including ODBC/JDBC, LDAP, Kerberos, HBase, HDFS, NFS, and more.


Kerberos and LDAP integration. MapR supports authentication services via Kerberos and/or LDAP.

Access control. Data is secured using standard Unix file permissions and advanced role-based access control expressions (ACEs).

Native authentication. MapR also offers a standards-based authentication system as a simpler alternative to Kerberos that leverages Linux Pluggable Authentication Modules (PAM) to provide the widest registry support.

Comprehensive auditing. MapR auditing logs help to analyze user behavior as well as to meet regulatory compliance requirements. MapR uses the JSON format to log accesses at the administrative, authentication, database, and file levels.

Performant wire-level encryption. MapR encrypts data sent between nodes and applications to ensure data privacy, using Intel AES-NI capabilities where available.
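
Since the audit logs described above are written as JSON, they can be summarized with ordinary tooling. The following is a minimal sketch that assumes one JSON object per line and uses hypothetical field names ("user" and "operation"); the actual MapR audit record layout is not given in this document and may differ.

```python
import json
from collections import Counter


def count_operations_by_user(lines):
    """Tally (user, operation) pairs from JSON-lines audit records.

    The field names "user" and "operation" are assumptions for illustration.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        key = (record.get("user", "unknown"), record.get("operation", "unknown"))
        counts[key] += 1
    return counts


# Made-up sample records, not real MapR audit output.
sample_lines = [
    '{"user": "alice", "operation": "READ", "path": "/data/a.csv"}',
    '{"user": "alice", "operation": "READ", "path": "/data/b.csv"}',
    '{"user": "bob", "operation": "DELETE", "path": "/data/a.csv"}',
]
counts = count_operations_by_user(sample_lines)
for (user, operation), n in sorted(counts.items()):
    print(user, operation, n)
```

A summary like this is one way to spot unusual behavior, for instance a user issuing far more deletes than usual.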


Multi-tenancy. MapR supports multi-tenancy beyond the capabilities in YARN via advanced resource management and control to let distinct user groups, data sets, and applications coexist in isolation in the same cluster.

Security. MapR authentication and authorization controls provide another level of user and data isolation.

Volumes. MapR supports the logical grouping of files and directories on which policies (permissions, replication factors, quotas, etc.) can be set.

ExpressLane. MapR avoids starvation of small jobs by letting them run even when the cluster is busy with large jobs.

Job placement control. Even beyond YARN, MapR manages resources with label-based job placement to specify which nodes can run the specified job.

Data placement control. Configure the cluster topology to define on which nodes specific data is placed for performance, security, and optimal utilization purposes.

Performance and Scalability

Customers can reduce their data center footprint with the MapR performance advantage by deploying as few as one-third as many servers as other distributions require.

A MapR cluster can scale to thousands of nodes and can store trillions of files. MapR officially set the MinuteSort record by sorting 1.5 TB of data in under a minute on Google Compute Engine. A MapR customer has since exceeded that record by sorting 1.65 TB, with one-seventh the number of servers of the highest non-MapR record.

High Availability (HA)

MapR HA eliminates single points of failure to tolerate multiple node failures and ensure no unplanned downtime, no data loss, and no work loss. MapR HA requires no special configuration and is enabled automatically.

YARN HA and JobTracker HA. Jobs are tracked so that they run to completion despite node failures.

NFS HA. Continuous NFS access is ensured to avoid disruptions to standard file system access.

No-NameNode architecture. Cluster filename metadata is distributed to ensure the cluster data is always available and accessible.

Management and Monitoring

MapR Control System. The MapR Control System (MCS) is a browser-based interface for managing, administering, and monitoring your Spark cluster. It lets you immediately view the status of your cluster via heatmaps, and drill into specific issues to investigate any problems. Alarms proactively notify you if potential problems arise.

Rolling upgrades. To minimize planned downtime, MapR allows a node-by-node Spark upgrade on a live cluster. With MapR backward compatibility, existing applications can still run on an upgraded Spark cluster with no modifications.

Disaster Recovery (DR)

Customers can maintain continuity despite a site-wide disaster, and also can quickly recover damaged and accidentally deleted files.

Mirrors. Mirrors are consistent copies of a cluster replicated to a remote site, either on-premises or in the cloud. Scheduled mirroring incrementally updates the mirror by sending only block-level differentials from the source cluster to shorten the recovery point objective (RPO). Promotable mirrors enable fast and easy switchover of replicas to active production use to shorten the recovery time objective (RTO). Mirrors can also be used for load balancing as well as for wide geographic distribution to reduce network latency for distant end users.
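
To illustrate the general idea of sending only block-level differentials, the sketch below compares two byte strings block by block and transfers only the blocks that differ. This is not MapR's mirroring implementation, which this document does not describe; comparing SHA-256 hashes of fixed-size blocks is a stand-in technique chosen for illustration, and all names and the tiny block size are assumptions.

```python
import hashlib


def block_hashes(data: bytes, block_size: int) -> list:
    """Hash each fixed-size block of data (the last block may be shorter)."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(source: bytes, mirror: bytes, block_size: int) -> dict:
    """Return {block_index: source_block} for blocks that differ from the mirror."""
    src, dst = block_hashes(source, block_size), block_hashes(mirror, block_size)
    delta = {}
    for i, digest in enumerate(src):
        if i >= len(dst) or dst[i] != digest:
            delta[i] = source[i * block_size:(i + 1) * block_size]
    return delta


def apply_delta(mirror: bytes, delta: dict, block_size: int, new_length: int) -> bytes:
    """Bring the mirror up to date using only the changed blocks."""
    # Truncate or zero-pad the mirror to the source length, then patch blocks.
    buf = bytearray(mirror[:new_length].ljust(new_length, b"\0"))
    for i, block in delta.items():
        start = i * block_size
        buf[start:start + len(block)] = block
    return bytes(buf)


source = b"aaaabbbbccccdd"
mirror = b"aaaaXXXXcccc"
delta = changed_blocks(source, mirror, block_size=4)
updated = apply_delta(mirror, delta, block_size=4, new_length=len(source))
```

In this example only two of the four blocks need to be sent (the modified second block and the new trailing block), which is the property that keeps incremental updates small.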

Consistent snapshots. Capture the exact state of the cluster at the time the snapshot is taken, to enable point-in-time recovery of files that were corrupted or deleted due to application or user error. Snapshots are also useful for running machine learning algorithms on a static view of data, as well as for auditing data sets.
