MapR Platform not only provides advanced high availability (HA) and data protection features such as resilience upon multiple node failures, snapshots, and mirroring for disaster recovery (DR), but also enables seamless Spark access and data management capabilities through industry-standard interfaces such as NFS and ODBC.
MapR Breakthrough Innovations
You can try the MapR Sandbox, install MapR on an on-premises cluster, or deploy it in the cloud.
Apache Spark is a general-purpose engine for large-scale data processing. It delivers in-memory processing for big data, supports rapid application development, and allows for code reuse across batch, interactive, and streaming applications. The most popular use cases for Apache Spark include building data pipelines and developing machine learning models. MapR is the choice for production Spark applications.
The MapR Platform, including Spark, consists of the complete Spark stack engineered to support advanced analytic applications, along with patented innovations in the MapR Platform, plus key open source projects that complement Spark. This enables advanced analytics including batch processing, machine learning, SQL, and graph computation. Because Spark runs seamlessly on MapR, it benefits from the platform’s patented enterprise-grade features such as web-scale storage, high availability, mirroring, snapshots, NFS, integrated security, and a global namespace.
Support for the Complete Spark Stack
MapR has added significant innovations to improve Spark performance, reliability, and flexibility.
Standards-based file access. Unlike other distributions, MapR provides true Network File System (NFS) capabilities. MapR Direct Access NFS™ lets you access cluster data like a standard file system, so you can copy data in and out easily at high rates or work with Spark data using common command-line tools and desktop applications. The optional add-on MapR POSIX Client provides authenticated NFS access from remote nodes, along with over-the-wire compression and parallel access to boost throughput.
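Because the cluster is exposed as an NFS mount, ordinary file APIs are enough to move data in and out. The following is a minimal Python sketch of that idea. It assumes the cluster is mounted at some local path (a hypothetical example would be `/mapr/my.cluster.com`); the function names, directory, and file name are illustrative and are not part of any MapR API.

```python
import csv
from pathlib import Path


def write_rows(mount_root, relative_path, rows):
    """Write rows as CSV under an NFS-mounted directory using plain file I/O."""
    target = Path(mount_root) / relative_path
    # Create intermediate directories the same way you would on a local disk.
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return target


def read_rows(mount_root, relative_path):
    """Read the CSV back as a list of rows, each a list of strings."""
    with (Path(mount_root) / relative_path).open(newline="") as handle:
        return [row for row in csv.reader(handle)]
```

With a real mount, a call such as `write_rows(mount_root, "landing/events.csv", rows)` would place the file where a Spark job could then read it through its cluster path. Nothing in the sketch is specific to NFS, which is the point: any tool that can read and write files works unchanged.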
MapR fully supports additional industry-standard APIs, including ODBC/JDBC, LDAP, Kerberos, HBase, HDFS, NFS, and more.
Kerberos and LDAP integration. MapR supports authentication services via Kerberos and/or LDAP.
Access control. Data is secured using standard Unix file permissions and advanced role-based access control expressions (ACEs).
Native authentication. MapR also offers a standards-based authentication system as a simpler alternative to Kerberos that leverages Linux Pluggable Authentication Modules (PAM) to provide the widest registry support.
Comprehensive auditing. MapR auditing logs help to analyze user behavior as well as to meet regulatory compliance requirements. MapR uses the JSON format to log accesses at the administrative, authentication, database, and file levels.
Performant wire-level encryption. MapR encrypts data sent between nodes and applications to ensure data privacy, using Intel AES-NI capabilities where available.
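Since audit records are written as JSON, they can be summarized with any JSON-aware tool. The sketch below assumes one JSON object per line and uses hypothetical field names (`uid`, `operation`, `status`); the sample records are made up for illustration, so check the actual log schema before relying on any field.

```python
import json
from collections import Counter


def summarize_audit_log(lines):
    """Count audit records per (uid, operation), skipping blank or malformed lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not valid JSON
        if not isinstance(record, dict):
            continue  # ignore JSON values that are not objects
        # Field names here are assumptions, not a documented schema.
        counts[(record.get("uid"), record.get("operation"))] += 1
    return counts


# Made-up sample records, used only to exercise the function.
sample = [
    '{"uid": 1001, "operation": "READ", "status": 0}',
    '{"uid": 1001, "operation": "READ", "status": 0}',
    '{"uid": 2002, "operation": "DELETE", "status": 0}',
    'not json',
]
summary = summarize_audit_log(sample)
```

A per-user, per-operation count like this is a common starting point for both behavior analysis and compliance reporting.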
Multi-tenancy. MapR supports multi-tenancy beyond the capabilities in YARN via advanced resource management and control to let distinct user groups, data sets, and applications coexist in isolation in the same cluster.
Security. MapR authentication and authorization controls provide another level of user and data isolation.
Volumes. MapR supports the logical grouping of files and directories on which policies (permissions, replication factors, quotas, etc.) can be set.
ExpressLane. MapR avoids starvation of small jobs by letting them run even when the cluster is busy with large jobs.
Job placement control. Even beyond YARN, MapR manages resources with label-based job placement to specify which nodes can run the specified job.
Data placement control. Configure the cluster topology to define on which nodes specific data is placed for performance, security, and optimal utilization purposes.
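The idea behind label-based job placement can be illustrated with a toy model: each node carries a set of labels, and a job that requests a label may only run on nodes that carry it. The sketch below is a conceptual illustration only; the function, node names, and labels are hypothetical and do not reflect MapR's actual label syntax or scheduler.

```python
def eligible_nodes(job_label, node_labels):
    """Return the sorted names of nodes that carry the label the job requests."""
    return sorted(
        node for node, labels in node_labels.items() if job_label in labels
    )


# Hypothetical cluster: node name -> set of labels assigned to that node.
nodes = {
    "node-a": {"fast", "ssd"},
    "node-b": {"batch"},
    "node-c": {"fast"},
}
```

Here `eligible_nodes("fast", nodes)` selects `node-a` and `node-c`, while a label that no node carries yields an empty list, meaning the job has nowhere to run until a node is labeled accordingly.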
Customers can reduce their data center footprint with the MapR performance advantage by deploying as few as one-third as many servers as other distributions require.
A MapR cluster can scale to thousands of nodes and can store trillions of files. MapR officially set the MinuteSort record by sorting 1.5 TB of data in under a minute on Google Compute Engine. A MapR customer has since exceeded that record by sorting 1.65 TB, with one-seventh the number of servers of the highest non-MapR record.
MapR HA eliminates single points of failure to tolerate multiple node failures and ensure no unplanned downtime, no data loss, and no work loss. MapR HA requires no special configuration and is enabled automatically.
YARN HA and JobTracker HA. Jobs are tracked so that they can run to completion despite node failures.
NFS HA. Continuous NFS access is ensured to avoid disruptions to standard file system access.
No-NameNode architecture. Cluster filename metadata is distributed to ensure the cluster data is always available and accessible.
MapR Control System. The MapR Control System (MCS) is a browser-based interface for managing, administering, and monitoring your Spark cluster. It lets you immediately view the status of your cluster via heatmaps and drill into specific issues to investigate any problems. Alarms proactively notify you if potential problems arise.
Rolling upgrades. To minimize planned downtime, MapR allows a node-by-node Spark upgrade on a live cluster. With MapR backward compatibility, existing applications can still run on an upgraded Spark cluster with no modifications.
Customers can maintain continuity despite a site-wide disaster, and also can quickly recover damaged and accidentally deleted files.
Mirrors. Mirrors are consistent copies of a cluster replicated to a remote site, either on-premises or in the cloud. Scheduled mirroring incrementally updates the mirror by only sending block-level differentials from the source cluster to shorten the recovery point objective (RPO). Promotable mirrors enable fast and easy switchover of replicas to active production use to shorten the recovery time objective (RTO). Mirrors can also be used for load balancing as well as for wide geographic distribution to reduce network latency for distant end users.
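To see why sending only block-level differentials keeps mirror updates small, consider a simplified model in which both copies are split into fixed-size blocks and only the blocks that differ are transferred. The sketch below uses per-block hashing as a stand-in for change detection; the source does not describe how MapR actually identifies changed blocks, and the block size and function names are illustrative assumptions.

```python
import hashlib

BLOCK_SIZE = 4  # tiny block size so the example is easy to follow; illustrative only


def block_digests(data, block_size=BLOCK_SIZE):
    """Split bytes into fixed-size blocks and hash each one."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(source, mirror, block_size=BLOCK_SIZE):
    """Return indices of source blocks that differ from, or are missing in, the mirror."""
    src = block_digests(source, block_size)
    dst = block_digests(mirror, block_size)
    return [i for i, digest in enumerate(src) if i >= len(dst) or dst[i] != digest]
```

If one block out of three has changed, only that block's index is returned, so only that block would need to cross the network. The sketch does not handle the case where the mirror is longer than the source (a truncation on the source side), which a real implementation would need to address.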
Consistent snapshots. Capture the exact state of the cluster at the time the snapshot is taken, to enable point-in-time recovery of files that were corrupted or deleted due to application or user error. Snapshots are also useful for running machine learning algorithms on a static view of data, as well as for auditing data sets.
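Point-in-time recovery of a single file can be as simple as copying it back from a snapshot. The sketch below assumes that snapshots are exposed as read-only directories under a `.snapshot` directory at the root of the mounted volume; treat that layout, the snapshot name, and the function itself as assumptions to verify against your own cluster.

```python
import shutil
from pathlib import Path


def restore_from_snapshot(volume_root, snapshot_name, relative_path):
    """Copy one file from a snapshot directory back into the live volume."""
    volume_root = Path(volume_root)
    # Assumed layout: <volume_root>/.snapshot/<snapshot_name>/<relative_path>
    source = volume_root / ".snapshot" / snapshot_name / relative_path
    target = volume_root / relative_path
    if not source.is_file():
        raise FileNotFoundError(f"{source} is not present in the snapshot")
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, target)  # copy2 also preserves timestamps where possible
    return target
```

The same snapshot path can serve as a static input for a machine learning job, since its contents do not change while the live volume continues to receive writes.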