MapR Data Platform Under the Hood


This is Part 2 of a multi-part blog series. The reader can refer to Part 1 of the series here.

Introduction

In this blog post, we discuss how the MapR Data Platform has architected the four pillars that solve most enterprise data problems and eliminate data silos.

Data Containers

At the heart of the MapR Data Platform is its data container innovation. A MapR data container, also referred to simply as a container, is the unit of storage allocation and management inside the MapR Data Platform. The container is the foundation for some of the core functionality of the MapR Data Platform.

Specifically, it provides:

  • Different data persistence models, such as files, objects, tables, and event queues
  • Distributed scale-out storage
  • Data loss prevention
  • Failure resiliency and disaster recovery

Each container stores a variety of data elements, such as shards of files, objects, tables, pub/sub topics, and directories. The size of a container is elastic: it starts at zero and grows to several GB as it is populated with data. Containers are distributed across all the nodes in a cluster to provide scale-out storage and processing. To prevent data loss from node or disk failure, each container (also referred to as a primary container) is replicated into multiple copies (referred to as replicas) that are stored on different nodes in the cluster.

A volume consists of one or more containers and can span many nodes in a cluster. Each volume has one name container and one or more data containers; the name container stores metadata about the data in the volume. The volume is the unit of administration inside the MapR Data Platform. For example, an administrator specifies controls and properties at the volume level, such as the data replication factor, security controls (authorization and encryption), and mirroring relationships. All artifacts inside the volume inherit these properties. For example, when a volume is set with replication factor = 3, each container inside the volume is 3-way replicated. Similarly, when a volume is marked for mirroring, the volume contents are copied to a remote cluster to enable continuity of operations if disaster strikes the primary data center.

To prevent data loss in the event of node or disk failure, a primary container is replicated to additional copies that are stored on different nodes. With replication factor = n, a MapR cluster tolerates n-1 node failures without any data loss. In a large cluster built from commodity hardware, node and disk failures are common. When a node or disk goes bad, MapR automatically elects a new primary container from among the replicas and initiates the rebuilding of additional replicas. The reconstruction of a container happens as a background process, minimizing the impact on user applications.

Let's look at this using an example.

Diagram 1 shows a 9-node cluster with replication factor = 3: 12 primary data containers (denoted by 12 different colors, each labeled 1) and 24 replicas (the same colors, labeled 2 and 3).

After node 7 goes bad, as shown in Diagram 2, the replicas of the blue and purple containers are rebuilt on node 4. The replica of the brown container on node 5 is elected as the new primary, and a new replica of the brown container is rebuilt on node 8, as shown in Diagram 3.

Many popular data platforms, such as Apache HDFS or Apache Kafka, require a local file system, such as ext4, as well as RAID disks on every node in order to operate. But thanks to the innovative data container and the volume, the MapR Data Platform doesn't require any special software or hardware to operate. A Linux operating system running on each node with local block devices, such as disks, SSDs, or EBS volumes, and a high-speed network connection is sufficient to run the MapR Data Platform. More on this topic is covered in the deployment section.

4 Pillars of the MapR Data Platform

1. Distributed Metadata

Building a distributed metadata service is a very hard technical problem, which is why most data platforms implement a centralized metadata service. Apache HDFS started with a single instance of the namenode running in the cluster (HDFS has since added support for a primary and a standby namenode to mitigate some of the scalability limitations). Apache HBase implements a single instance of the META service running in the cluster. A centralized metadata service leads to a number of limitations, specifically:

  • Creates a single point of failure: when the centralized metadata service goes down, the entire cluster becomes unavailable.
  • Creates a hotspot that limits the scalability of the cluster: adding nodes to run additional workloads with a guaranteed SLA stops working beyond a certain cluster size, forcing users to create a new cluster and, in turn, more data silos.
  • Limits the number of data artifacts that can be stored in the cluster: e.g., storing 10 million files in HDFS requires about 3 GB of namenode memory to maintain and serve the metadata (see the back-of-envelope estimate after this list), and scaling much beyond that becomes a problem.
  • Limits sharding of data artifacts: e.g., an Apache Kafka topic can be partitioned, but each topic partition is restricted to a single node in the cluster. When a node's local disks fill up, the Kafka partitions on that node cannot store any additional data.
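
The 3 GB figure above follows from the commonly cited Hadoop rule of thumb that the namenode keeps roughly 150 bytes of heap per file object and per block object. A minimal sketch of the arithmetic, assuming small files of one block each:

```java
// Back-of-envelope namenode heap estimate, assuming ~150 bytes of heap
// per file object and per block object (a common Hadoop rule of thumb).
public class NamenodeHeapEstimate {
    public static void main(String[] args) {
        long files = 10_000_000L;     // 10 million files
        long blocksPerFile = 1;       // assume small files: one block each
        long bytesPerObject = 150;    // rough heap cost per metadata object
        long objects = files + files * blocksPerFile;   // file objects + block objects
        double gb = objects * bytesPerObject / 1e9;
        System.out.printf("~%.1f GB of namenode heap for %,d files%n", gb, files);
    }
}
```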

MapR has built a distributed metadata service from the ground up that eliminates all these limitations. Specifically, the MapR Data Platform has implemented two levels of metadata service.

The container location database (CLDB) serves as MapR's first-level metadata service and maintains metadata about the nodes, volumes, and containers in the cluster. For high availability, three or more CLDB services typically run on three different nodes. The CLDB maintains the mapping between each container and the node where that container is located, and stores this mapping in a volume with only one name container.

The metadata about data artifacts, such as files, directories, tables, and topics, is maintained at the second level and stored in the name container of each volume.

When an application wants to read or write data, it contacts the CLDB service to determine the container location. The application then retrieves additional metadata from the name container of the volume. As explained earlier, each volume has its own name container, and the name containers of different volumes are distributed among different nodes, thereby spreading the metadata access workload across many nodes. Finally, the application performs the actual read or write operation.
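
To make the two-level lookup concrete, here is a purely illustrative sketch of the flow; the interfaces and method names below are hypothetical and do not correspond to MapR's actual client library.

```java
// Hypothetical interfaces that model the two-level metadata lookup described
// above; they are NOT MapR's real client API.
import java.util.List;

public class MetadataLookupSketch {

    // First-level metadata (served by the CLDB): container -> node locations.
    interface ContainerLocationDb {
        long nameContainerOf(String volumeName);          // name container id of a volume
        List<String> locateContainer(long containerId);   // nodes hosting a container
    }

    // Second-level metadata (stored in the volume's name container):
    // which data container holds a given file, table, or topic shard.
    interface NameContainerClient {
        long dataContainerOf(String path);
    }

    // Rough shape of a read: two metadata hops, then the actual data I/O.
    static byte[] read(ContainerLocationDb cldb, NameContainerClient nameContainer,
                       String volume, String path) {
        long nameCid = cldb.nameContainerOf(volume);             // 1. CLDB: find the volume's name container
        List<String> nameNodes = cldb.locateContainer(nameCid);  //    nodes serving that name container
        long dataCid = nameContainer.dataContainerOf(path);      // 2. name container: find the data container
        List<String> dataNodes = cldb.locateContainer(dataCid);
        return readFromNode(dataNodes.get(0), dataCid, path);    // 3. read from a node hosting the data
    }

    static byte[] readFromNode(String node, long containerId, String path) {
        return new byte[0]; // placeholder for the actual data RPC
    }
}
```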

By distributing the metadata in CLDBs as well as in name containers of each volume, the MapR Data Platform ensures:

  • No single point of failure: if the master CLDB goes down, a new master is elected automatically.
  • Fast metadata lookups: the metadata access load is distributed across many nodes in the cluster, eliminating a single choke point.
  • No physical limit on the number of nodes in a cluster and no physical limit on the number of files or tables that can be created. Users can literally store hundreds of billions of small and large files and tens of millions of tables and message topics in the MapR Data Platform, provided the cluster is large enough.
  • No restriction on sharding of data artifacts. For example, a Kafka topic on the MapR Data Platform can span many nodes in the cluster.

2. Variety of Data Persistence Models

As discussed earlier, a MapR data container is the unit of storage allocation and management. Each container stores a variety of data elements, such as shards of files, objects, tables, pub/sub topics, and directories. Because of the data container foundation, the MapR Data Platform provides native data persistence for files, objects, directories, tables, and pub/sub topics, thus eliminating data silos and mismatched security policies.

At the lowest level, MapR supports two types of data elements: file chunks and key-value stores. Regular files are built by striping file chunks across containers. Directories are built over key-value stores. Tables and tablets are built on top of files and key-value stores and are optimized for index lookups as well as range scans. Pub/sub topics are built upon the table construct and optimized for low-latency pub/sub delivery.

3. Variety of APIs and Protocol Support

The MapR Data Platform supports a variety of APIs that developers can use to build applications that handle data at high scale. More importantly, the MapR Data Platform provides data interoperability among the different APIs. In other words, an application can ingest data using one set of APIs, and a different application can consume or analyze that data using a different set of APIs. Specifically, the MapR Data Platform supports the following APIs:

  • The HDFS API is provided for developers to build a variety of MapReduce-, Hive-, or Spark-based batch analytic applications that operate on files and directories.
  • NFS, a distributed file system protocol, is provided to run many popular enterprise applications natively on the MapR Data Platform as well as ingest data into the MapR Data Platform.
  • POSIX, an IEEE standard interface that provides numerous constructs to operate on files, directories, etc., is provided for developers to build and run POSIX applications with high-performance native file access on the MapR Data Platform.
  • S3 API is provided for developers to read/write S3 objects from/to the MapR Data Platform.
  • OJAI API, based on the JSON data format, is provided for developers to build applications that operate on database tables.
  • The CDC (change data capture) API provides a streaming data exchange from the MapR Data Platform into other data systems, which developers can use to build integrations with those systems.
  • The Apache Kafka API is provided for developers to build Kafka applications on the MapR Data Platform (a short producer sketch follows this list).
  • A SQL interface is provided for developers to build SQL-based applications that access data in the MapR Data Platform. Moreover, Apache Drill provides a layer of abstraction and eliminates the need for application developers to know whether the underlying data lives in files, tables, or pub/sub topics.
  • JDBC/ODBC interface is provided for developers to integrate business intelligence tools with the MapR Data Platform and run a variety of applications, such as reporting and dashboards.
  • A REST API is provided for developers to build lightweight, scalable web-based database or Kafka applications on the MapR Data Platform without a heavy client library dependency.
  • A container storage interface is provided to support applications that depend on data persistence in the container (e.g., Docker) world. MapR supports the Kubernetes FlexVolume driver as well as the newer storage standard, CSI.
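
As a small illustration of the Kafka API support mentioned above, the standard Kafka producer client can publish to a MapR stream. The "/demo-stream:events" topic name (following MapR's stream-path:topic convention) and the serializer settings are illustrative assumptions, not taken from the original post.

```java
// Minimal sketch: producing a message with the standard Kafka producer API.
// The stream path and topic name below are illustrative placeholders.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProduceToStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // With MapR's Kafka client library the topic path identifies the stream in the
        // file system namespace; with vanilla Apache Kafka you would also set
        // bootstrap.servers here.

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("/demo-stream:events", "sensor-17", "temp=21.4"));
            producer.flush();
        }
    }
}
```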

In addition, the MapR Data Platform provides data interoperability among different APIs to run a variety of workloads without creating new data silos. For example:

  1. An email application may write data using the POSIX API into the MapR Data Platform, and a Hadoop application using the HDFS API may read and analyze that data (a short sketch of this pattern follows the list).
  2. A web application may write objects using the S3 API into the MapR Data Platform, and another application using the POSIX API may consume and analyze that data.
  3. Kafka applications may produce and consume messages using Kafka APIs into and from the MapR Data Platform, and analytics applications may read and analyze that data using the Spark SQL interface.
  4. A custom application may write customer transaction data into a table using the OJAI API, and an ad hoc analytics query using the Drill SQL interface may read and analyze customer spend in real time.
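
A minimal sketch of the first interoperability example, assuming the cluster is mounted locally (the /mapr/my.cluster.com mount point, paths, and file name are hypothetical): one process writes a file through the POSIX mount, and a Hadoop application reads the same file back through the HDFS-compatible API.

```java
// Interoperability sketch: write via a POSIX mount point, read via the HDFS API.
// The mount point and paths are illustrative placeholders.
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PosixThenHdfs {
    public static void main(String[] args) throws Exception {
        // 1. A "POSIX" writer creates a file through the local mount of the cluster
        //    (the target directory is assumed to exist).
        Files.write(Paths.get("/mapr/my.cluster.com/user/alice/inbox/msg-001.txt"),
                    "hello from a POSIX writer".getBytes());

        // 2. A Hadoop application reads the same file through the HDFS-compatible API,
        //    using the cluster-relative path.
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataInputStream in = fs.open(new Path("/user/alice/inbox/msg-001.txt"))) {
            System.out.println(new String(in.readAllBytes()));
        }
    }
}
```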

As a result of data interoperability, the MapR Data Platform:

  • Eliminates the need to copy a subset of data into a different data platform to power a new application that must access/process that subset of data.
  • Makes it very easy to apply the intelligence gathered from the analytics back into the operational application.

4. Security

The MapR Data Platform provides all the security functionality administrators need to run a cluster in a fully secured manner. Moreover, the secure-by-default functionality lets an administrator bring up a cluster in secure mode very easily. Specifically, in the MapR Data Platform:

  • Authentication is required for all access. As is good practice, MapR does not store user authentication data and instead relies on an external user registry, which can be configured to be any registry supported by Linux.
  • Data encryption is applied both at rest and in motion. For example, when a server sends data to another process, the data is encrypted to protect it from prying eyes; similarly, when data is written to disk, it is encrypted.
  • Audit capabilities allow security administrators to log all cluster administration operations as well as operations on data artifacts, such as directories, files, tables, and streams.
  • The MapR Data Platform implements Access Control Expressions (ACEs), rather than Access Control Lists (ACLs), to provide fine-grained access control with ease of use. Although ACLs were a substantial improvement over simple user and group permissions, they still cannot easily express some common requirements, such as banning a single user from a group, and the permission expressions supported by ACLs are a strict subset of those supported by ACEs. A security administrator may associate ACEs with volumes, files, tables, column families and columns within a table, and streams.
  • When selected, the secure-by-default option ensures that the MapR Data Platform and MapR Ecosystem Pack (MEP) components are secure out of the box on all new installations: all network connections require authentication, and all data in motion is protected with wire-level encryption. These security semantics are applied automatically to data being retrieved or stored by any ecosystem component, application, or user.

Running on Bare Metal, Cloud, Multi-Cloud, Edge, or Kubernetes

As discussed before, Apache HDFS and Apache Kafka require an underlying file system, such as ext4, or RAID disks in order to run in a production deployment. But with data containers and the volume abstraction at its core, the MapR Data Platform doesn't require any special software or hardware.

Each node, running a Linux operating system with local block devices (disks, SSDs, EBS volumes) and connected over a high-speed network, is sufficient. The MapR Data Platform does the heavy lifting: distributing containers among nodes, replicating containers across nodes, providing failure resiliency, preventing data loss, mirroring to a remote data center to support disaster recovery, and so on.

Because of these two key innovations, namely data containers and volumes, the MapR Data Platform runs in every deployment environment where data is collected and processed. Whether it is an on-premises cluster with thousands of physical nodes, hundreds of virtual machines on a public cloud, or a 3-node Intel NUC-based cluster at the edge, the MapR Data Platform, unlike other major data platforms, takes full advantage of the infrastructure and delivers the same functionality.

The MapR Data Platform also provides innovative capabilities that make it easy to run in a multi-cloud environment spanning on-premises private clouds and public clouds. The hard part of running applications seamlessly across clouds boils down to how quickly data can be made available where it is needed. The MapR Data Platform provides data replication for its different data persistence models. Specifically:

  • Replication of individual files from one cluster to another keeps file data in sync in real time
  • Multi-master table replication keeps table data in sync in real time among clusters in private and public clouds
  • Global topic replication keeps message topics in sync in real time across clusters

Summary

The MapR Data Platform eliminates, or at least reduces, data silos by virtue of the 4 pillars:

  • A plethora of API and protocol support that enables developers to build a variety of application workloads
  • Different data persistence models with objects, files, tables, and pub/sub Topics for developers to optimize their application workload
  • Distributed metadata that eliminates restrictions on scalability on cluster size, number of data artifacts, and data sharding
  • Single security model for all data

In addition, the MapR Data Platform runs in a variety of deployment environments, whether large clusters of physical nodes on-premises, large clusters of virtual nodes on the public cloud, or a small cluster of Intel NUCs at the edge, without any loss of functionality.

In the next blog post, we will discuss application development on the MapR Data Platform.


This blog post was published November 29, 2018.