This is Part 2 of a multi-part blog series. The reader can refer to Part 1 of the series here.
In this blog, we will discuss how the MapR Data Platform architects the four pillars that solve most data problems for the enterprise and eliminate data silos.
At the heart of the MapR Data Platform is its data container innovation. A MapR data container, also referred to simply as a container, is the unit of storage allocation and management inside the MapR Data Platform. The container is the foundation for some of the core functionality of the MapR Data Platform.
Specifically, it provides:
Each container stores a variety of data elements, such as shards of files, objects, tables, pub/sub topics, and directories. The size of a container is elastic: it starts at 0 and grows to several GB as it is populated with data. Containers are distributed across all the nodes in a cluster to provide scale-out storage and processing. To prevent data loss from node or disk failure, each container (also referred to as a primary container) is replicated into multiple copies (also referred to as replicas) that are stored on different nodes in the cluster.
A volume consists of one or more containers and can span many nodes in a cluster. Each volume has one name container and one or more data containers. The name container stores metadata about the data in the volume. The volume is the unit of administration inside the MapR Data Platform. For example, an administrator specifies controls and properties at the volume level, such as the data replication factor, security controls (authorization and encryption), and mirroring relationships. All artifacts inside the volume reflect these properties. For example, when a volume is set with replication factor = 3, each container inside the volume is 3-way replicated. Similarly, when a volume is marked for mirroring, the volume content is copied to a remote cluster to enable continuity of operations in case a disaster strikes the primary data center.
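For illustration, volume-level properties such as the replication factor are typically applied when the volume is created, using the maprcli tool or the MapR REST API. The sketch below assumes the REST form of the volume-create call; the host, credentials, volume name, and mount path are placeholders, and certificate trust setup is elided, so treat it as a shape of the call rather than a copy-paste recipe.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

/**
 * Illustrative sketch: creating a volume with volume-level properties
 * (here, replication factor 3) through the MapR REST admin API.
 * Host, credentials, and paths are assumptions for this example.
 */
public class CreateVolumeSketch {
    public static void main(String[] args) throws Exception {
        String auth = Base64.getEncoder().encodeToString("mapr:secret".getBytes());
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://cldb-node.example.com:8443/rest/volume/create"
                + "?name=analytics&path=/analytics&replication=3"))
            .header("Authorization", "Basic " + auth)
            .POST(HttpRequest.BodyPublishers.noBody())
            .build();
        // The REST service answers with a JSON status document.
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```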
To prevent data loss in the event of node or disk failure, a primary container is replicated to additional copies that are stored on different nodes. With replication factor = n, a MapR cluster tolerates n-1 node failures without any data loss. In a large cluster built from commodity hardware, node or disk failure is common. When a node or disk goes bad, MapR automatically elects a new primary container from the replicas. MapR also initiates rebuilding of the lost replicas. The reconstruction of a container happens as a background process, minimizing the impact on user applications.
Let's look at an example.
Diagram 1 shows a 9-node cluster with replication factor = 3, containing 12 primary data containers (denoted by 12 different colors with the number 1) and 24 replicas (denoted by the same colors with the numbers 2 and 3).
After node 7 goes bad, as shown in Diagram 2, the replicas of the blue and purple containers are rebuilt on node 4. The replica of the brown container on node 5 is elected as the new primary, and a new replica of the brown container is rebuilt on node 8, as shown in Diagram 3.
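A minimal, self-contained simulation of that behavior might look like the sketch below. The node count, container count, and placement policy are deliberately simplified illustrations of the idea, not MapR's actual placement logic, which lives in the CLDB.

```java
import java.util.*;

/**
 * Toy simulation of primary election and replica rebuild after a node failure.
 * Placement is random and simplified; it only illustrates the concept above.
 */
public class ReplicaFailoverSketch {
    // containerId -> ordered list of nodes holding copies; index 0 is the primary
    static Map<Integer, List<Integer>> placement = new HashMap<>();
    static final int NODES = 9, CONTAINERS = 12, REPLICATION = 3;

    public static void main(String[] args) {
        Random rnd = new Random(42);
        // Spread each container's primary and replicas over distinct nodes.
        for (int c = 0; c < CONTAINERS; c++) {
            List<Integer> nodes = new ArrayList<>();
            while (nodes.size() < REPLICATION) {
                int n = rnd.nextInt(NODES);
                if (!nodes.contains(n)) nodes.add(n);
            }
            placement.put(c, nodes);
        }
        failNode(6);  // e.g., "node 7" of the diagrams, using 0-based numbering here
    }

    static void failNode(int failed) {
        for (Map.Entry<Integer, List<Integer>> e : placement.entrySet()) {
            List<Integer> copies = e.getValue();
            if (!copies.contains(failed)) continue;
            boolean lostPrimary = copies.get(0) == failed;
            copies.remove(Integer.valueOf(failed));
            if (lostPrimary) {
                System.out.println("Container " + e.getKey()
                    + ": replica on node " + copies.get(0) + " elected new primary");
            }
            // Rebuild the missing copy on a surviving node that has no copy yet.
            for (int n = 0; n < NODES; n++) {
                if (n != failed && !copies.contains(n)) {
                    copies.add(n);
                    System.out.println("Container " + e.getKey()
                        + ": new replica rebuilt on node " + n);
                    break;
                }
            }
        }
    }
}
```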
Many popular data platforms, such as Apache HDFS or Apache Kafka, require a local file system, such as EXT4, as well as RAID disks on every node in order to operate. But thanks to the innovative data container and volume constructs, the MapR Data Platform doesn't require any special software or hardware to operate. A Linux operating system running on each node, with local block devices (disks, SSDs, or EBS volumes) and a high-speed network connection, is sufficient to run the MapR Data Platform. More on this topic is covered in the deployment section.
Building a distributed metadata service is a very hard technical problem, which is why most data platforms implement a centralized metadata service. Apache HDFS started with a single instance of the NameNode running in the cluster (HDFS has since added support for a master and a replica to address some of these scalability limitations). Apache HBase runs a single instance of the META service in the cluster. A centralized metadata service leads to a number of limitations, specifically:
MapR has built a distributed metadata service from the ground up that eliminates all these limitations. Specifically, the MapR Data Platform has implemented two levels of metadata service.
The container location database (CLDB) serves as MapR's first-level metadata service and maintains metadata about the nodes, volumes, and containers in the cluster. For high availability, three or more CLDB services typically run on three different nodes. The CLDB maintains the mapping between each container and the node where that container is located. Each CLDB service has a volume with a single name container that stores the container-to-node mapping.
The metadata about data artifacts, such as files, directories, tables, and topics, is maintained in the second-level metadata service and is stored in the name container of each volume.
When an application wants to read or write data, it contacts the CLDB service to determine the container location. The application then retrieves additional metadata from the name container of the volume. As explained earlier, each volume has its own name container, and the name containers of different volumes are distributed among different nodes, thereby spreading the metadata access workload across many nodes. Finally, the application completes the actual read or write operation.
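The sketch below is a hypothetical rendering of that two-level flow. The interfaces and method names are illustrative only; the real MapR client library performs these steps transparently on every read and write.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical client-side view of the two-level metadata lookup. */
public class MetadataLookupSketch {

    /** First level: the CLDB maps a container id to the node hosting it. */
    interface CldbClient {
        String locateContainer(long containerId);
    }

    /** Second level: a volume's name container maps a path to its data containers. */
    interface NameContainerClient {
        long nameContainerOf(String volumeName);
        long[] dataContainersOf(String pathInVolume);
    }

    static List<String> locateData(CldbClient cldb, NameContainerClient names,
                                   String volume, String path) {
        // 1. First level: ask the CLDB which node hosts the volume's name container.
        String nameNode = cldb.locateContainer(names.nameContainerOf(volume));
        System.out.println("reading metadata for " + path + " from " + nameNode);

        // 2. Second level: the name container yields the data containers that hold
        //    the artifact's contents (the metadata read itself is elided here).
        long[] dataContainers = names.dataContainersOf(path);

        // 3. Resolve each data container to its current primary node; the actual
        //    read or write then goes directly to those nodes.
        List<String> dataNodes = new ArrayList<>();
        for (long c : dataContainers) dataNodes.add(cldb.locateContainer(c));
        return dataNodes;
    }
}
```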
By distributing the metadata across the CLDB services as well as the name container of each volume, the MapR Data Platform ensures:
As discussed earlier, a MapR data container is the unit of storage allocation and management. Each container stores a variety of data elements such as shards of files, objects, tables, pub/sub topics, and directories. Because of the data container foundation, the MapR Data Platform provides native data persistence for files, objects, directories, tables, and pub/sub topics, thus eliminating data silos and mismatched security policies.
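To make that single-namespace idea concrete, here is a minimal, hypothetical sketch of publishing to a pub/sub topic that is addressed by a filesystem-style path. The stream path "/apps/clickstream" and topic name "events" are assumptions, the stream is presumed to have been created beforehand with MapR's admin tooling, and the example presumes MapR's Kafka-compatible client library on a cluster node.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Illustrative sketch: a topic lives in the same namespace as files and tables. */
public class TopicInNamespaceSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Standard Kafka serializer settings; with the MapR client library the
        // cluster connection is resolved from the node's cluster configuration.
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The topic is addressed by a filesystem-style path: <stream path>:<topic>.
            producer.send(new ProducerRecord<>("/apps/clickstream:events", "user42", "page_view"));
            producer.flush();
        }
    }
}
```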
At the lowest level, MapR supports two types of data elements – file chunks and key-value stores. Regular files are built by striping file chunks across containers. Directories are built over key-value stores. Tables and tablets are built on top of files and key-value stores and are optimized for index lookups as well as range scans. Pub/sub topics are built upon the table construct and optimized for low-latency publish/subscribe workloads.
The MapR Data Platform provides support for a variety of APIs that developers can use to build applications that handle data at high scale. More importantly, the MapR Data Platform provides data interoperability among these APIs. In other words, an application can ingest data using one set of APIs, and a different application can consume or analyze that data using a different set of APIs. Specifically, the MapR Data Platform supports the following APIs:
This interoperability among APIs allows a variety of workloads to run without creating new data silos. For example:
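As one concrete illustration, a file ingested through the HDFS-compatible API can be read back through the POSIX/NFS view of the same cluster without any copy or conversion step. The sketch below is a minimal example; the cluster name "my.cluster.com", the mount point under /mapr, and the file paths are placeholders.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Illustrative sketch: write via the HDFS API, read the same bytes via POSIX. */
public class ApiInteropSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up the client config on the node
        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/user/alice/events/batch-001.json"))) {
            out.write("{\"event\":\"signup\",\"user\":42}\n".getBytes(StandardCharsets.UTF_8));
        }
        // The same file is visible through the NFS/FUSE mount, so any POSIX tool or
        // library can consume what the HDFS-API writer produced.
        String viaPosix = new String(
            Files.readAllBytes(Paths.get("/mapr/my.cluster.com/user/alice/events/batch-001.json")),
            StandardCharsets.UTF_8);
        System.out.print(viaPosix);
    }
}
```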
As a result of data interoperability, the MapR Data Platform:
The MapR Data Platform provides all the security functionality administrators need to run a cluster in a fully secured manner. Moreover, the secure-by-default functionality lets an administrator bring up a cluster in secure mode with very little effort. Specifically, in the MapR Data Platform:
As discussed before, Apache HDFS and Apache Kafka require an underlying file system, such as EXT4, or RAID disks in order to run in a production deployment. But with data containers and the volume abstraction at the core, the MapR Data Platform doesn't require any special software or hardware.
Each node running a Linux operating system with local block devices (disks, SSDs, or EBS volumes) and connected over a high-speed network is sufficient. The MapR Data Platform does the heavy lifting, such as distributing containers among nodes, replicating containers across nodes, providing failure resiliency, preventing data loss, mirroring to a remote data center to support disaster recovery, and so on.
Because of the two key innovations, namely data containers and volumes, the MapR Data Platform runs in every deployment environment where data is collected and processed. Whether it is an on-premises cluster with thousands of physical nodes, hundreds of virtual machines on a public cloud, or a 3-node Intel NUC-based cluster at the edge, the MapR Data Platform – unlike other major data platforms – takes full advantage of the infrastructure and delivers the same functionality.
The MapR Data Platform also provides innovative capabilities that make it easy to run in a multi-cloud environment spanning on-premises private clouds and public clouds. The hard part of running applications seamlessly across multiple clouds boils down to how quickly data can be made available where it is needed. The MapR Data Platform provides data replication for the different data persistence models. Specifically,
The MapR Data Platform eliminates, or at least reduces, data silos by virtue of its four pillars:
In addition, the MapR Data Platform runs in a variety of deployment environments, whether large clusters of physical nodes on-premises, large clusters of virtual nodes on the public cloud, or a small cluster of Intel NUCs at the edge, without any loss of functionality.
In the next blog post, we will discuss application development on the MapR Data Platform.