The MapR Data Platform provides a unified data solution for structured data (tables) and unstructured data (files).
The MapR File System (MapR-FS) is a fully read-write distributed file system that eliminates the Namenode single point of failure found in other Hadoop distributions. MapR re-engineered the Hadoop Distributed File System (HDFS) architecture to provide flexibility, increase performance, and enable special features for data management and high availability.
The following table provides a list of some MapR-FS features and their descriptions:
Storage pool
A group of disks that MapR-FS writes data to.

Container
An abstract entity that stores files and directories in MapR-FS. A container always belongs to exactly one volume and can hold namespace information, file chunks, or table chunks for the volume the container belongs to.

Container Location Database (CLDB)
A service that tracks the location of every container in MapR-FS.

Volume
A management entity that stores and organizes containers in MapR-FS. Used to distribute metadata, set permissions on data in the cluster, and for data backup. A volume consists of a single name container and a number of data containers.

Snapshot
A read-only image of a volume at a specific point in time, used to preserve access to deleted data.

Direct Access NFS
Enables applications to read and write data directly into the cluster.
The following image represents disks grouped together to create storage pools that reside on a node:
Write operations within a storage pool are striped across disks to improve write performance. Stripe width and depth are configurable with the disksetup script. Since MapR-FS performs data replication, RAID configuration is unnecessary.
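As a sketch of how storage pools are formed, the disksetup script takes a list of raw disks and groups them into pools; the `-W` option sets how many disks go into each pool. The disk device names below are examples only, and the exact option set varies by release, so check the script's help output on your nodes before running it (it formats the listed disks destructively).

```shell
# Illustrative only: give four raw disks to MapR-FS (run as root on each node).
# The device names are examples; list your node's actual unused disks.
cat > /tmp/disks.txt <<'EOF'
/dev/sdb
/dev/sdc
/dev/sdd
/dev/sde
EOF

# -W sets the stripe width (disks per storage pool).
# With four disks and -W 2, disksetup creates two 2-disk storage pools.
/opt/mapr/server/disksetup -W 2 -F /tmp/disks.txt
```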
Containers and the CLDB
MapR-FS stores data in abstract entities called containers that reside on storage pools. Each storage pool can store many containers.
Blocks enable full read-write access to MapR-FS and efficient snapshots. An application can write, append, or update data more than once in MapR-FS, and can also read a file while it is being written. In other Hadoop distributions, an application can write only once, and cannot read a file while it is being written.
An average container is 10-30 GB. The default container size is 32 GB. Large containers allow for greater scaling and allocation of space in parallel without bottlenecks.
Described from the physical layer up:
Files are divided into chunks
The chunks are assigned to containers
The containers are written to storage pools, which are made up of disks on the nodes in the cluster
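As a concrete sketch of the first step, the number of chunks a file is divided into is a simple ceiling division by the chunk size. The 256 MB figure below is an assumed default for illustration; chunk size is configurable per directory in MapR-FS, so your cluster may differ.

```shell
# Hypothetical sizes for illustration; the chunk size is an assumption,
# not a guarantee for any particular cluster.
FILE_MB=1000        # a 1000 MB file
CHUNK_MB=256        # assumed chunk size

# Ceiling division: how many chunks the file is divided into.
CHUNKS=$(( (FILE_MB + CHUNK_MB - 1) / CHUNK_MB ))
echo "$CHUNKS"      # → 4
```

On a live cluster, MapR's extended listing command (`hadoop mfs -ls <path>`) reports a file's chunk size and the containers that hold each chunk's replicas.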
The MapR-FS and HDFS storage architectures compare as follows. In HDFS, the management layers are files, directories, and blocks, managed by the Namenode. In MapR-FS, the management layer is the volume, which holds files and directories and is made up of containers, which manage disk blocks and replication. The two architectures also differ in the size of the file shard, the unit of replication, and the unit of file allocation.
MapR-FS automatically replicates containers across different nodes in the cluster to preserve data. Container replication creates multiple synchronized copies of the data across the cluster for failover. Replication also localizes operations and parallelizes read operations. When a disk or node failure brings a container's replication below the specified level, MapR-FS automatically re-replicates the container elsewhere in the cluster until the specified level is restored. A container only occupies disk space when an application or program writes to it.
A volume is a management entity that logically organizes a cluster's data. Since a container always belongs to exactly one volume, that container's replicas all belong to the same volume as well. Volumes do not have a fixed size and do not occupy disk space until MapR-FS writes data to a container within the volume. A large volume may contain anywhere from 50 to 100 million containers.
The CLI and REST API provide functionality for volume management. Typical use cases include volumes for specific users, projects, development, and production environments. For example, if an administrator needs to organize data for a special project, the administrator can create a specific volume for the project. MapR-FS organizes all containers that store the project data within the project volume.
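A minimal sketch of that use case with maprcli (the volume and path names here are examples, not cluster defaults):

```shell
# Create a dedicated volume for a project and mount it in the
# cluster namespace at the given path.
maprcli volume create -name project-alpha -path /projects/alpha

# Confirm the volume exists (output columns vary by release).
maprcli volume list
```

The same operation is exposed through the REST API; the request parameters mirror the maprcli options shown above.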
A volume’s topology defines which racks or nodes a volume includes. The topology describes the locations of nodes and racks in the cluster.
The following image represents a volume that spans a cluster:
Volume topology is based on node topology. You define volume topology after you define node topology. When you set up node topology, you can group nodes by rack or switch. MapR-FS uses node topology to determine where to replicate data for continuous access to the data in the event of a rack or node failure.
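A sketch of that two-step sequence, assuming an illustrative rack path of /data/rack1 (the topology path, volume name, and `<id-list>` placeholder are examples):

```shell
# Step 1: assign node topology, grouping nodes under a rack path.
# <id-list> is a placeholder for the server IDs reported by `maprcli node list`.
maprcli node move -serverids <id-list> -topology /data/rack1

# Step 2: restrict a volume so its containers are placed only on
# nodes under that topology.
maprcli volume move -name project-alpha -topology /data/rack1
```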
MapR-FS creates a name container for each volume that stores the volume's namespace and file chunk locations, along with inodes for the objects in the file system. The file system stores the metadata for files and directories in the name container, which is updated with each write operation.
When a volume has more than 50 million inodes, the system raises an alert that the volume is reaching the maximum recommended size.
Local volumes are confined to one node and are not replicated. Local volumes are part of the cluster's global namespace and are accessible at a path specific to the node that hosts them.
A snapshot is a read-only image of a volume at a specific point in time. Snapshots preserve access to deleted data and protect the cluster from user and application errors. Snapshots enable users to roll back to a known good data set. Snapshots can be created on-demand or at scheduled times.
New write operations on a volume with a snapshot are redirected to preserve the original data. Snapshots only store the incremental changes in a volume’s data from the time the snapshot was created.
The storage used by a volume's snapshots does not count against the volume's quota.
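An on-demand snapshot can be sketched with maprcli as follows (the volume and snapshot names are examples):

```shell
# Take a read-only, point-in-time image of the volume.
maprcli volume snapshot create -volume project-alpha -snapshotname pre-cleanup

# List the volume's snapshots.
maprcli volume snapshot list -volume project-alpha
```

The snapshot's contents are then readable under the volume's .snapshot directory, which lets users copy back a known good data set after an error.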
A mirror volume is a read-only physical copy of a source volume. Mirror volumes can be local (on the same cluster) or remote (on a different cluster), and can be created from the MCS or from the command line to mirror data between clusters or data centers, or between on-premises and public cloud infrastructures.
When a mirror volume is created, MapR-FS creates a temporary snapshot of the source volume. The mirroring process reads content from the snapshot into the mirror volume. The source volume remains available for read and write operations during the mirroring process.
The initial mirroring operation copies the entire source volume. Subsequent mirroring operations only update the differences between the source volume and the mirror volume. The mirroring operation never consumes all of the available network bandwidth, and throttles back when other processes need more network bandwidth.
Mirrors are atomically updated at the mirror destination. The mirror does not change until all bits are transferred, at which point all the new files, directories, and blocks are atomically moved into their new positions in the mirror volume.
MapR-FS replicates source and mirror volumes independently of each other.
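A sketch of creating and syncing a local mirror from the command line (volume names and the cluster name are examples, and the exact form of the `-type` parameter varies by release):

```shell
# Create a mirror volume whose source is project-alpha on the named cluster.
maprcli volume create -name project-alpha-mirror -path /mirrors/alpha \
    -source project-alpha@my.cluster.com -type mirror

# Start a mirroring pass; the first pass copies the whole volume,
# later passes copy only the differences.
maprcli volume mirror start -name project-alpha-mirror
```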
Direct Access NFS
You can mount a MapR cluster directly through a network file system (NFS) from a Linux or Mac client. When you mount a MapR cluster, applications can read and write data directly into the cluster with standard tools, applications, and scripts. MapR enables direct file modification and multiple concurrent reads and writes with POSIX semantics. For example, you can run a MapReduce job that outputs to a CSV file, and then import the CSV file directly into SQL through NFS.
MapR exports each cluster as the directory /mapr/<cluster name>. If you create a mount point with the local path /mapr, Hadoop FS paths and NFS paths to the cluster will be the same, which makes it easy to work on the same files through both NFS and Hadoop. In a multi-cluster setting, the clusters share a single namespace, and you can see them all by mounting the top-level /mapr directory.
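A minimal mount from a Linux client might look like the following; the NFS server hostname is an example, and the `hard,nolock` options reflect commonly recommended settings rather than the only valid ones:

```shell
# Mount the cluster's NFS export at the local path /mapr so that
# Hadoop FS paths and NFS paths match.
sudo mkdir -p /mapr
sudo mount -o hard,nolock nfsnode.example.com:/mapr /mapr

# Files are then reachable with ordinary tools and scripts:
ls /mapr/<cluster name>
```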