MapR-FS enables you to create and manipulate tables in many of the same ways that you create and manipulate files in a standard UNIX file system.
A unified architecture for files and tables provides distributed data replication for both structured and unstructured data. Tables enable you to manage structured data, whereas files hold unstructured data. The structure is defined by a data model: a set of rules that defines the relationships among the elements of the data.
By design, the data model for tables in MapR focuses on columns, similar to the open-source standard Apache HBase system. Like Apache HBase, MapR-DB tables store data structured as a nested sequence of key/value pairs. For example, in the key/value pair "table:column family", the value "column family" becomes the key for the key/value pair "column family:column". With an M7 license, you can use MapR-DB tables, HBase tables, or a combination of both in your Hadoop environment.
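The nesting described above can be pictured with ordinary dictionaries. The following is only an illustrative in-memory sketch, not MapR or HBase code: the `put` and `get` helpers, the row keys, and the sample data are all hypothetical, and the row-key level is taken from the general HBase data model rather than from the text above.

```python
# Hypothetical in-memory picture of the nested key/value layout:
# row key -> column family -> column -> value.
# Each value at one level serves as the key into the next level.
table = {}


def put(table, row, family, column, value):
    """Store a value under row -> family -> column, creating levels as needed."""
    table.setdefault(row, {}).setdefault(family, {})[column] = value


def get(table, row, family, column):
    """Return the stored value, or None if any level of the nesting is missing."""
    return table.get(row, {}).get(family, {}).get(column)


# Sample data (made up for illustration).
put(table, "user1", "info", "name", "Ada")
put(table, "user1", "info", "email", "ada@example.com")
put(table, "user1", "stats", "logins", 3)

name = get(table, "user1", "info", "name")
families = sorted(table["user1"])
```

Here `families` lists the column families present for one row, and each family in turn keys a dictionary of columns, which mirrors the "column family:column" pairing.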
MapR-DB tables are implemented directly within MapR-FS and are accessed through a familiar, open-standard API, which yields a high-performance datastore for tables. MapR-FS is written in C and optimized for performance. As a result, MapR-DB tables in MapR-FS run significantly faster than tables in JVM-based Apache HBase.
Benefits of Integrated Tables in MapR-FS
The MapR cluster architecture provides the following benefits for table storage, resulting in an enterprise-grade HBase environment:
MapR clusters with HA features recover instantly from node failures.
MapR provides a unified namespace for tables and files, allowing users to group tables in directories by user, project, or any other useful grouping.
Tables are stored in volumes on the cluster alongside unstructured files. Storage policy settings for volumes apply to tables as well as files.
Volume mirrors and snapshots provide flexible, reliable read-only access.
Table storage and MapReduce jobs can co-exist on the same nodes without degrading cluster performance.
The use of MapR-DB tables imposes no administrative overhead beyond administration of the MapR cluster.
Node upgrades and other administrative tasks do not cause downtime for table storage.
HBase on MapR
MapR's implementation of the HBase API provides enterprise-grade high availability (HA), data protection, and disaster recovery features for tables on a distributed Hadoop cluster. MapR-DB tables can be used as the underlying key-value store for Hive, or any other application requiring a high-performance, high-availability key-value datastore. Because MapR uses the open-standard HBase API, many legacy HBase applications can continue to run on MapR without modification.
MapR has extended the HBase shell to work with MapR-DB tables in addition to Apache HBase tables. Similar to development for Apache HBase, the simplest way to create tables and column families in MapR-FS, and put and get data from them, is to use the HBase shell. MapR-DB tables can be created from the MapR Control System (MCS) user interface or from the Linux command line, without the need to coordinate with a database administrator. You can treat a MapR-DB table just as you would a file, specifying a path to a location in a directory, and the table appears in the same namespace as your regular files. You can also create and manage column families for your table from the MCS or directly from the command line.
During data migration or other specific scenarios where you need to refer to a MapR-DB table of the same name as an Apache HBase table in the same cluster, you can map the table namespace to enable that operation.
MapR does not support the hooks for manipulating the internal behavior of the datastore that are common in Apache HBase applications. The Apache HBase codebase and community have internalized numerous hacks and workarounds to circumvent the intrinsic limitations of a datastore implemented on a Java Virtual Machine. Some HBase workflows are designed specifically to accommodate limitations in the Apache HBase implementation. HBase code written around those limitations will generally need to be modified in order to work with MapR-DB tables.
MapR-DB tables use the open-standard HBase API.
MapR-DB tables implement the HBase feature set.
MapR-DB tables can be used as the datastore for Hive applications.
Unlike Apache HBase tables, MapR-DB tables do not support manipulation of internal storage operations.
Apache HBase applications crafted specifically to accommodate architectural limitations in HBase will require modification in order to run on MapR-DB tables.
Effects of Decoupling API and Architecture
The following features of MapR-DB tables result from decoupling the HBase API from the Apache HBase architecture:
MapR's High Availability (HA) cluster architecture eliminates the RegionServer and HBaseMaster components of traditional Apache HBase architecture, which are common single points of failure and scalability bottlenecks. In MapR-FS, MapR-DB tables are HA at all levels, similar to other services on a MapR cluster.
MapR-FS allows an unlimited number of tables, with cells of up to 16 MB each.
MapR-DB tables can have up to 64 column families, with no limit on the number of columns.
MapR-FS automates compaction operations and splitting for MapR-DB tables.
Crash recovery is significantly faster than with Apache HBase.
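The two numeric limits in the list above (64 column families per table, cells up to 16 MB) could be checked against a planned table design before any data is loaded. This is a minimal sketch under stated assumptions: the function name and constants are hypothetical, nothing here calls a MapR API, and "16 MB" is assumed to mean 16 × 1024 × 1024 bytes.

```python
# Limits taken from the list above; the byte interpretation of "16 MB"
# is an assumption (16 * 1024 * 1024 bytes).
MAX_COLUMN_FAMILIES = 64
MAX_CELL_BYTES = 16 * 1024 * 1024


def check_table_design(column_families, largest_cell_bytes):
    """Return a list of problems; an empty list means the design fits the limits."""
    problems = []
    family_count = len(set(column_families))  # duplicates count once
    if family_count > MAX_COLUMN_FAMILIES:
        problems.append(
            f"{family_count} column families exceeds the limit of {MAX_COLUMN_FAMILIES}"
        )
    if largest_cell_bytes > MAX_CELL_BYTES:
        problems.append(
            f"cell of {largest_cell_bytes} bytes exceeds the limit of {MAX_CELL_BYTES}"
        )
    return problems


# A small design that fits, and an oversized one that violates both limits.
ok = check_table_design(["info", "stats"], 1024)
too_big = check_table_design([f"cf{i}" for i in range(65)], 17 * 1024 * 1024)
```

The number of columns within a family is deliberately not checked, since the list above states no limit for it.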