Apache HBase

Distributed, Scalable, NoSQL Big Data Store for Hadoop

FREE TRAINING

Apache HBase Data Model and Architecture

WHAT IS APACHE HBASE?

Apache HBase is a distributed, scalable, NoSQL big data store that runs on a Hadoop cluster. HBase can host very large tables – billions of rows, millions of columns – and can provide real-time, random read/write access to Hadoop data. HBase is a wide-column data store modeled after Google Bigtable, the database interface to the proprietary Google File System. HBase provides Bigtable-like capabilities on top of Hadoop-compatible file systems, such as MapR XD. HBase scales linearly across very large datasets and easily combines data sources with different structures and schemas.

Free Hadoop Training:
Developing Apache HBase Applications: Basics

  • Learn how to write HBase programs using Hadoop as a distributed NoSQL datastore
  • Learn how to use the Java API to perform CRUD operations
  • Learn how to use helper classes
  • Learn how to create and delete tables
  • Learn how to set and alter column family properties
  • Learn how to batch updates

Take a moment to explore the free course.

HBase is not a traditional relational database (RDBMS). HBase was designed to scale across a cluster. Some of the key properties of HBase include:

  • NoSQL. Distributed and scalable. HBase groups rows into regions that define how table data is split over multiple nodes in a cluster. If a region gets too large, it is automatically split to share the load across more servers.
  • Wide-column. HBase stores data in a table-like format that can hold billions of rows with millions of columns. Columns are grouped into column families, and each column family is stored in its own set of files, so columns that are accessed together are stored together.
  • Unstructured or semi-structured data. Data stored in HBase does not need to fit into a rigid schema like with an RDBMS, making it ideal for storing unstructured or semi-structured data.
  • Consistent. HBase is architected for strongly consistent reads and writes, unlike NoSQL databases such as Cassandra that are eventually consistent by default. Once a write has been acknowledged, every subsequent read of that data returns the same value.
  • Failover. HBase data is replicated by the underlying file system, and the regions of a failed server are reassigned to other servers in the cluster.
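
The wide-column model described in the list above can be pictured as a sparse map sorted by row key. The following is a minimal sketch in Python, not the HBase API: the class `ToyTable`, its methods, and the sample row keys are all hypothetical names chosen for illustration. It assumes that column families are fixed when the table is created, that column qualifiers are free-form, and that each cell keeps timestamped versions.

```python
from collections import defaultdict


class ToyTable:
    """Toy in-memory model of a wide-column table (hypothetical, not the HBase API)."""

    def __init__(self, column_families):
        # Assumption: column families are declared up front;
        # qualifiers inside a family can be anything.
        self.column_families = set(column_families)
        # row key -> "family:qualifier" -> list of (timestamp, value)
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, family, qualifier, value, timestamp):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        cell = self.rows[row_key][f"{family}:{qualifier}"]
        cell.append((timestamp, value))
        cell.sort(key=lambda version: version[0], reverse=True)  # newest first

    def get(self, row_key, family, qualifier):
        """Return the newest value of a cell, or None if the cell is absent."""
        versions = self.rows.get(row_key, {}).get(f"{family}:{qualifier}")
        return versions[0][1] if versions else None

    def scan(self, start_key, stop_key):
        """Yield row keys in sorted order within [start_key, stop_key)."""
        for key in sorted(self.rows):
            if start_key <= key < stop_key:
                yield key


table = ToyTable(column_families=["info", "metrics"])
table.put("user#001", "info", "name", "Ada", timestamp=1)
table.put("user#001", "info", "name", "Ada L.", timestamp=2)
table.put("user#002", "metrics", "logins", "7", timestamp=1)
```

In this sketch the row `user#002` simply has no `info:name` cell, which is what makes the table sparse: absent columns take no space, and no schema change is needed to add a new qualifier.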

WHY APACHE HBASE?

Challenges with Previous Technologies

  • Relational databases were the standard for years, so what changed? With more and more data came the need to scale. However, relational databases were designed for a single node. They were not designed to be run on clusters.
  • With a relational database, you normalize your schema, which eliminates redundant data and makes storage efficient. Indexes and queries with joins are used to bring the data back together again. Indexes slow down data ingestion with lots of nonsequential disk I/O, and joins cause bottlenecks on reads with lots of data. The relational model does not scale horizontally across a cluster.
  • RDBMS architecture diagram

Advantages of HBase

  • HBase supports large volumes of data by running on clusters.
  • HBase was designed to scale; data that is accessed together is stored together. Grouping the data by row key is central to running on a cluster. In HBase, the data is automatically distributed across a cluster. Sharding distributes different data across multiple servers, and each server is the source for a subset of data. Because rows that are accessed together live on the same server, reads stay fast as the cluster grows.
  • Diagram of sharding
  • HBase was created to host very large tables for interactive and batch analytics, making it a great choice to store multi-structured or sparse data. You can use Apache HBase when you need random, real-time read/write access to your big data. HBase is natively integrated with Hadoop and can work seamlessly with other data access engines such as Apache Spark, Apache Hive, and MapR Database.
  • Like many NoSQL databases, HBase was designed to work on large datasets that are either too large or too expensive to process with a commercial RDBMS. HBase provides near-real-time data access by row key, which usually requires a denormalized schema design so that data that is accessed together is stored together.
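
Sharding by row key, as described in the list above, can be sketched as range partitioning: each region owns a contiguous range of sorted row keys, and a region that grows too large is split in two. The code below is a simplified illustration in Python, not HBase's implementation. The class `ToyRegionMap`, the row-count split threshold, and the round-robin placement of regions on servers are assumptions made for the example.

```python
import bisect


class ToyRegionMap:
    """Toy model of range partitioning by row key (hypothetical, not the HBase API)."""

    def __init__(self, servers, max_rows_per_region=4):
        self.servers = servers
        self.max_rows = max_rows_per_region
        # Each region covers [start_key, start key of the next region).
        # The first region starts at "", which sorts before every other key.
        self.start_keys = [""]
        self.regions = [[]]  # sorted row keys held by each region

    def _region_index(self, row_key):
        # Index of the last region whose start key is <= row_key.
        return bisect.bisect_right(self.start_keys, row_key) - 1

    def server_for(self, row_key):
        # Assumed placement policy: regions are assigned to servers round-robin.
        return self.servers[self._region_index(row_key) % len(self.servers)]

    def put(self, row_key):
        i = self._region_index(row_key)
        region = self.regions[i]
        if row_key not in region:
            bisect.insort(region, row_key)
        if len(region) > self.max_rows:
            self._split(i)

    def _split(self, i):
        # Split at the middle key; the upper half becomes a new region.
        region = self.regions[i]
        mid = len(region) // 2
        left, right = region[:mid], region[mid:]
        self.regions[i] = left
        self.regions.insert(i + 1, right)
        self.start_keys.insert(i + 1, right[0])


region_map = ToyRegionMap(servers=["rs1", "rs2"])
for n in range(1, 7):
    region_map.put(f"row{n:02d}")
```

With these six keys, the fifth insert pushes the single region over its threshold and it splits at `row03`, so lookups for `row01` and `row04` are routed to different servers. Because neighboring row keys stay in the same region, a scan over a key range touches few servers, which is why row key design matters so much.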

WHY APACHE HBASE WITH MAPR?

Challenges with HBase on Top of HDFS

  • With other Hadoop distributions, HBase runs on top of the Hadoop Distributed File System (HDFS), the Hadoop bottom layer component for storage.
  • HBase on Hadoop diagram
  • As seen in the diagram, there are many layers. HDFS is separate from the underlying file system, and HBase is separate from HDFS. Also, HDFS has several problems:
    • HDFS is written in Java on top of the Linux file system and is a write-once storage layer. Closed files can be changed only by appending to them; there is no support for continuous, in-place updates to a file, and this batch-oriented update model is a major limitation. Moreover, HDFS relies on the underlying Linux file system to store the HDFS content.
    • The NameNode, the part of the master node that identifies the location of each file block, has scalability and reliability issues. NameNodes are hard to configure, and as they are replicated so as not to become single points of failure, configuration gets even harder.
  • These layers of separation, combined with the limitations of a write-once HDFS file system, lead to several HBase problems.
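
One consequence of building a database on write-once storage can be sketched as follows. Since existing files cannot be modified in place, each batch of updates is written as a new immutable file, reads have to consult several files, and a periodic compaction rewrites them into one. This is a toy model in Python under those assumptions, not the HBase or HDFS API. The class `ToyWriteOnceStore` and its methods are hypothetical.

```python
class ToyWriteOnceStore:
    """Toy model of updates on write-once storage (hypothetical, not HBase or HDFS)."""

    def __init__(self):
        # Files are kept oldest first and are never modified after being added.
        self.files = []

    def flush(self, updates):
        # An update never changes an existing file; it adds a new one.
        self.files.append(dict(updates))

    def read(self, key):
        # The newest file wins, so a read may have to check several files.
        for data in reversed(self.files):
            if key in data:
                return data[key]
        return None

    def compact(self):
        # Rewrite all files into one, keeping only the newest value per key.
        merged = {}
        for data in self.files:  # oldest to newest, later values overwrite
            merged.update(data)
        self.files = [merged]


store = ToyWriteOnceStore()
store.flush({"row1": "a", "row2": "b"})
store.flush({"row1": "a2"})
files_before = len(store.files)
store.compact()
files_after = len(store.files)
```

In this sketch the number of files grows with every flush until `compact()` runs, and the compaction itself rereads and rewrites all of the data. That rewrite cost is the kind of overhead the layered design introduces.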

Advantages of HBase on Top of MapR XD

  • MapR XD exposes an HDFS API maintaining compatibility with the Hadoop APIs, and HBase can run on top of MapR XD.
  • HBase on MapR XD
  • While maintaining the core distribution, MapR has conducted proprietary development in some critical areas where the open-source community has not been able to solve Hadoop’s design flaws. MapR replaces one or more components, packages the rest of the open source components, and maintains compatibility with Hadoop.
  • Improving HDFS for high performance and high availability
  • HDFS does not support enterprise-grade performance, and MapR sought to change that in a number of ways:
    • MapR replaced HDFS so it would not be reliant on Java, Java garbage collection, or the underlying Linux file system.
    • MapR XD, MapR's distributed file system, implements a random read-write file system natively in C++ and accesses disks directly. HDFS, by contrast, is a write-once, append-only file system.
    • MapR introduces a Container Location Database instead of the NameNode in HDFS. A container database is more stable as container allocation is not changed as often as files and takes less space.
    • MapR decentralizes and distributes file location metadata increasing scalability to 1 trillion files compared to 100 million files in HDFS. Because each node in the cluster contains a copy of the file metadata, all nodes can participate in failure recovery instead of requiring each node to call back to a central instance.
  • MapR provides open APIs between Hadoop clusters and other common environments in the enterprise, including POSIX, NFS, S3, HDFS, HBase, SQL, and Kafka. With MapR, Hadoop gains a full read/write storage system that supports multiple and full random readers and writers.

Free Hadoop Training:
Apache HBase Schema Design

  • Learn how to design HBase schemas based on design guidelines
  • Learn how to use the Java API to perform CRUD operations
  • Learn the various elements of schema design and how to design for data access patterns
  • Take an in-depth look at designing row keys, avoiding hot-spotting, and designing column families
  • Learn how to transition from a relational model to an HBase model
  • Learn the differences between tall tables and wide tables

Take a moment to explore the free course.

WHY MAPR DATABASE?

An alternative to running HBase on top of MapR Distributed File and Object Store (MapR XD) is running MapR Database, a high-performance NoSQL database built into the MapR Data Platform. The MapR Database implementation integrates table storage into MapR XD, eliminating all JVM layers and interacting directly with disks for both file and table storage. MapR Database is multi-model: wide-column, key-value with the HBase API or JSON (document) with the OJAI API, allowing developers to choose the model best suited to their use case.

MapR Database integrated into file system

MapR Database brings together operations, analytics, real-time streaming, and database workloads to enable a broader set of next-generation data-intensive applications. The MapR Database is integrated with MapR XD Distributed File and Object Store as well as MapR Event Store for Apache Kafka, resulting in the most comprehensive dataware for businesses to run nearly any workload on a single cluster in production.

MapR Database has several advantages over HBase:

  • Predictable low latency
  • No need for manual intervention for compactions, region splitting, and recovery:
    • Partitioning (region splitting) by row key is automatic and fast.
    • Instant recovery occurs with zero data loss.
    • MapR Database does not need to do compaction.

For more information, read "An In-Depth Look at How MapR Database Does What Cassandra, HBase, and Others Can't" and "Top 10 Reasons Developers Choose MapR Database."

KEY BENEFITS OF MAPR DATABASE

MULTI-MODEL FLEXIBILITY

MapR Database supports multiple data models including document, wide-column, key-value, and time series on a unified foundation.

NATIVE JSON SIMPLICITY WITH EXPRESSIVE QUERIES

MapR Database is a highly scalable document database with native JSON support. It provides the intuitive and expressive OJAI query language for building powerful applications.

STRONG CONSISTENCY – NO DATA LOSS

MapR Database is always strongly consistent, and in-sync replication (factor 3) is always on. Once data is acknowledged, it will never be lost or corrupted.

EXTREME PERFORMANCE AND EFFORTLESS HORIZONTAL SCALE

In recent benchmarks validated by ESG, MapR Database was observed to be 2.5x faster than Cassandra and 5.5x faster than HBase on average across all workloads.

EXTREME HIGH AVAILABILITY

MapR Database inherits the enterprise features of the underlying platform with respect to failure handling, recovery, and resiliency.

GLOBAL MULTI-MASTER REPLICATION

The MapR Data Platform provides volume- and topology-based placement controls to enable multiple MapR Database applications to run securely and independently in the same cluster.

IN-PLACE SQL AND ADVANCED ANALYTICS/ML

MapR Database is natively integrated with machine learning and analytical tools to enable advanced analytics, data exploration, and interactive SQL, letting you immediately analyze or process live data and apply machine learning.

OPTIMIZED MULTI-TENANCY FOR THOUSANDS OF APPS

The MapR Data Platform provides volume- and topology-based data placement controls to support multi-tenancy, which means multiple MapR Database applications can run securely and independently in the same cluster without impacting SLAs. This results in lower administrative and hardware costs.

INTEGRATED STREAMING FOR REAL-TIME DATA INGEST, PROCESSING, AND INTEGRATION

MapR Database is integrated with MapR Event Store for Apache Kafka out of the box. MapR Event Store is a global event streaming system that enables real-time data ingestion and continuous stream processing.

ROBUST SECURITY AND FINE-GRAINED ACCESS CONTROL

MapR Database allows security policies at the sub-document and element level. You can set strict policies for only the confidential elements instead of for whole documents.

WHY MAPR DATABASE MATTERS TO YOU

DEVELOPERS

  • MapR Database supports multiple data models including wide-column, document, key-value, and time series on a unified foundation.
  • MapR Database is natively integrated with machine learning and analytical processing to enable advanced analytics, data exploration, and interactive SQL.
  • In recent benchmarks validated by ESG, MapR Database was observed to be 2.5x faster than Cassandra and 5.5x faster than HBase on average across all workloads.
  • MapR Database is integrated with MapR Event Store for Apache Kafka out of the box for real-time data flows. MapR Event Store is a global event streaming system that enables real-time data ingestion and stream processing.
  • MapR Database connector for Spark provides easier, faster data pipelines:
    • Build complex ETL pipelines that can speed up data ingestion and deliver superior performance
    • Combine event streams with machine learning to handle the logistics of machine learning
  • Persist data for containerized applications. MapR Data Fabric for Kubernetes allows for MapR volumes to be mounted for access by containers.
  • Scale data as containers grow. With a “grow as you go” feature, MapR handles growth in data without having to move it to a separate, dedicated environment.

IT / STORAGE ADMINISTRATOR

  • The MapR Data Platform is built for production. Consistent snapshots, replicas, and mirroring deliver enterprise-grade high availability and disaster recovery.
  • MapR XD is multi-tenant by design. Assign policies (quotas, permissions, placement) to logical units of management called volumes.
  • Balance cost and performance with MapR XD. Leverage policy-based data tiering, erasure coding, data placement, and more.
  • MapR Database is always strongly consistent, and in-sync replication (factor 3) is always on. Once data is acknowledged, it will never be lost or corrupted.
  • MapR Database inherits the enterprise features of the underlying platform with production-ready failure handling, recovery, and resiliency.

Spark Streaming with HBase

Carol McDonald, Industry Solutions Architect, covers:

  • What is Spark Streaming and what is it used for?
  • How does Spark Streaming work?
  • Example code to read, process, and write the processed data

Take a moment to read the blog post.

BEYOND HBASE AND HADOOP

A confluence of several technology shifts has dramatically changed big data and machine learning applications. The combination of distributed computing, streaming analytics, and machine learning is accelerating the development of next-generation intelligent applications, which take advantage of modern computational paradigms powered by modern computational infrastructure. The MapR Data Platform integrates global event streaming, real-time database capabilities, and scalable enterprise storage with Hadoop, HBase, Spark, Drill, and machine learning libraries to power this new generation of data processing pipelines and intelligent applications. Diverse and open APIs allow all types of analytics workflows to run on the data in place:

  • The MapR XD Distributed File and Object Store is designed to store data at exabyte scale, support trillions of files, and combine analytics and operations in a single platform. MapR XD supports industry-standard protocols and APIs, including POSIX, NFS, S3, and HDFS. Unlike Apache HDFS, which is write once, append only, the MapR Data Platform delivers a true read-write, POSIX-compliant file system. Support for the HDFS API enables Spark and Hadoop ecosystem tools, for both batch and streaming, to interact with MapR XD. Support for POSIX enables Spark and all non-Hadoop libraries to read and write to the distributed data store as if the data were mounted locally, which greatly expands the possible use cases for next-generation applications. Support for an S3-compatible API means MapR XD can also serve as the foundation for Spark applications that leverage object storage.
  • The MapR Event Store for Apache Kafka is the first big-data-scale streaming system built into a unified data platform and the only big data streaming system to support global event replication reliably at IoT scale. Support for the Kafka API enables Spark streaming applications to interact with data in real time in a unified data platform, which minimizes maintenance and data copying.
  • MapR Database is a high-performance NoSQL database built into the MapR Data Platform. MapR Database is multi-model: wide-column, key-value with the HBase API or JSON (document) with the OJAI API. Spark connectors are integrated for both HBase and OJAI APIs, enabling real-time and batch pipelines with MapR Database:
    • The MapR Database Connector for Apache Spark enables you to use MapR Database as a sink for Spark Structured Streaming or Spark Streaming.
    • The Spark MapR Database Connector enables users to perform complex SQL queries and updates on top of MapR Database, while applying critical techniques such as projection and filter pushdown, custom partitioning, and data locality.

MapR put key technologies essential to achieving high scale and high reliability in a fully distributed architecture that spans on-premises, cloud, and multi-cloud deployments, including edge-first IoT, while dramatically lowering both the hardware and operational costs of your most important applications and data.

MapR distributed architecture
