Database Comparison: An In-Depth Look at How MapR-DB Does What Cassandra, HBase, and Others Can't

Contributed by Carol McDonald

If you are a developer or architect working with a high-performance product, you want to understand what differentiates it, much as a race car driver wants to understand what makes a high-performance car fast. My in-depth architecture blog posts, such as An In-Depth Look at the HBase Architecture, Apache Drill Architecture: The Ultimate Guide, and How Stream-First Architecture Patterns Are Revolutionizing Healthcare Platforms, have been hugely popular, and I hope this one will be too.

In this blog post, I’ll give you an in-depth look at the MapR-DB architecture compared to other NoSQL architectures like Cassandra and HBase, and explain how MapR-DB delivers fast, consistent, scalable performance with instant recovery and zero data loss.

MapR-DB Comparison

Why NoSQL?

Simply put, the motivation behind NoSQL is data volume, velocity, and/or variety. MapR-DB provides for data variety with two different data models (a brief example follows below):

  • Wide column data model exposing an HBase API
  • Document data model exposing an Open JSON API (similar to the MongoDB API)

Multi-model flexibility
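
To make the two models concrete, here is a minimal sketch in Java (the table path, column family, and field names are illustrative assumptions, not taken from the post): the wide column model is written through the standard HBase client API, which MapR-DB binary tables expose, and the trailing comment shows how the same record might look as a JSON document in the document model.

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WideColumnWrite {
      public static void main(String[] args) throws Exception {
        // Wide column model: one row, columns grouped into a column family.
        try (Connection conn = ConnectionFactory.createConnection();
             Table table = conn.getTable(TableName.valueOf("/user/alice/trades"))) {
          Put put = new Put(Bytes.toBytes("AMZN_9999999999"));          // row key
          put.addColumn(Bytes.toBytes("price"), Bytes.toBytes("close"), // family:qualifier
                        Bytes.toBytes("978.5"));
          table.put(put);
        }
        // Document model: the same record as a single JSON document, e.g.
        // {"_id": "AMZN_9999999999", "price": {"close": 978.5}}
      }
    }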

Concerning data volume and velocity, a recent ESG Labs analysis determined that MapR-DB outperforms Cassandra and HBase by 10x in the cloud, with the operations/sec shown below (higher operations/sec is better):

mapr-db-performance-advantage

How do you get fast reads and writes with high performance at scale?

The key is partitioning for parallel operations and minimizing time spent on disk reads and writes.

Limitations of a Relational Model

With a relational database, you normalize your schema, which eliminates redundant data and makes storage efficient. Then you use indexes and queries with joins to bring the data back together again. However, indexes slow down data ingestion with lots of non-sequential disk I/O, and joins cause bottlenecks on reads with lots of data. The relational model does not scale horizontally across a cluster.


MapR-DB is designed to automatically partition by row key

With MapR-DB (HBase API or JSON API), a table is automatically partitioned across a cluster by key range, and each server is the source for a subset of a table. MapR-DB has a "query-first" schema design: queries should be identified first, then the row key should be designed to distribute the data evenly and also to give a meaningful primary index. The row document (JSON) or columns (HBase) should be designed to group together data that will be read together. With MapR-DB you de-normalize your schema to store in one row or document what would be multiple tables with indexes in a relational world. Grouping the data by key range provides for fast reads and writes by row key.

MapR-DB - Designed to scale horizontally
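
As an illustration of that query-first approach, here is a small, hypothetical sketch (the entity and field names are invented for the example): a composite row key that leads with an entity id to spread rows across tablets, followed by a reversed timestamp so each entity's newest events sort first, turning "latest events for an entity" into a short range scan by row key.

    import org.apache.hadoop.hbase.util.Bytes;

    public class RowKeyDesign {
      // Composite row key: entity id first (distributes rows by entity),
      // then a zero-padded reversed timestamp (newest events sort first).
      static byte[] rowKey(String userId, long eventTimeMillis) {
        long reversed = Long.MAX_VALUE - eventTimeMillis;
        return Bytes.toBytes(String.format("%s_%019d", userId, reversed));
      }

      public static void main(String[] args) {
        // All events for user "u42" land in one contiguous key range.
        System.out.println(Bytes.toString(rowKey("u42", System.currentTimeMillis())));
      }
    }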

You can read more about MapR-DB HBase schema design here, and a future post will discuss JSON document design.

How do NoSQL data stores get fast writes?

Traditional databases based on B-trees deliver super fast reads, but they are costly to update in real time because the disk I/O they do is disorganized and inefficient.

Fast Writes

(Image reference https://en.wikipedia.org/wiki/File:Btree_index.PNG)

Log-structured merge trees (or LSM trees) are designed to provide higher write throughput than traditional B-tree file organizations by minimizing non-sequential disk I/O. LSM trees are used by datastores such as HBase, Cassandra, MongoDB, and others. LSM trees write updates both to a log on disk and to memory. Updates are appended to the log, which is only used for recovery, and sorted in the memory store. When the memory store is full, it flushes to a new immutable file on disk.

LSM Trees
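
The following toy Java sketch, which is not any product's actual code, illustrates that write path: append each update to a log for durability, keep the keys sorted in memory, and spill the memory store to a new immutable sorted file when it fills up.

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Map;
    import java.util.TreeMap;

    public class TinyLsm {
      private final Path dir;
      private final BufferedWriter log;
      private final TreeMap<String, String> memstore = new TreeMap<>();
      private int flushCount = 0;

      TinyLsm(Path dir) throws IOException {
        this.dir = Files.createDirectories(dir);
        this.log = Files.newBufferedWriter(dir.resolve("wal.log"));  // recovery log
      }

      void put(String key, String value) throws IOException {
        log.write(key + "=" + value + "\n");    // 1. append to the on-disk log (recovery only)
        log.flush();
        memstore.put(key, value);               // 2. keep the update sorted in memory
        if (memstore.size() >= 1000) flush();   // 3. memory store full: spill to disk
      }

      private void flush() throws IOException {
        // Write an immutable, sorted file; reads may now have to consult several such files.
        Path sst = dir.resolve("sst-" + (flushCount++) + ".txt");
        try (BufferedWriter out = Files.newBufferedWriter(sst)) {
          for (Map.Entry<String, String> e : memstore.entrySet())
            out.write(e.getKey() + "=" + e.getValue() + "\n");
        }
        memstore.clear();
      }
    }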

Read Amplification

This design provides fast writes. However, as more new files are written to disk, if the queried row values are not in memory, multiple files may have to be examined to get the row contents, which slows down reads; this is called read amplification.

Read Amplification

Write Amplification

In order to speed up reads, a background process, triggered by command or on a schedule, "compacts" multiple files by reading them, merge-sorting them in memory, and writing the sorted key-values into a new, larger file.

Write Amplification
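
A compaction pass can be sketched roughly like this, reusing the toy "key=value per line" file format from the LSM sketch above (newest file last, so its values win):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeMap;

    public class TinyCompaction {
      // Merge several sorted flush files into one larger sorted file.
      static void compact(List<Path> sortedInputs, Path output) throws IOException {
        TreeMap<String, String> merged = new TreeMap<>();
        for (Path file : sortedInputs) {                 // oldest first, newest last
          for (String line : Files.readAllLines(file)) {
            int i = line.indexOf('=');
            merged.put(line.substring(0, i), line.substring(i + 1)); // newer value overwrites
          }
        }
        List<String> out = new ArrayList<>();
        merged.forEach((k, v) -> out.add(k + "=" + v));
        Files.write(output, out);                        // the input files can then be deleted
      }
    }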

However, compaction causes lots of disk I/O, which decreases write throughput; this is called write amplification. Compaction for Cassandra requires 50% free disk space and will bring the OS to a halt if compaction runs out of disk space. Read and write amplification cause unpredictable latencies with HBase and Cassandra. Recently, ESG Labs confirmed that MapR-DB outperforms Cassandra and HBase by 10x in the cloud, with predictable, consistently low latency (low latency is good; it means fast response).

Low Predictable Latency with MapR-DB

How does MapR-DB get predictable low latency?

LSM trees allow faster writes than B-trees, but they don't have the same read performance. Also, with LSM trees, there is a balance between compacting too infrequently (which hurts read performance) and too often (which hurts write performance). MapR-DB strikes a middle ground and avoids the large compactions of HBase and Cassandra as well as the highly random I/O of traditional systems. The result is rock-solid latency and very high throughput.

To understand how MapR-DB can do what others can’t, you have to understand the MapR converged platform. Apache HBase and Apache Cassandra run on an append-only file system, meaning it isn’t possible to update the data files as data changes. What’s unique about MapR is that MapR-DB tables, MapR Files and MapR-Event Streams are integrated into the MapR-XD high-scale, reliable, globally distributed data store. MapR-XD implements a random read-write file system natively in C++ and accesses disks directly, making it possible for MapR-DB to do efficient file updates instead of always writing to new immutable files.

MapR CDP

MapR-DB introduced a hybrid LSM-tree / B-tree design to achieve consistently fast reads and writes. Unlike a traditional B-tree, leaf nodes are not actively balanced, so updates can happen very quickly. With MapR-DB, updates are appended to small "micro" logs and sorted in memory. Unlike LSM trees, which flush to new immutable files, MapR-DB performs micro-reorganizations: memory is frequently merged into a read/write file system, meaning that MapR-DB does not need to do compaction! (Micro logs are deleted after the memory has been merged, since they are no longer needed.) MapR-DB reads leverage a B-tree to "get close, fast" to the range of possible values, then scan to find the matching data.
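
To illustrate the "get close, fast, then scan" idea, here is a conceptual sketch (not MapR-DB's actual on-disk structure): a sorted index is binary-searched to land near the start of the requested key range, and a short sequential scan then collects the matching rows.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class GetCloseThenScan {
      static List<String> rangeScan(String[] sortedKeys, String from, String to) {
        int i = Arrays.binarySearch(sortedKeys, from);
        if (i < 0) i = -i - 1;                    // insertion point: "close" to the range start
        List<String> hits = new ArrayList<>();
        while (i < sortedKeys.length && sortedKeys[i].compareTo(to) <= 0) {
          hits.add(sortedKeys[i++]);              // short sequential scan up to the end key
        }
        return hits;
      }
    }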

How do HBase and Cassandra do recovery?

Earlier I said that the log is only used for recovery, so how does recovery work? In order to achieve reliability on commodity hardware, one has to resort to a replication scheme, where multiple redundant copies of data are stored. How does HBase do recovery? HDFS breaks files into blocks of 64 MB. Each block is stored across a cluster of DataNodes and replicated two times.

Network

If a node crashes, the regions it was serving are reassigned to another region server. A reassigned region is available only after the log has been replayed, which means reading all of the not-yet-flushed updates from the log into memory and flushing them as sorted key-values into new files.

HDFS Data Node
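
A simplified view of that replay step, using the same toy log format as the earlier sketches: every not-yet-flushed edit has to be re-read from the WAL into a sorted memory store before the region can serve reads again, so recovery time grows with the size of the log.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.TreeMap;

    public class WalReplay {
      static TreeMap<String, String> replay(Path wal) throws IOException {
        TreeMap<String, String> memstore = new TreeMap<>();
        for (String line : Files.readAllLines(wal)) {    // "key=value" per line (toy format)
          int i = line.indexOf('=');
          memstore.put(line.substring(0, i), line.substring(i + 1));
        }
        return memstore;  // then flushed as sorted key-values into new files
      }
    }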

The HBase recovery process is slow and has the following problems:

  • HDFS does disk I/O in large blocks of 64 MB.
  • HDFS files are read-only (write-append). There's a single writer per file - no reads are allowed to a file while it's being written - and the file close is the transaction that allows readers to see the data. When a failure occurs, unclosed files are deleted.
  • Because of the layers between HDFS and HBase, data locality is only guaranteed when new files are written. With HBase, whenever a region is moved for failover or load balancing, the data is not local, meaning the region server is reading from files on another node.
  • Because the log (or WAL) is large, replay can take time.
  • Because of the separation from the file system, coordination is required between ZooKeeper, the HBase Master, the region server, and the NameNode on failover.

So how does recovery work for Cassandra? With Cassandra, the client performs a write by sending the request to any node, which acts as the coordinator (proxy) for the client. This coordinator node locates the N nodes that hold the data replicas and forwards the write request to all of them. If a replica node is down, the coordinator still writes to the other replicas, and the failed replica becomes inconsistent.

Cassandra Application
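
Conceptually, the coordinator's job looks like the sketch below (this models the behavior described above, not Cassandra's real code): forward the write to all of the replicas, count acknowledgements, and let the write succeed even if a replica is down, leaving that replica inconsistent until repair reconciles it.

    import java.util.List;

    public class CoordinatorWrite {
      interface Replica {
        void write(String key, String value);  // throws a RuntimeException if the node is down
      }

      // Forward a write to every replica; the write "succeeds" as long as enough
      // replicas acknowledge it, even though a downed replica silently falls behind.
      static int writeToReplicas(List<Replica> replicas, String key, String value) {
        int acks = 0;
        for (Replica r : replicas) {
          try {
            r.write(key, value);
            acks++;
          } catch (RuntimeException downedNode) {
            // This replica is now inconsistent; anti-entropy repair must fix it later.
          }
        }
        return acks;  // the caller compares this against the requested consistency level
      }
    }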

According to DataStax, downed nodes are a common cause of data inconsistency with Cassandra and need to be routinely fixed by manually running an anti-entropy repair tool. According to Robert Yokota from Yammer, Cassandra has not been more reliable than their strongly consistent systems, yet Cassandra has been more difficult to work with and reason about in the presence of inconsistencies.

How does MapR-DB get instant recovery with zero data loss?

MapR-DB gets instant recovery with zero data loss

To understand how MapR-DB can do what others can't, you have to understand the MapR file system's replication, which provides 24/7 reliability and zero data loss. In normal operations, the cluster needs to write and replicate incoming data as quickly as possible. When files are written to the MapR cluster, they are first sharded into pieces called "chunks." Each chunk is written to a container as a series of 8 KB blocks at an update rate of about 2 GB/second. Once the chunk is written, it is replicated as a series of blocks. This is repeated for each chunk until the entire file has been written. (You can see this animated in this MapR-XD video.) MapR ensures that every write that was acknowledged survives a crash.

MapR-XD uses three sizes

Where HDFS uses a single block size for sharding, replication location, and file I/O, MapR-XD uses three sizes. The importance of distinguishing these block sizes is in how each block is used (a rough back-of-envelope example follows the list below).

  • Having a larger shard size means that the metadata footprint associated with files is smaller.
  • Having a larger replication location size means that a fewer number of replication location calls are made to the Container Location Database (CLDB).

  • Having a smaller disk I/O size means that files may be randomly read and written.

MapR-FS differentiates use
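
Here is the rough back-of-envelope example promised above; the 256 MB shard size and the 10 GB file size are assumed, illustrative numbers rather than figures from the text.

    public class BlockSizeMath {
      public static void main(String[] args) {
        long fileBytes  = 10L * 1024 * 1024 * 1024;   // a 10 GB file (assumed)
        long shardBytes = 256L * 1024 * 1024;         // assumed larger shard size
        long hdfsBytes  = 64L  * 1024 * 1024;         // HDFS block size from the text
        // Fewer, larger shards mean less metadata and fewer location lookups,
        // while tiny 8 KB disk I/O still allows random reads and writes.
        System.out.println("shards to track:       " + fileBytes / shardBytes);  // 40
        System.out.println("64 MB blocks to track: " + fileBytes / hdfsBytes);   // 160
      }
    }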

With MapR-DB tables, as with HBase, continuous sequences of rows are divided into regions, or tablets; however, with MapR-DB the tablets "live" inside a container. Because the tables are integrated into the file system, MapR-DB can guarantee data locality, which HBase strives to have but cannot guarantee, since its file system is separate.

MapR-DB Tables

On a node failure, the Container Location Database (CLDB) fails over the primary containers that were being served by that node to one of their replicas. For a given container, the new primary serves the MapR-DB tables or files present in that container.

MapR-DB

The real differentiating feature is that MapR-DB can instantly recover from a crash without replaying any of the small micro logs, meaning users can access the database while the recovery process begins. If data is requested that requires the application of a micro log, it happens inline in about 250 milliseconds, so fast that most users won't even notice it. HBase and Cassandra use large recovery logs; MapR, on the other hand, uses about 40 small logs per tablet. Since today's servers have so many cores, it makes sense to write many small logs and parallelize recovery on as many cores as possible. Also, MapR-DB is constantly emptying micro logs, so most of them are empty anyway. This architecture means that MapR-DB can recover between 100x and 1000x faster from a crash than HBase. We call this Instant Recovery.
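
The intuition behind the many small logs can be sketched like this (replayOne() is a hypothetical stand-in for whatever applies a single micro log): with roughly 40 tiny, mostly empty logs per tablet, replay fans out across cores instead of grinding through one large WAL.

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ParallelRecovery {
      static void recover(List<String> microLogs, int cores) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        for (String log : microLogs) {
          pool.submit(() -> replayOne(log));   // each tiny log replays independently
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
      }

      // Hypothetical helper: apply the few (often zero) pending edits in one micro log.
      static void replayOne(String log) { }
    }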

This summarizes the features of MapR-DB tablets living inside a container:

  • Guaranteed data locality: Table data is stored in files which are guaranteed to be on the same node because they are in the same container.
  • Smarter load balancing uses container replicas: If a tablet needs to be moved for load balancing, it is moved to a container replica, so the data is still local.
  • Smarter, simpler failover uses container replicas
    • If a tablet needs to be recovered for failover, it is recovered from a container replica, so the data (the logs and table data in files) are still local.

MapR Converged Data Platform

Three core services, MapR-XD, MapR-DB, and MapR Event Streams, work together to enable the MapR Converged Data Platform to support all workloads on a single cluster that is distributed, scalable, reliable, and high-performance.

MapR Platform

This post gave an in-depth look at some, but not all, of the reasons for choosing MapR-DB. For more information, explore the links below.

Top 10 Reasons Developers Choose MapR-DB

This blog post was published September 29, 2017.