14 min read
If you are a developer or architect working with a highly performant product, you want to understand what differentiates it, similar to a race car driver driving a highly performant car. My in depth architecture blog posts such as An In-Depth Look at the HBase Architecture, Apache Drill Architecture: The Ultimate Guide, and How Stream-First Architecture Patterns Are Revolutionizing Healthcare Platforms, have been hugely popular and I hope this one will be too.
In this blog post, I’ll give you an in-depth look at the MapR-DB architecture compared to other NoSQL architectures like Cassandra and HBase, and explain how MapR-DB delivers fast, consistent, scalable performance with instant recovery and zero data loss.
Simply put the motivation behind NoSQL is data volume, velocity, and/or variety. MapR-DB provides for data variety with two different data models:
Concerning data volume and velocity, recent ESG Labs analysis determined that MapR-DB outperforms Cassandra and HBase by 10x in the Cloud, with the operations/sec shown below: (high operations/sec is good)
The key is partitioning for parallel operations, and minimizing time spent on Disk reads and writes.
With a relational database you normalize your schema, which eliminates redundant data and makes storage efficient. Then you use indexes and queries with joins to bring the data back together again. However indexes slow down data ingestion with lots of non sequential disk I/O and joins cause bottlenecks on reads with lots of data. The relational model does not scale horizontally across a cluster.
With MapR-DB (HBase API or JSON API), a table is automatically partitioned across a cluster by key range, and each server is the source for a subset of a table. MapR-DB has a “query-first” schema design, queries should be identified first, then the row key should be designed to distribute the data evenly and also to give a meaningful primary index. The row document (JSON) or columns (HBase) should be designed to group data together that will be read together. With MapR-DB you de-normalize your schema to store in one row or document what would be multiple tables with indexes in a relational world. Grouping the data by key range provides for fast reads and writes by row key.
You can read more about MapR-DB HBase schema design here, and a future post will discuss JSON document design.
Traditional databases based on b-trees deliver super fast reads, but they are costly to update in real-time because the disk I/O they do is disorganized and inefficient.
(Image reference https://en.wikipedia.org/wiki/File:Btree_index.PNG)
Log Structured Merge trees (or LSM trees) are designed to provide a higher write throughput than traditional B tree file organisations by minimizing non-sequential disk I/O. LSM trees are used by datastores such as HBase, Cassandra, MongoDB and others. LSM trees write updates to a log on disk and memory. Updates are appended to the log, which is only used for recovery, and sorted in the memory store. When the memory store is full, it flushes to a new immutable file on disk.
This design provides fast writes however as more new files are written to disk, if queried row values are not in memory, multiple files may have to be examined to get the row contents, which slows down reads.
In order to speed up reads, by command or schedule, a background process “compacts” multiple files by reading them, merge sorting them in memory and writing the sorted keyValues into a new larger file.
However compaction causes lots of disk I/O which decreases write throughput, this is called write amplification. Compaction for Cassandra requires 50% free disk space, and will bring the OS to a halt if compaction runs out of disk space. Read and write amplification cause unpredictable latencies with HBase and Cassandra. Recently ESG Labs confirmed that MapR-DB outperforms Cassandra and HBase by 10x in the Cloud, with predictable consistent low latency( low latency is good, means fast response).
LSM Trees allow faster writes than B-trees, but they don’t have the same read performance. Also, with LSM trees, there is a balance between compacting too infrequently (read performance can be impacted) or too often (write performance can be impacted). MapR-DB strikes a middle ground and avoids the large compactions of HBase and Cassandra as well as avoiding the highly random I/O of traditional systems. The result is rock-solid latency and very high throughput.
To understand how MapR-DB can do what others can’t, you have to understand the MapR converged platform. Apache HBase and Apache Cassandra run on an append-only file system, meaning it isn’t possible to update the data files as data changes. What’s unique about MapR is that MapR-DB tables, MapR Files and MapR-Event Streams are integrated into the MapR-XD high-scale, reliable, globally distributed data store. MapR-XD implements a random read-write file system natively in C++ and accesses disks directly, making it possible for MapR-DB to do efficient file updates instead of always writing to new immutable files.
MapR-DB invented a hybrid LSM-tree / B-tree to achieve consistent fast reads and writes. Unlike a traditional B-Tree, leaf nodes are not actively balanced, so updates can happen very quickly. With MapR-DB updates are appended to small “micro” logs, and sorted in memory. Unlike LSM-trees which flush to new files, with MapR-DB micro-reorganizations happen when memory is frequently merged into a read/write file system, meaning that MapR-DB does not need to do compaction! (Micro logs are deleted after the memory has been merged, since they are no longer needed.) MapR-DB reads leverage a B-Tree to “get close, fast” to the ranges of possible values, then scan to find the matching data.
How do HBase and Cassandra do recovery?
Before I said that the log is only used for recovery, so how does recovery work ? In order to achieve reliability on commodity hardware, one has to resort to a replication scheme, where multiple redundant copies of data are stored. How does HBase do recovery? HDFS breaks files into blocks of 64 MB. Each block is stored across a cluster of DataNodes. and replicated 2 times.
If an HDFS data node crashes, then the region will be assigned to another data node. The new Region is available after the Log has been replayed, which means reading all of the not yet flushed updates from the Log into memory and flushing them as sorted key values into new files.
The HBase recovery process is slow and has the following problems:
So how does recovery work for Cassandra? With Cassandra, the client performs a write by sending the request to any node, which will act as the proxy to the client. This proxy node will locate N corresponding nodes that hold the data replicas and forward the write request to all of them. If a node fails the coordinator still writes to the other replicas and the failed replica becomes inconsistent.
According to Datastax downed nodes are common causes of data inconsistency with Cassandra, and need to be routinely fixed by manually running an anti-entropy repair tool. According to Robert Yokota from Yammer, Cassandra has not been more reliable than their strongly consistent systems, yet Cassandra has been more difficult to work with and reason about in the presence of inconsistencies.
To understand how MapR-DB can do what others can’t you have to understand the MapR file system replication with 24/7 reliability and zero data loss. In normal operations, the cluster needs to write and replicate incoming data as quickly as possible. When files are written to the MapR cluster, they are first sharded into pieces, called 'chunks.' Each chunk is written to a container as a series of blocks of 8K at a 2gig/second update rate. Once the chunk is written, it is replicated as a series of blocks. This is repeated for each chunk until the entire file has been written. (You can see this animated in this MapR-XD video) MapR ensures that every write that was replied to survives a crash.
Where HDFS uses a single block size for sharding, replication location, and file I/O, MapR-XD uses three sizes. The importance of distinguishing these block sizes is in how the block is used.
Having a smaller disk I/O size means that files may be randomly read and written.
With MapR-DB tables, like with Hbase, continuous sequences of rows are divided into Regions or Tablets, however with MapR-DB the Tablets “live” inside a container. Because the tables are integrated into the file system MapR-DB can guarantee data locality, which HBase strives to have but can not guarantee since the file system is separate.
On a node failure the Container Location Database(CLDB) does a failover of the primary containers that were being served by that node to one of the replicas. For a given container, the new primary serves the MapR-DB tables or files present in that container.
The real differentiating feature is that MapR-DB can instantly recover from a crash without re-playing any of the small micro LOGs, meaning users can access the database while the recovery process begins. If data is requested which requires the application of a micro LOG, it happens inline in about 250 milliseconds, so fast that most users won’t even notice it. HBase and Cassandra use large recovery LOGs, MapR, on the other hand, uses about 40 small LOGs per Tablet. Since today’s servers have so many cores, it makes sense to write many small LOGs and parallelize recovery on as many cores as possible. Also, MapR-DB is constantly emptying micro LOGs, so most of them are empty anyway. This architecture means that MapR-DB can recover between 100X and 1000x faster from a crash than HBase. We call this Instant Recovery.
This summarizes the features of MapR-DB tablets living inside a container:
Three core services, MapR-XD, MapR-DB, and MapR Event Streams work together to enable the MapR converged data platform to support all workloads on a single cluster with distributed, scalable, reliable, high performance.
This gave an in depth look at some but not all of the reasons for choosing MapR-DB, for more information explore the links below.
More information about Cassandra repair and maintenance:
Stay ahead of the bleeding edge...get the best of Big Data in your inbox.