In a Technical Insight paper entitled “Advancing Hadoop—MapR’s M7 Edition,” Evaluator Group looked at HBase as the engine that supports database applications running on Hadoop. The significance of HBase to the enterprise, we believe, cannot be overstated. Enterprise Hadoop users will naturally want to move Hadoop beyond batch processes to supporting applications that process and analyze business transactions. The HBase NoSQL database is fully compatible with Hadoop and is therefore ideally suited to this purpose. Since we wrote the M7 review, which is recapped later in this paper, MapR has announced a set of performance benchmarks that demonstrate consistent HBase response times on MapR’s M7. This Evaluator Group Client Advisory Note reports these results and explores the reasons for them and their implications.
To address some of the operational and performance issues of the standard Apache distribution of HBase, MapR introduced its M7 Edition in 2012. M7 unifies the storage and processing of Hadoop HDFS-based files, as well as HBase tables, on a single platform.
The MapR M7 performance benchmarks were performed using the Yahoo! Cloud Serving Benchmark (YCSB), a conventional standard for NoSQL performance testing. According to Yahoo Labs, the goal of the YCSB project is to develop a framework and common set of workloads for evaluating the performance of different “key-value” and “cloud” serving stores. A common use of the tool is to benchmark multiple systems and compare them.
The YCSB benchmark routine was used by MapR to measure M7 system latency as compared to Apache HBase on a ten-node cluster using a 50%/50% update/read workload.
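To make the 50%/50% update/read workload concrete, the following is a minimal Python sketch of such an operation mix. It is not YCSB itself and not the harness MapR used: the function name `run_mixed_workload`, the in-memory dictionary standing in for a table, and the key names are all assumptions chosen for illustration. It only shows the idea of issuing reads and updates in equal proportion against a key-value store and timing each operation.

```python
import random
import time


def run_mixed_workload(store, keys, operations, read_fraction=0.5, seed=0):
    """Issue a read/update mix against `store`; return latencies in ms per operation type.

    `store` is any dict-like object (hypothetical stand-in for a key-value table).
    """
    rng = random.Random(seed)  # fixed seed so the operation sequence is repeatable
    latencies = {"read": [], "update": []}
    for _ in range(operations):
        key = rng.choice(keys)
        # Choose the operation type according to the requested read fraction.
        op = "read" if rng.random() < read_fraction else "update"
        start = time.perf_counter()
        if op == "read":
            store.get(key)
        else:
            store[key] = rng.getrandbits(32)
        latencies[op].append((time.perf_counter() - start) * 1000.0)
    return latencies


# Illustrative run against an in-memory dict (assumed data, not a real cluster).
store = {f"user{i}": 0 for i in range(1000)}
result = run_mixed_workload(store, list(store), operations=10_000)
print(len(result["read"]), len(result["update"]))
```

In an actual benchmark the dictionary would be replaced by a client for the system under test, and the recorded latencies would be compared across systems running the same operation sequence.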
MapR reports that latency for M7 was consistently between 12 and 42 milliseconds, with an average of 12.470. This is within the range of server/storage I/O latencies observed on commonly used enterprise NAS arrays supporting OLTP applications.
MapR believes this result to be at least four times better than that exhibited by a 10-node Apache Hadoop/HBase cluster running the same benchmark under the same conditions.
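Claims of this kind rest on two simple calculations: a summary of each system's latency distribution (range, mean, and tail), and a ratio between the two systems. The sketch below shows one way to compute both in Python. The function names `summarize_latency` and `latency_ratio` are hypothetical, the nearest-rank percentile is our choice rather than anything MapR documented, and the sample values are invented purely to exercise the code; they are not benchmark results.

```python
import math
import statistics


def summarize_latency(samples_ms):
    """Return min, max, mean, and nearest-rank 99th percentile of latency samples (ms)."""
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(0.99 * len(ordered)))  # nearest-rank percentile index (1-based)
    return {
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.fmean(ordered),
        "p99": ordered[rank - 1],
        # Max/min ratio: a crude indicator of how consistent response times are.
        "spread": ordered[-1] / ordered[0],
    }


def latency_ratio(baseline_ms, candidate_ms):
    """Mean baseline latency divided by mean candidate latency (above 1 means candidate is faster)."""
    return statistics.fmean(baseline_ms) / statistics.fmean(candidate_ms)


# Invented sample values for illustration only.
candidate = [10.0, 20.0, 30.0, 40.0]
baseline = [100.0, 100.0, 100.0, 100.0]
print(summarize_latency(candidate))
print(latency_ratio(baseline, candidate))
```

A narrow gap between the minimum, the mean, and the tail percentile is what "consistent" latency means in practice; a low mean alone does not show it.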
While Evaluator Group did not perform or audit the YCSB benchmark test reported above, we did interview an M7 user, Solutionary, for real-world validation of performance. Solutionary is a managed security service provider (MSSP) that delivers managed security services and professional consulting services to mid-sized organizations, government entities and large, global enterprises to reduce risk, increase data security and support compliance initiatives. Solutionary compared M7 to their existing Oracle RAC and found that they could replace the Oracle RAC system with a 32-node M7 cluster.
The reason for considering the replacement in the first place was simple economics. Data growth was described as “horrendous” due to the acquisition of new clients and the development of additional revenue-producing services. M7’s ability to scale at lower cost made it an attractive alternative to Oracle RAC, whose cost limited growth.
Solutionary’s experience so far is that transactional performance is well within their service delivery requirements and that latency for getting data into the system is minimal. In fact, they have yet to see a limitation there. HBase was not considered as an alternative for two reasons. First, because they process 4B lines of code per day, the fact that HBase is supported by Hadoop code written in Java was seen as an inhibitor. In addition, Apache Hadoop/HBase would have required the creation of two additional copies of data upon data ingest before the data could be made available for the continuous, real-time processing Solutionary does for its customers. This requirement was simply not acceptable.
There are a number of factors that, when taken in aggregate, explain these results:
There are other factors that don’t necessarily contribute to performance, but do enhance M7’s suitability to production-grade transaction processing environments where outages can’t be tolerated. The M7 architecture does away with the HBase RegionServers mentioned above as well as the need for data compactions, which in HBase must be done on a regular basis to avoid performance degradation. Major compactions are manually controlled and require downtime because all HBase data is rewritten during major compactions. MapR M7 eliminates the need for such downtime.
The MapR Distribution for Hadoop, including the M7 Edition, is based on a distributed file system that eliminates many of the current shortcomings in HDFS that give enterprise IT administrators pause. We have seen from publicly available survey data that upwards of 50% (a conservative estimate) of all enterprise IT-level Hadoop projects fail or are put on hold. This is due to a number of factors, including a lack of Java programmers on staff (Hadoop is written in Java, as noted above, while MapR’s Distribution for Hadoop is written in C/C++) and the limitations inherent in HDFS when viewed from the perspective of enterprise data center production environments. We have identified these limitations in previous research.
We also noted in the MapR M7 paper mentioned earlier that the convergence of HBase with MapReduce analytics under the Hadoop processing umbrella can offer significant advantages to enterprise IT administrators who are assessing Hadoop compared to the more traditional styles of database operations and data warehousing that have been in use for decades. These advantages include greater Hadoop-related hardware and network efficiency and cost advantages vs. the traditional enterprise data warehouse, coupled with a forward progression toward real-time online applications and analytics capabilities.
We then introduced MapR’s M7 as the first Hadoop-plus-NoSQL database distribution to address many of HBase’s limitations, including the considerable latency introduced by data movement into, out of, and within the Hadoop cluster. We noted that enterprises were certainly going to want to use Hadoop to analyze data generated by OLTP applications, but that data movement from these systems into Hadoop was an inhibitor. We concluded that by converging database processes with Hadoop’s MapReduce processes, one could support both database and analytics applications from the same processing cluster, thereby eliminating data movement.
But perhaps HBase’s biggest limitation, from the standpoint of the enterprise user wanting to move Hadoop into a real-time transaction processing and analytics environment, is the lack of consistent and deterministic performance. Depending on a number of factors, including disk I/O latency, I/O storms, and data locality, Apache HBase response times can diverge greatly, something that users of customer-facing transactional applications won’t tolerate. MapR’s convergence of a NoSQL database processing engine with Hadoop analytics advances the proposition that Hadoop can be used to support SQL-based OLTP applications in real time and in a production data center setting, and that analytic processes can be run against that data in place. These test results confirm that proposition.