Get Real with Hadoop: World's Fastest Hadoop


In this blog series, we’re showcasing the top 10 reasons customers are turning to MapR in order to create new insights and optimize their data-driven strategies. Here’s reason #2: MapR provides world record performance for Hadoop.

Bruce Lindsay, the venerable IBM database researcher, once said, "There are three things important in the database world: performance, performance, and performance." That's true of Hadoop as well. Almost thirty years ago, Turing Award winner Jim Gray helped launch database performance benchmarking with the TPC. He also created and maintained sort benchmarks. In the 80s and 90s, every chip, every system, and every database strove to top the TPC benchmarks. From transactions to transformations, sorting to scaling, TPC benchmarks pushed the hardware and software industry to break barriers continuously.

Use cases may have changed for Hadoop, but the need for speed hasn't. Hadoop is now pushing the performance boundary on all fronts: analytical jobs, operational jobs, data import and export, and node recovery. Now, let's look at the proof points and reasons why the MapR Distribution including Apache Hadoop is the world's fastest Hadoop.

1. World record under ten dollars.
A couple of years ago, MapR completed the sorting of 1 TB of randomly generated data in 54 seconds using 1003 nodes on Google Compute Engine and the MapR Distribution including Hadoop. The sort ran on 4012 cores, 1003 disks, and 1003 network ports. All this cost $9, compared to the previous record holder, which used $5 million worth of hardware. See the real-time recording of this work with explanation here.

As the volume and velocity of data increase, performance and efficient use of resources will matter even more. Not resting on the TeraSort record, the MapR team went on to set the MinuteSort record. A minute has 60 billion nanoseconds; MapR showed you can do a lot with them. Using 2103 nodes on Google Compute Engine, the MapR Distribution including Hadoop sorted 15 billion 100-byte records, totaling 1.5 terabytes, in 59 seconds. See details about this record here. (Note that this record was subsequently broken by a MapR customer, and the record now stands at 1.65 TB of data sorted in one minute.)
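
The scale of the MinuteSort record is easier to appreciate as raw throughput. Here is a quick back-of-the-envelope calculation using the record figures quoted above:

```python
# Back-of-the-envelope throughput for the MinuteSort run described above.
records = 15_000_000_000        # 15 billion records
record_size = 100               # bytes per record
seconds = 59                    # elapsed time of the sort

total_bytes = records * record_size                 # 1.5 TB (decimal)
throughput_gb_per_s = total_bytes / seconds / 1e9   # sustained sort rate

print(f"data sorted: {total_bytes / 1e12:.1f} TB")
print(f"throughput:  {throughput_gb_per_s:.1f} GB/s")
```

That works out to roughly 25 GB of data sorted every second, end to end, across the cluster.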

2. Ready for Internet of Things.
The Internet of Things (IoT) is really the internet of data. Seamless data exchange from devices to the fog layer to the cloud is what makes IoT click. Devices like Fitbit collect and transmit your health statistics. Machines in factories and instrumented vehicles all generate data, helping to predict and prevent failures. In these cases, every data point has a timestamp associated with it; hence, it's called time series data. OpenTSDB on MapR Database helps to model, manage, and analyze this time series data.
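
As a concrete illustration of the time series model, an OpenTSDB data point is essentially a metric name, a Unix timestamp, a numeric value, and a set of key/value tags. The helper below is a small hypothetical sketch (not MapR or OpenTSDB code) that renders one point in OpenTSDB's telnet-style `put` line format:

```python
def tsdb_put_line(metric, timestamp, value, **tags):
    """Render one data point in OpenTSDB's telnet-style 'put' format:
    put <metric> <unix_timestamp> <value> <tag1=v1> <tag2=v2> ..."""
    tag_str = " ".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"put {metric} {timestamp} {value} {tag_str}"

# A heart-rate reading from a fitness tracker, tagged by device and user
# (the metric and tag names here are illustrative, not a fixed schema).
line = tsdb_put_line("health.heart_rate", 1413262800, 72,
                     device="fitbit", user="u1001")
print(line)
# put health.heart_rate 1413262800 72 device=fitbit user=u1001
```

Tags are what let you slice the same metric by device, factory, or vehicle at query time.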

The data deluge from billions of diverse sensors brings volume, velocity, and variety into focus. The first order of business is to ensure we can ingest the data. Using just four nodes from PSSC Labs, the MapR team was able to ingest at a rate of 100 million data points per second. At that rate, a single data point for every man, woman, and child alive today can be ingested in under 72 seconds! Incredible. See details about loading a time series database here.
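
The "72 seconds" claim is simple arithmetic on the measured ingest rate, assuming a world population of roughly 7.2 billion (as of the original 2014 post):

```python
ingest_rate = 100_000_000          # data points/second (4-node PSSC Labs cluster)
world_population = 7_200_000_000   # approx. world population, circa 2014

seconds = world_population / ingest_rate
print(f"one point per person ingested in {seconds:.0f} seconds")  # 72 seconds
```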

For more information on open source tools such as OpenTSDB and new modifications that greatly speed up data ingestion, read the latest ebook, Time Series Databases: New Ways to Store and Access Data, by Ted Dunning and Ellen Friedman.

3. MapR File System (MapR XD) takes Hadoop to the next level.
Hadoop consists of the HDFS file system and the MapReduce execution engine. HDFS has a number of architectural bottlenecks affecting its performance. First, it's written in Java and sits on top of the Linux ext3/ext4 file system, so every application I/O operation goes through two layers of software: Java and then the Linux file system. As a result, HDFS delivers only a small fraction of the hardware's potential performance.

MapR XD is a 100% HDFS API-compatible distributed file system and runs all the popular open source Hadoop frameworks and packages. It starts with a solid lockless storage service as the base. MapR XD then takes on the infamous NameNode bottleneck in HDFS and eliminates it: in MapR XD, every node in the system manages a portion of the metadata as well as user data. MapR XD is a fully distributed file system, supporting not just append but random read-write as well. It's implemented in C/C++ and operates directly on the storage without going through the ext3/ext4 Linux file system. Hence, each operation on MapR XD has less work to do. Removing the Java dependency also removes events like garbage collection pauses that affect the latency of the file system.

Big data implies creating a large number of files on the Hadoop cluster. Performance tests such as DFSIO have shown that MapR XD can create thousands of files per second and can support trillions of files in a single cluster. This lets you load large amounts of data into Hadoop much faster.
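
To get a feel for what "thousands of files per second" means in practice, here is a rough, illustrative calculation (the 10,000/second rate below is an assumption for the sketch, not a published benchmark figure):

```python
create_rate = 10_000        # file creates/second (illustrative, "thousands/sec")
num_files = 100_000_000     # e.g. loading 100 million small files

hours = num_files / create_rate / 3600
print(f"loading {num_files:,} files: about {hours:.1f} hours")
```

At a rate that a NameNode-bound cluster struggles to sustain, a hundred-million-file load becomes an afternoon job rather than a multi-day one.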

4. Scalable resync.
MapR provides a scalable resync capability that does not trade reliability for performance. When you run hundreds or thousands of commodity nodes, disk or node failure is inevitable. In Hadoop, hardware can fail, but jobs must not. So recovery speed from node failure is critical. If you have a 100-node cluster, each node's data is divided into small blocks and spread across the other 99 nodes. When a node is lost, each surviving node re-syncs roughly one percent of the failed node's data. As the cluster size increases, the number of nodes participating in the resync increases, speeding up the process.
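
The scaling effect can be sketched numerically. Assume each node holds a fixed amount of data spread evenly across the other nodes, and each survivor can resync at a fixed rate (a simplified model with assumed numbers, ignoring network contention):

```python
def recovery_time(node_data_tb, cluster_size, resync_rate_tb_per_hr=1.0):
    """Time to re-replicate one failed node's data when the work is
    spread evenly across the remaining (cluster_size - 1) nodes."""
    share = node_data_tb / (cluster_size - 1)   # data each survivor resyncs
    return share / resync_rate_tb_per_hr        # hours

# Recovering a 10 TB node gets faster as the cluster grows.
for n in (10, 100, 1000):
    print(f"{n:5d} nodes: {recovery_time(10, n):.3f} hours")
```

The per-node share shrinks linearly with cluster size, which is why large clusters recover from a node failure faster, not slower.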

In addition, MapR XD replicates the data synchronously. On node failure, the recovery process detects the exact point of divergence and rolls forward or rolls back only the changes since the last consistent point. These factors point to the architectural groundwork MapR laid in building a high-performance Hadoop platform that is also scalable and reliable.

5. MapR Database: Wide-column, high-performance.
MapR Database does for HBase what MapR XD did for HDFS—improve performance, reliability, and ease of use. MapR Database is a wide-column distributed NoSQL database integrated into the MapR data platform. It shares the Google Bigtable approach to the data model with HBase and implements the HBase API. It has no region servers to manage, no compactions to schedule, and no garbage collections to tolerate. Each portion of a table maintains its own write-ahead log on each node. Hence, MapR Database can ingest at incredible speed. MapR Database also shows significant performance improvement over HBase across read-intensive, write-intensive, and balanced workloads.

The complete performance report for MapR Database is available here.

While MapR Database supports real-time applications, the data in MapR Database can also be used by analytical workloads such as MapReduce, Apache Hive, and Apache Drill queries. This provides a single platform for your operational and analytical workloads, eliminating the need for large-scale data movement. As they say, the best performing task is the one you don't have to do.

In the word MapR, if the letters M and R stand for Map-Reduce, the remaining letters "ap" should stand for availability and performance. I'll leave you with Muhammad Ali's quote in our MapR kitchen that inspires MapR engineers to reach even higher performance: "I'm so fast that last night I turned off the light switch in my hotel room and was in bed before the room was dark."

Be sure to check out the complete top 10 list here.

This blog post was published October 14, 2014.
