A core differentiating component of the MapR Distribution including Apache™ Hadoop® is the MapR File System, also known as MapR-FS. MapR-FS was architected from its very inception to enable truly enterprise-grade Hadoop by providing significantly better performance, reliability, efficiency, maintainability, and ease of use compared to the default Hadoop Distributed File System (HDFS). We are also excited to point out that Google Capital, Qualcomm Ventures, and our prior investors (Lightspeed Venture Partners, Mayfield Fund, NEA, and Redpoint Ventures) all just validated this with a significant investment in MapR.
In my 2.5 years at MapR, a common question that I have gotten from customers is, “Isn’t HDFS going to eventually catch up to MapR-FS?” The simple answer is a resounding “NO”, and the reasons lie in the foundations of the two architectures. I will first describe these differences and then outline how the corresponding feature implementations vastly differ in their value to customers.
The picture below highlights the fundamental architectural differences as to why MapR-FS is far superior to HDFS:
So why do these architectural differences matter? And with HDFS 2.0 introducing NFS, Snapshots, NameNode Federation for scalability, NameNode HA, and claiming better performance, hasn’t HDFS caught up with MapR? Well, let’s peel back the onion and we’ll let you decide which is the architecture that you want supporting your mission-critical and production Hadoop-based applications. In Part 1 of this 3-part blog series, we’ll cover NFS and Snapshots.
Network File System (NFS)
Let’s first look at how MapR natively implemented a Network File System (NFS) interface to MapR-FS. Any existing application that reads from and writes to a file system, whether that is a local file system, Network Attached Storage, or a Storage Area Network, can read and write data in MapR-FS the same way. Per the above diagram, MapR-FS is a fully read/write file system, pretty much like any other file system that you’ve encountered, except of course CD-ROMs, FTP, and yep, HDFS. Why is this an issue? The NFS protocol requires a file system that can handle random writes; MapR-FS can do this, but HDFS cannot. Therefore, the HDFS NFS Gateway has to stage out-of-order writes in a temporary directory (/tmp/.hdfs-nfs by default) on the node's local file system (ext3, ext4, XFS, etc., all of which are read/write) before writing the data to HDFS. This is needed because HDFS doesn't support random writes, and the NFS client on many operating systems will reorder the write operations even when the application is writing sequentially.
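To make the reordering problem concrete, here is a minimal Python sketch of the general idea. It is a toy model, not the HDFS NFS Gateway's actual code: `AppendOnlySink` and `ReorderBuffer` are hypothetical names invented for this illustration. The sink only accepts writes at its current end, so any chunk that arrives early has to be held back until the gap before it is filled.

```python
class AppendOnlySink:
    """Hypothetical stand-in for a store that only accepts writes at its current end."""

    def __init__(self):
        self.data = bytearray()

    def append(self, offset, chunk):
        # Reject anything that is not exactly at the end: no random writes.
        if offset != len(self.data):
            raise ValueError("append-only sink rejects write at offset %d" % offset)
        self.data.extend(chunk)


class ReorderBuffer:
    """Hypothetical buffer that holds out-of-order writes until they are contiguous."""

    def __init__(self, sink):
        self.sink = sink
        self.pending = {}        # offset -> chunk still waiting for its turn
        self.peak_buffered = 0   # most bytes held back after any single write

    def write(self, offset, chunk):
        self.pending[offset] = chunk
        # Flush every chunk that is now contiguous with the end of the sink.
        while len(self.sink.data) in self.pending:
            next_offset = len(self.sink.data)
            self.sink.append(next_offset, self.pending.pop(next_offset))
        buffered = sum(len(c) for c in self.pending.values())
        self.peak_buffered = max(self.peak_buffered, buffered)


# Worst case for this toy model: the first chunk of the file arrives last.
sink = AppendOnlySink()
buf = ReorderBuffer(sink)
for offset, chunk in [(4, b"BBBB"), (8, b"CCCC"), (12, b"DDDD"), (0, b"AAAA")]:
    buf.write(offset, chunk)
```

In this run nothing reaches the sink until the chunk at offset 0 shows up, so the buffer has to hold nearly the whole file. A file system that accepts random writes would not need the staging step at all, because each chunk could be written at its own offset as it arrives.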
To quote the documentation from an HDFS-based MapR competitor: “NFS client often reorders writes. Sequential writes can arrive at the NFS gateway at random order. This directory is used to temporarily save out-of-order writes before writing to HDFS. One needs to make sure the directory has enough space. For example, if the application uploads 10 files with each having 100MB, it is recommended for this directory to have 1GB space in case if a worst-case write reorder happens to every file.” There are two implications to this workaround:
To sum it up, basically every production MapR customer is using our NFS to simplify the reading and writing of data from and to a MapR cluster because of the ease and real-time aspects of the MapR native implementation, whereas the HDFS NFS Gateway is simply a checkbox for HDFS-based distributions to say they have NFS, although there are virtually zero production implementations.
In looking at moving Hadoop into a production environment, you may want to ask yourself, “Do I want an NFS implementation that is native and part of the very inception of the given file system, or an inadequate implementation that was added on several years later in an attempt to “catch up” with the technology leader?”
For more details, please see the MapR Direct Access NFS™ technical brief at mapr-technicalbrief-direct-access-nfs-_161025.pdf
Snapshots
Snapshots are an incredibly efficient and effective approach in protecting data from accidental deletion and corruption due to errors in applications, without the need to actually copy the data within a cluster or to another cluster. The ability to create and manage Snapshots is an essential feature expected from enterprise-grade storage systems, and this capability is increasingly seen as critical with big data systems. A Snapshot is a capture of the state of the storage system at an exact point in time, and is used to provide a full recovery of data when lost. For example, with MapR, you can snapshot petabytes of data in seconds, as we simply maintain pointers to the locations of blocks that make up a Volume of data as opposed to the need to physically copy those petabytes into the local cluster or a remote one.
MapR Snapshots are “consistent,” which means that the moment you take one (on-demand or scheduled), you have an exact copy of what your data looked like at that point in time, and you keep it until you remove the Snapshot manually or an automated retention policy expires it. It’s “consistent” because the data in the Snapshot never changes, so you can always look back at a given Snapshot with certainty that the data will look exactly the same as when you took it.
HDFS Snapshots, on the other hand, are “inconsistent” in that you may take a Snapshot at a point in time, but the data within the Snapshot can easily change over time. This is because, unlike MapR-FS, which places file system metadata with its associated data, HDFS separates the metadata and stores it on the NameNode, and then places the data on different nodes known as DataNodes. This causes synchronization problems, because an HDFS Snapshot captures only the metadata, not the actual data and file length.
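The pointer idea can be illustrated with a short Python sketch. This is a toy model under stated assumptions, not how MapR-FS is actually implemented: the `Volume` class and its methods are hypothetical names made up for this example. Taking a snapshot copies only the table of block pointers, and later writes point the live table at new blocks instead of modifying old ones, so the snapshot keeps seeing the original data.

```python
class Volume:
    """Hypothetical toy volume: a table mapping block numbers to immutable contents."""

    def __init__(self):
        self.blocks = {}      # block number -> bytes (never mutated in place)
        self.snapshots = {}   # snapshot name -> frozen copy of the pointer table

    def write(self, block_no, data):
        # Point the live table at a new bytes object rather than changing
        # the old one, so existing snapshots still reference the old block.
        self.blocks[block_no] = bytes(data)

    def snapshot(self, name):
        # Copies only the pointer table; block contents are shared, not copied.
        self.snapshots[name] = dict(self.blocks)

    def read(self, block_no, snapshot=None):
        table = self.blocks if snapshot is None else self.snapshots[snapshot]
        return table[block_no]


vol = Volume()
vol.write(0, b"two friends")
vol.write(1, b"unchanged block")
vol.snapshot("monday")

# Writes after the snapshot change the live volume only.
vol.write(0, b"five friends")
vol.write(2, b"added later")

print(vol.read(0, snapshot="monday"))  # the snapshot still sees the original block
print(vol.read(0))                     # the live volume sees the new block
```

The cost of `snapshot` in this model depends on the size of the pointer table rather than on the amount of data, which is the intuition behind snapshotting a large volume quickly without copying it.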
Consider taking a snapshot with your camera. A consistent snapshot (like a MapR Snapshot) means that when you take a picture of two of your friends and go back and look at it a few seconds, minutes, hours, or days later, you still only see two friends in the picture. With inconsistent Snapshots (like HDFS’s), when you go back and look at the picture a few seconds, minutes, hours, or days later, there may be three, four, or five of your friends in the picture instead of the original two. So when you run analysis on your data with inconsistent Snapshots, your timestamp may say one thing while your data may say another.
Once again, in looking at moving Hadoop into a production environment, you may want to ask yourself, “Do I want a Snapshot implementation that is native and part of the very inception of the given file system, or an inadequate implementation that was added on several years later in an attempt to “catch up” with the technology leader?”
For more details, please see the MapR Snapshots technical brief at mapr-tech-brief-snapshots-161025.pdf.
The HDFS Snapshot JIRA took over five years to develop from start to finish (starting as HDFS-233 and rolling into HDFS-2802), yet the final product is not enterprise-ready. Why is this? The engineers who worked on the JIRA are very, very good, and have contributed an incredible amount to Hadoop, so it’s certainly not them. It is the architecture that they were forced to work with. HDFS was designed 10 years ago for one use case, crawling the web and indexing it with MapReduce, and it was great at that. But 10 years later, customers require their big data and Hadoop implementations to run mission-critical applications with enterprise storage functionality, and you need the right foundation to do this. MapR-FS is that foundation.
Please see the demo that compares HDFS NFS and Snapshots to MapR NFS and Snapshots here: