Get Real with Hadoop: Complete Data Protection and Disaster Recovery

In this blog series, we’re showcasing the top 10 reasons customers are turning to MapR in order to create new insights and optimize their data-driven strategies. Here’s reason #5: MapR provides complete data protection and disaster recovery with real snapshots and mirroring.

In a nutshell, no Hadoop distribution apart from MapR can provide you with the technology to implement business SLAs around Recovery Point Objectives (RPO) and Recovery Time Objectives (RTO). Here are the details.

A core differentiating component of the MapR Distribution including Apache™ Hadoop® is the MapR Distributed File and Object Store, also known as MapR XD. MapR XD was architected from its inception to enable truly enterprise-grade Hadoop by providing significantly better performance, reliability, efficiency, maintainability, and ease of use compared to the default Hadoop Distributed File System (HDFS). MapR XD's architecture uniquely enables Snapshots and Mirroring, features that are core to supporting your RPO and RTO for Hadoop.

Implement RPO with Snapshots
Snapshots are an efficient and effective way to protect data from accidental deletion and from corruption caused by application errors, without the need to actually copy the data within a cluster or to another cluster. The ability to create and manage Snapshots is an essential feature expected from enterprise-grade storage systems, and this capability is increasingly seen as critical for big data systems. A Snapshot captures the state of the storage system at an exact point in time, and is used to fully recover data when it is lost. For example, with MapR, you can snapshot petabytes of data in seconds, because we simply maintain pointers to the locations of the blocks that make up a Volume of data, rather than physically copying those petabytes within the local cluster or to a remote one.
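To make the pointer idea concrete, here is a minimal Python sketch of a pointer-based snapshot. It is a toy model, not MapR's implementation: the `Volume` class, its methods, and the snapshot name are all hypothetical. It assumes blocks are never modified in place, so a snapshot only needs to copy the table of block references.

```python
class Volume:
    """Toy volume: a table mapping block index -> immutable block contents."""

    def __init__(self):
        self.blocks = {}      # live pointer table: index -> bytes object
        self.snapshots = {}   # snapshot name -> frozen copy of the pointer table

    def write(self, index, data):
        # A write installs a new block object; existing block objects
        # are never modified in place.
        self.blocks[index] = bytes(data)

    def snapshot(self, name):
        # Copy only the pointer table (references), not the block contents.
        self.snapshots[name] = dict(self.blocks)

    def read(self, index, snapshot=None):
        # Read from the live table, or from a named snapshot's table.
        table = self.blocks if snapshot is None else self.snapshots[snapshot]
        return table[index]


vol = Volume()
vol.write(0, b"orders-v1")
vol.write(1, b"customers-v1")
vol.snapshot("sunday-2359")
vol.write(0, b"orders-v2")  # live data changes after the snapshot
```

In this sketch, reading block 0 from the live volume returns the new contents, while reading it through `"sunday-2359"` still returns the old contents. Block 1, which was never rewritten, is shared between the live table and the snapshot, so the cost of taking the snapshot depends on the size of the pointer table rather than on the amount of data.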

MapR Snapshots are “consistent,” which means that the second you take one (on demand or on a schedule), you have an exact copy of what your data looked like at that point in time. The Snapshot stays available until you decide it is no longer needed, either manually or through an automated retention policy. It’s “consistent” because the data in the Snapshot never changes, so you can always look back at a given Snapshot with certainty that the data looks exactly the same as when you took it.

HDFS Snapshots, on the other hand, are “inconsistent” in that you may take a Snapshot at a point in time, but the data within the Snapshot can easily change over time. This means an HDFS snapshot taken at 11:59 pm on Sunday could very well contain data from 2:20 am on Monday, which is hardly a consistent point-in-time recovery capability. This happens because, unlike MapR, which places file system metadata with its associated data, HDFS separates the metadata and stores it on the NameNode, and places the data on different nodes known as DataNodes. This causes synchronization problems, because an HDFS Snapshot captures only the metadata, not the actual data and file length.

Here is a detailed video that showcases the difference between MapR Snapshots and Snapshots with other distributions.

Implement RTO with Mirroring
Mirroring is the capability to keep your Hadoop data in sync across two different clusters. These clusters could be used for disaster recovery across different production sites (on-premises or cloud), or could simply be clusters that keep R&D environments in sync with production data. Mirroring has proven to be a critical feature for MapR customers who require disaster recovery capabilities for Hadoop.

Mirroring, like Snapshots, can be set up quite easily using the MapR Control System and scheduled to run at preset time slots. Moreover, by leveraging the MapR volumes feature, different data sets can be custom-scheduled to enable different levels of recovery objectives.
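As a rough illustration of how per-volume schedules relate to recovery objectives, the sketch below checks a set of mirror intervals against recovery point targets. The volume names, intervals, and targets are made-up values, and the rule used (worst-case data loss is about one mirror interval plus the transfer time) is a simplifying assumption, not a MapR formula.

```python
from datetime import timedelta

# Hypothetical volumes, each with an assumed mirror interval and RPO target.
schedules = {
    "transactions": {"interval": timedelta(minutes=15), "rpo": timedelta(minutes=30)},
    "clickstream": {"interval": timedelta(hours=6), "rpo": timedelta(hours=4)},
    "archive": {"interval": timedelta(days=1), "rpo": timedelta(days=2)},
}


def worst_case_loss(interval, transfer_time=timedelta(0)):
    # Assumption: a failure just before the next mirror completes loses
    # everything written since the previous mirror started.
    return interval + transfer_time


def rpo_violations(schedules, transfer_time=timedelta(0)):
    # Return the names of volumes whose schedule cannot meet their target.
    return sorted(
        name
        for name, cfg in schedules.items()
        if worst_case_loss(cfg["interval"], transfer_time) > cfg["rpo"]
    )


violations = rpo_violations(schedules)
```

With these example values, only the `clickstream` volume is flagged: a six-hour mirror interval cannot satisfy a four-hour target, so that volume would need a more frequent schedule.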

MapR internally uses the same Snapshot technology described above to capture only the data that has changed, at the file-block level, since the last data transfer. Once the data differential is identified, it is compressed and transferred over the WAN to the recovery site, using very little network bandwidth. Finally, checksums are used to verify data integrity across the two clusters. As with Snapshots, Mirroring imposes no performance penalty on the cluster.
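The three steps in this description (find the changed blocks, compress them, verify with checksums) can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not MapR's wire protocol: the function names, the 4-byte block size, and the choice of zlib and SHA-256 are all assumptions made for the example.

```python
import hashlib
import zlib

BLOCK_SIZE = 4  # tiny block size so the example is easy to follow


def split_blocks(data, size=BLOCK_SIZE):
    return [data[i:i + size] for i in range(0, len(data), size)]


def block_delta(previous, current):
    """Return {block index: compressed block} for blocks that changed."""
    old, new = split_blocks(previous), split_blocks(current)
    delta = {}
    for i, block in enumerate(new):
        if i >= len(old) or old[i] != block:
            delta[i] = zlib.compress(block)
    return delta


def apply_delta(replica, delta, new_length):
    """Patch the replica with the changed blocks, then fix its length."""
    blocks = split_blocks(replica)
    for i, payload in sorted(delta.items()):
        block = zlib.decompress(payload)
        if i < len(blocks):
            blocks[i] = block
        else:
            blocks.append(block)
    return b"".join(blocks)[:new_length]


def checksum(data):
    return hashlib.sha256(data).hexdigest()


source_before = b"aaaabbbbcccc"    # state at the previous mirror
source_after = b"aaaaBBBBccccdd"   # one block changed, one block appended
replica = source_before            # recovery site matches the old state

delta = block_delta(source_before, source_after)
replica = apply_delta(replica, delta, len(source_after))
in_sync = checksum(replica) == checksum(source_after)
```

Here only blocks 1 and 3 are sent, while the unchanged blocks 0 and 2 never cross the network. The final checksum comparison confirms that the replica matches the source.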

Compare this mechanism with an HDFS-based disaster recovery option, where full copies of all touched files are sent over the WAN at set intervals. The amount of data duplication, the level of network bandwidth usage, and the inefficiencies it creates when you actually have to recover the application are detrimental to the point that such a feature becomes completely unusable.

As you can see, MapR truly enables business continuity for Hadoop at the level you would expect from your enterprise-grade software platforms. This is where MapR brings in the real difference for you—nobody else comes close.

Be sure to check out the complete top 10 list here.

This blog post was published October 11, 2014.
