Disaster recovery (DR) is the science of returning a system to operating status after a site-wide disaster. DR enables business continuity for significant data center failures that high availability features cannot cover. Computer systems generally support DR in two ways: backups and replication. Backups entail full or partial copies of data from the master cluster that are stored on separate media. Replication, also known as mirroring, continuously copies data from the master cluster to a geographically remote instance of the system (“replicas” or “mirrors”). For production deployments, mirroring is the preferred strategy for DR. With either method, a copy of the data is available to restore and thus recover from the disaster. DR with backups involves restoring the saved data into an alternate cluster and enabling that cluster as the new master. DR with mirroring entails activating the mirror, which already has the data loaded, as the new master cluster. (Note that “replication” is also used to refer to the copying of data within a cluster in a data center to eliminate single points of failure and enable high availability.)
In a related area, some systems support point-in-time snapshots, also known as checkpoints, to allow rolling data back to a prior state. This feature is generally used to recover from data corruption due to application or user error. For more information, please see the MapR Snapshots tech brief.
DR requires planning to determine two objectives. The recovery point objective (RPO) is a planned estimate of how much data the organization can afford to lose in case of a disaster. In other words, it is a measure of the level of potential data loss. The recovery time objective (RTO) is the amount of time the organization can be on hold while the system is being recovered. It is a measure of potential downtime. These two objectives indicate that DR is a sliding scale, so organizations must plan how much cost and effort should be applied to limit data loss and downtime. Lower RPO and RTO values give greater protection against data loss and downtime, but they take more resources to implement. Backups tend to be the much cheaper option, but they result in both higher RPO and higher RTO. Mirroring is more expensive due to the redundant hardware in the remote mirrors, but it lowers the risk of data loss.
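To make the two objectives concrete, the following minimal Python sketch checks whether a DR strategy's worst-case data loss and downtime fit within a given RPO and RTO. The class names, the `meets` function, and all of the figures are hypothetical and chosen only for illustration; they are not measurements of any product.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Objectives:
    """Recovery targets agreed on during DR planning."""
    rpo: timedelta  # maximum tolerable window of lost data
    rto: timedelta  # maximum tolerable downtime


@dataclass(frozen=True)
class Strategy:
    """Worst-case behavior of one DR approach (illustrative figures only)."""
    name: str
    worst_case_data_loss: timedelta
    worst_case_downtime: timedelta


def meets(strategy: Strategy, objectives: Objectives) -> bool:
    """Return True only if both the RPO and the RTO are satisfied."""
    return (strategy.worst_case_data_loss <= objectives.rpo
            and strategy.worst_case_downtime <= objectives.rto)


# Hypothetical targets and strategies, for illustration only.
objectives = Objectives(rpo=timedelta(minutes=15), rto=timedelta(hours=1))
nightly_backup = Strategy("nightly backup", timedelta(hours=24), timedelta(hours=8))
remote_mirror = Strategy("remote mirror", timedelta(minutes=10), timedelta(minutes=30))

for strategy in (nightly_backup, remote_mirror):
    verdict = "meets" if meets(strategy, objectives) else "misses"
    print(f"{strategy.name}: {verdict} the objectives")
```

With these assumed numbers, the nightly backup misses both targets while the remote mirror meets them, which reflects the cost-versus-protection tradeoff described above.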
The MapR Converged Data Platform includes backup and mirroring capabilities to protect against data loss after a site-wide disaster. MapR is the only big data platform that provides built-in, enterprise-grade DR for files, databases, and events. MapR was built to address real-world DR scenarios where lost data and downtime result in lost revenue, lost productivity, and/or failed opportunities.
To create backups, administrators first take a snapshot of the MapR cluster at the volume level. The snapshot includes all data in the volume, including files, MapR Database tables and documents, and MapR Event Store topics. The snapshot completes in a few seconds and represents a consistent view of the data. This means that, unlike on other big data platforms, the state of the snapshot never changes after it is taken. The snapshot can then be written to another medium as a backup.
In other big data platforms, snapshots might change over time, depending on the state of open files when the snapshot was taken. Also, partially written files won’t be captured when the snapshot is taken, making it difficult to create an accurate backup.
To create remote replicas, MapR provides two features that enable DR for different use cases: mirroring, and table and stream replication.
MapR mirroring is used to create remote mirrors of files. Mirroring supports the following characteristics that are critical for proper DR deployments:
Table and stream replication is the (near) real-time mechanism for replicating the data in MapR Database tables and the data in MapR Event Store topics. Since database and topic updates tend to occur much more frequently, rapidly, and granularly than file updates, this feature is required to minimize the differential between the master data and the replicas. Table and stream replication has the following advantages:
Once you’ve determined your DR strategy, and thus your RPO and RTO requirements, you can leverage MapR features to support that strategy. Assuming you have a business-critical environment, this discussion will skip the backup option and instead focus on mirroring and table and stream replication. In most big data deployments, especially on MapR, a combination of files, database tables, and streaming will be used, so using both features will enable a robust DR implementation.
Achieving Low RPO with Scheduled Mirroring
For files in your MapR cluster, use mirroring on a scheduled basis to ensure remote mirrors frequently get the latest updates. The window of potential data loss depends on how frequently your mirroring operations are scheduled.
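The relationship between the mirroring schedule and the window of potential data loss can be sketched with simple arithmetic. The example below assumes a simplified model in which a mirror operation starts every `interval` and takes `sync_duration` to finish, so data written just after one operation starts is not safe on the mirror until the next operation completes. The function names and this model are assumptions made for illustration, not a description of how MapR schedules mirroring internally.

```python
from datetime import timedelta


def worst_case_loss_window(interval: timedelta, sync_duration: timedelta) -> timedelta:
    """Worst-case age of data missing from the mirror under the assumed model.

    Data written right after a mirror operation starts waits one full
    interval for the next operation, plus the time that operation takes.
    """
    return interval + sync_duration


def max_interval_for_rpo(rpo: timedelta, sync_duration: timedelta) -> timedelta:
    """Longest schedule interval that still keeps the loss window within the RPO."""
    if sync_duration >= rpo:
        raise ValueError("a single mirror operation already exceeds the RPO")
    return rpo - sync_duration


# Hypothetical figures: mirror every 15 minutes, each run takes 5 minutes.
print(worst_case_loss_window(timedelta(minutes=15), timedelta(minutes=5)))
# Hypothetical target: a 30-minute RPO with the same 5-minute run time.
print(max_interval_for_rpo(timedelta(minutes=30), timedelta(minutes=5)))
```

Under these assumptions, tightening the RPO means either scheduling mirror operations more often or shortening each run, for example by mirroring smaller volumes.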
For an extra level of DR protection, such as to guard against multiple data center failures, you can use different mirroring topologies, including a cascaded mirror chain, to create multiple remote copies. Cascaded mirror chains are also useful for delivering mirror updates more efficiently. For example, if your master cluster is in New York and you want to mirror to Sydney and Singapore, it would make sense to mirror from New York to Sydney, and then mirror from Sydney to Singapore as the next link in the chain.
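A cascaded chain such as the New York, Sydney, and Singapore example can be modeled as a simple mapping from each mirror to its upstream source. The sketch below walks that mapping to estimate how stale each site could be in the worst case. The site identifiers, the per-hop lag figures, and the assumption that lag accumulates additively along the chain are all hypothetical and serve only to illustrate the topology.

```python
from datetime import timedelta

# Upstream source for each site; None marks the master cluster.
UPSTREAM = {"new-york": None, "sydney": "new-york", "singapore": "sydney"}

# Assumed worst-case lag each mirror adds relative to its upstream (hypothetical).
HOP_LAG = {"sydney": timedelta(minutes=20), "singapore": timedelta(minutes=10)}


def worst_case_staleness(site: str,
                         upstream: dict = UPSTREAM,
                         hop_lag: dict = HOP_LAG) -> timedelta:
    """Sum the per-hop lag from a site back up the chain to the master."""
    total = timedelta(0)
    visited = set()
    while upstream[site] is not None:
        if site in visited:
            raise ValueError("mirror topology contains a cycle")
        visited.add(site)
        total += hop_lag[site]
        site = upstream[site]
    return total


for name in UPSTREAM:
    print(f"{name}: up to {worst_case_staleness(name)} behind the master")
```

In this model the master sends updates to only one mirror, while mirrors further down the chain trade a larger worst-case lag for an additional remote copy.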
With database tables and stream topics, you automatically get low RPO since table and stream replication continuously transfers all database and topic updates to the remote clusters. This ensures that the master database and replica databases are closely synchronized. The window of potential data loss is never more than a few seconds.
MapR remote mirrors are initially read-only to prevent inadvertent writes to the replica that would result in inconsistency between master and mirror. But should a disaster occur, the mirror needs to be enabled as the (temporary) master cluster. The Promotable Mirrors feature lets you quickly activate (or “promote”) a mirror into a read/write state, thus enabling it for use as the new master cluster. This means that the bulk of the recovery time will be spent redirecting users at the network or application level to the new master cluster.
Since table and stream replication ensures tight synchronization of the master database tables and MapR Event Store topics with the replica tables and topics, and those replica tables and topics are already read/write enabled, no additional effort is required to activate a replica as the master. As with promotable mirrors, this means that the bulk of the recovery time will be spent redirecting users at the network or application level to the new master cluster.
When running a production deployment for big data, some of the same business continuity practices that you’ve applied in your existing data architecture must be used. Should you face a site-wide disaster, you want to make sure you have a strategy in place to minimize data loss and downtime. With the MapR Converged Data Platform, you get the enterprise-grade disaster recovery capabilities that you would expect from any production-grade software system. MapR lets you define low recovery point objectives and recovery time objectives to meet your business requirements, while also minimizing the administrative overhead to achieve those objectives.