# Revealed: Why Hadoop Uses Three Replicas

###### Ted Dunning

There is some real math behind the idea that you need 3x replication in Hadoop. The basic idea is that when a disk goes bad, you lose an entire stripe of storage. Assuming that the contents are spread around the cluster fairly evenly, that stripe will be re-replicated by the rest of the cluster using a fraction of the network bandwidth. How much bandwidth gets used depends on whether you are willing to dedicate the entire cluster to recovery or need to keep some throughput available for normal work.

From these factors, you can figure out how quickly the fault will be healed. During that recovery window, there is some chance of losing r-1 more disks (where r is the replication factor). Since we can assume that at least some of the bits from the original lost stripe are found on every other disk in the cluster, losing these additional disks during the recovery window has a high probability of causing data loss.
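That probability is easy to sketch numerically. The snippet below is a minimal model, not anything from an actual Hadoop or MapR codebase: it assumes independent, exponentially distributed disk lifetimes, and the function name `p_additional_failures` and the example parameters (100 disks, 1,000,000-hour MTBF, 8-hour recovery window) are illustrative assumptions.

```python
import math

def p_additional_failures(n_disks, mtbf_hours, recovery_hours, r):
    """Probability that at least r-1 of the surviving disks fail
    during the recovery window after one disk has already failed.

    Assumes independent, exponentially distributed disk lifetimes;
    all parameters are illustrative, not measured values.
    """
    # Chance that one surviving disk fails within the recovery window
    p = 1.0 - math.exp(-recovery_hours / mtbf_hours)
    remaining = n_disks - 1
    k = r - 1
    # Binomial tail: P(X >= k) for X ~ Binomial(remaining, p)
    return sum(
        math.comb(remaining, i) * p**i * (1 - p)**(remaining - i)
        for i in range(k, remaining + 1)
    )

# Example: 100 disks, 1M-hour MTBF, 8-hour re-replication window
for r in (1, 2, 3):
    print(r, p_additional_failures(100, 1_000_000, 8, r))
```

Note how sharply the probability drops with each added replica: each extra replica multiplies in another factor of the (tiny) per-disk failure probability during the window.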

## Calculating the Rate for Data Loss Events

The probability of r-1 additional failures tells you the probability that any given disk failure will lead to data loss. The original failure rate will tell you roughly how many disk failures to expect in a year. Together, these calculations will give you the rate for data loss events. For general usage, I like to see a mean time between data loss (MTDL) greater than a century for sure, and preferably at or above 1000 years. With pretty reasonable assumptions about disk failure rates, 1 replica in a cluster gives <1 year MTDL, 2 gives about 10 years or so, and 3 replicas in a cluster provides >100 years MTDL.
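Putting those two pieces together gives a back-of-the-envelope MTDL estimate. This is a sketch under assumed parameters, not the author's actual calculation: `mtdl_years` is a hypothetical helper, the failure model is the simple independent-exponential one, and the cluster size, MTBF, and recovery window are made up for illustration (so the exact numbers will differ from the <1/~10/>100-year figures quoted above, which depend on the assumptions used).

```python
import math

def mtdl_years(n_disks, mtbf_hours, recovery_hours, r):
    """Rough mean time between data loss (years) for an n-disk cluster
    with r-way replication. Illustrative model: a data-loss event occurs
    when r-1 more disks fail during the recovery window after a failure.
    """
    hours_per_year = 24 * 365
    # Expected disk failures per year across the whole cluster
    failures_per_year = n_disks * hours_per_year / mtbf_hours
    # Chance that one surviving disk fails during the recovery window
    p = 1.0 - math.exp(-recovery_hours / mtbf_hours)
    remaining = n_disks - 1
    # Probability that at least r-1 additional disks fail in the window
    p_loss = sum(
        math.comb(remaining, i) * p**i * (1 - p)**(remaining - i)
        for i in range(r - 1, remaining + 1)
    )
    rate = failures_per_year * p_loss  # data-loss events per year
    return math.inf if rate == 0 else 1.0 / rate

# Assumed: 100 disks, 1M-hour MTBF, 8-hour recovery window
for r in (1, 2, 3):
    print(r, mtdl_years(100, 1_000_000, 8, r))
```

The structure of the answer is what matters: MTDL grows by orders of magnitude with each added replica, which is why three replicas comfortably clears a century while one does not.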

You also need to think about chassis loss in similar terms. Unless there are a very large number of disks (larger than you can reasonably do with HDFS, but not more than is practical with MapR-FS), chassis loss is a very small contributor to MTDL.

## The Case for Building a Discrete Event Simulation Model

The approximation above has problems when you have very few nodes, when replica locations are synchronized, or when replicas are concentrated on a few machines. To obtain a really good estimate, you need to build a complete discrete event simulation model that takes into account your data placement and replication policies, as well as the details of your particular system and hardware setup. This lets you study the failure behavior without actually running the algorithms on a real cluster.

For instance, MapR typically gives you a maintenance window on the first failure before replicating data on a downed node, but then it knocks the doors down to replicate the data if there is a second failure. That behavior isn't captured well in the original argument above. I have built such a simulator and can vouch for the fact that it is a pretty significant piece of work.
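To give a flavor of what such a simulator looks like, here is a toy discrete event loop. It is nothing like a full simulator (it has no data placement model, no maintenance window, and pessimistically assumes every disk holds a piece of every stripe, so r simultaneous failures always mean data loss); `simulate_mtdl` and all of its default parameters are illustrative assumptions.

```python
import heapq
import random

def simulate_mtdl(n_disks=100, r=3, mtbf_hours=1_000_000,
                  recovery_hours=8, horizon_years=10_000, seed=0):
    """Toy discrete-event simulation of disk failures and re-replication.

    Data loss is declared when r disks are down at once, under the
    pessimistic assumption that every disk holds part of every stripe.
    Returns years until first data loss, or None if none occurs within
    the horizon. All parameters and policies here are illustrative; a
    real simulator must model actual placement and replication policy.
    """
    rng = random.Random(seed)
    horizon = horizon_years * 24 * 365
    # Event queue of (time, kind, disk); kinds are 'fail' and 'repair'
    events = [(rng.expovariate(1 / mtbf_hours), 'fail', d)
              for d in range(n_disks)]
    heapq.heapify(events)
    down = set()
    while True:
        t, kind, disk = heapq.heappop(events)
        if t > horizon:
            return None  # no data loss within the simulated horizon
        if kind == 'fail':
            down.add(disk)
            if len(down) >= r:
                return t / (24 * 365)  # years until first data loss
            # Re-replication restores full redundancy after a fixed delay
            heapq.heappush(events, (t + recovery_hours, 'repair', disk))
        else:
            down.discard(disk)
            # Replacement disk gets a fresh exponential lifetime
            heapq.heappush(
                events, (t + rng.expovariate(1 / mtbf_hours), 'fail', disk))

# Estimate time-to-loss for 2-way replication over a few random seeds
print([simulate_mtdl(r=2, seed=s) for s in range(5)])
```

Even this toy version shows why the simulator route is real work: as soon as you add realistic placement policies, maintenance windows, and prioritized re-replication on a second failure, the event logic and state tracking grow quickly.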