With ever larger clusters, maintaining high levels of reliability and availability is a growing problem for many enterprises. A particularly big concern is the reliability of storage systems. Failure of storage can not only cause temporary data unavailability but also, in the worst case, lead to permanent data loss. Additionally, technology trends and market forces may combine to make storage system failures occur more frequently in the future. The size of storage systems in modern, large-scale IT installations has grown to an unprecedented scale with thousands of storage devices, making component failures the norm rather than the exception.
A typical disk has an effective, useful life period of about 4-6 years.
Disk failure life cycle (source: “Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?”)
This may not appear to be so bad until you consider the fact that hard drive failure is the most common trigger for data loss. Hard drive failures account for at least 38% of all data loss incidents. According to BetaNews, this number can be as high as 72%.
Causes of Data Loss (source: a survey of 50 data recovery firms across 14 countries, DeepSpar Data Recovery Systems)
The cost of data loss depends on the application and the potential use of that data. In addition, there is a cost to recovery and lost productivity from downtime.
Data redundancy is the “go to” strategy of software-defined storage (SDS) for dealing with disk failures. In the SDS world, “disks are expected to fail,” and the ability to heal from failure is the core of what an SDS offers. This sounds very good in theory. In practice, however, it is challenging to get right. A good SDS does the following:
Makes redundant copies easily available. Poorly arranged copies, such as copies placed under the same server node or disk pool, make recovery a nightmare. Your data fabric should be able to distribute copies across nodes and racks for maximum availability.
Restores lost copies without the applications seeing any downtime.
Avoids consuming unbounded resources to achieve desired replication. This implies that the SDS lets you choose how aggressively it restores the desired replication.
Gives clear alerts when replication falls below desired levels. This helps administrators take corrective actions if needed. For example, it would sound an alarm to replace a failed disk.
Distributes load to avoid bottlenecking. This ensures that when recovering lost data, most of the available storage servers contribute to recovery, instead of a select few.
Makes addition and deletion of resources seamless. Additionally, it handles dropped servers as well as disks. In the event of new storage servers and disks becoming available, they are absorbed seamlessly.
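To make the first and fifth points in the list above concrete, here is a minimal sketch of rack-aware replica placement. It is not how any particular SDS implements placement; the function `place_replicas`, its inputs, and the "least-loaded node first" rule are illustrative assumptions.

```python
def place_replicas(nodes_by_rack, replicas, load):
    """Pick `replicas` nodes, spreading copies across racks first.

    nodes_by_rack: dict mapping rack name -> list of node names (hypothetical layout)
    load:          dict mapping node name -> amount of data already stored
    """
    # Within each rack, order nodes from least to most loaded.
    remaining = {
        rack: sorted(nodes, key=lambda n: load.get(n, 0))
        for rack, nodes in nodes_by_rack.items()
        if nodes
    }
    chosen = []
    while len(chosen) < replicas and remaining:
        # One pass over the racks: take the least-loaded unused node from each,
        # so a second copy lands in an already-used rack only when every
        # rack holds one copy.
        racks = sorted(remaining, key=lambda r: load.get(remaining[r][0], 0))
        for rack in racks:
            if len(chosen) == replicas:
                break
            chosen.append(remaining[rack].pop(0))
        # Drop racks that have no unused nodes left.
        remaining = {r: ns for r, ns in remaining.items() if ns}
    if len(chosen) < replicas:
        raise ValueError("not enough nodes for the requested replication")
    return chosen


layout = {"rack1": ["n1", "n2"], "rack2": ["n3"], "rack3": ["n4"]}
usage = {"n1": 10, "n2": 5, "n3": 7, "n4": 1}
print(place_replicas(layout, 3, usage))
```

With three racks and three copies, each copy lands in a different rack, so losing a whole rack still leaves two copies reachable.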
MapR XD is a distributed file and object store that underlies all MapR product offerings. It was built from the ground up to scale to over a thousand nodes. MapR XD intelligently distributes copies of data across available storage pools. A storage pool in MapR XD is a logical grouping of disks that can each be accessed by a cluster node. When configured correctly, MapR continues to work even after a rack failure (not just a server or disk failure).
MapR offers an intelligent POSIX client that tracks all the copies of data on the platform. If it cannot fetch data from the master copy, it knows where the redundant copies are. It does not need to query a metadata server for every request.
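The fallback behavior described above can be sketched in a few lines. This is only an illustration of the idea, not the MapR POSIX client: `read_with_failover`, `demo_fetch`, and the node names are hypothetical, and the sketch assumes the client already holds an ordered list of copy locations with the master first.

```python
def read_with_failover(copies, fetch):
    """Return data from the first reachable copy.

    copies: list of locations, master copy first (assumed already known
            to the client, so no metadata lookup happens per request)
    fetch:  callable that reads from one location and raises IOError on failure
    """
    last_error = None
    for location in copies:
        try:
            return fetch(location)
        except IOError as err:
            # This copy is unreachable; remember why and try the next one.
            last_error = err
    raise IOError("no copy of the data is reachable") from last_error


def demo_fetch(location):
    # Stand-in for a real read; pretend the master on node-a is down.
    if location == "node-a":
        raise IOError("node-a unreachable")
    return f"data from {location}"


print(read_with_failover(["node-a", "node-b", "node-c"], demo_fetch))
```

Because the list of copies is local to the client, a failed read costs one extra attempt against a replica rather than a round trip to a metadata service.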
Actual recovery times on disk failure will depend on many factors. Let's begin with the assumption that we have a 50-node cluster connected by a 10 GbE network, and a failed disk that takes 4 TB of data with it.
MapR has inbuilt smarts to ensure that any access to the lost disk is redirected to the copies of the data. Additionally, MapR defines two replication parameters: minimum replication and maximum replication. If the replication level falls below the maximum value but is still above the minimum value, MapR XD does nothing for 60 minutes. This gives time for nodes to self-heal. It also avoids reacting to pseudo-panic situations.
However, if the replication falls below the minimum replication factor, MapR gets to work right away. By default, MapR XD allocates only 20% of the network resources to restore desired replication. This is a tunable parameter and can be changed for conformance to varying SLAs.
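The two-threshold policy described above can be sketched as a small decision function. The names here (`ReplicationPolicy`, `recovery_action`) are hypothetical and not part of any MapR API. Two details are assumptions, since the text does not spell them out: that re-replication starts once the 60-minute wait expires, and that a copy count exactly equal to the minimum is treated as "wait" rather than "act now".

```python
from dataclasses import dataclass

GRACE_PERIOD_MINUTES = 60   # wait described in the text
DEFAULT_THROTTLE = 0.20     # share of network bandwidth used for recovery (tunable)


@dataclass
class ReplicationPolicy:
    min_replication: int
    max_replication: int


def recovery_action(policy, live_copies, minutes_under_replicated):
    """Decide what to do for data that currently has `live_copies` copies."""
    if live_copies >= policy.max_replication:
        return "none"                      # fully replicated
    if live_copies < policy.min_replication:
        return "re-replicate now"          # below the minimum: act immediately
    if minutes_under_replicated >= GRACE_PERIOD_MINUTES:
        return "re-replicate now"          # assumption: grace period has expired
    return "wait"                          # give the node time to come back


policy = ReplicationPolicy(min_replication=2, max_replication=3)
print(recovery_action(policy, live_copies=2, minutes_under_replicated=10))
print(recovery_action(policy, live_copies=1, minutes_under_replicated=0))
```

The grace period is what keeps a node reboot from triggering a cluster-wide copy storm, while the minimum threshold ensures that data at real risk is never left waiting.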
When a disk fails on a MapR cluster, all the nodes in the cluster start contributing to restore the lost data to the replication level as defined in the SLA (recommended 3x). Thanks to the MapR smart data layout, the bigger the MapR cluster, the faster the data recovery time.
A 10 GbE network carries 10 gigabits per second; dividing by 8 bits per byte gives a bandwidth of about 1.25 GB/sec. With the default 20% limit, each node can only read or write 250 MB per second (20% of 1.25 GB/sec). If MapR loses 4 TB of data from its storage pool, then to return to the replication level prior to the data loss, each node will have to contribute 4 TB/49 (the node with the failed disk has nothing to contribute). Taking 4 TB as 4,096 GB, 4 TB/49 is about 83.6 GB. For each node to read or write 83.6 GB at 250 MB/sec takes about 343 seconds, or roughly 6 minutes.
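The arithmetic above is easy to rerun for other cluster sizes. The sketch below is a back-of-the-envelope estimate under the same simplifications as the text (recovery is limited only by the throttled network, and work is spread evenly). The function name and parameters are illustrative, the 50-node cluster size is inferred from the division by 49, and the unit handling (decimal for the network, 1 TB = 1,024 × 1,024 MB for storage) is chosen to match the article's figures.

```python
def recovery_time_seconds(data_lost_tb, cluster_nodes, link_gbps=10.0, throttle=0.20):
    """Estimate the time to re-replicate `data_lost_tb` after a disk failure."""
    # The node with the failed disk has nothing to contribute.
    contributing_nodes = cluster_nodes - 1
    # 10 Gb/s divided by 8 bits per byte = 1.25 GB/s = 1250 MB/s.
    link_mb_per_sec = link_gbps / 8 * 1000
    # Only a fraction of the link (20% by default) is used for recovery.
    per_node_mb_per_sec = link_mb_per_sec * throttle
    # Each surviving node handles an equal share of the lost data.
    per_node_mb = data_lost_tb * 1024 * 1024 / contributing_nodes
    return per_node_mb / per_node_mb_per_sec


for nodes in (10, 50, 100):
    seconds = recovery_time_seconds(4, nodes)
    print(f"{nodes} nodes: {seconds:.0f} s (~{seconds / 60:.1f} min)")
```

For 4 TB on 50 nodes this gives about 342 seconds, in line with the figure in the text, and the loop shows how the estimate shrinks as the cluster grows.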
| Cluster Size | Data Lost (TB) | Approximate Time to Recover (Minutes) |
Approximate recovery times for different network and disk sizes
When a storage fabric maintains redundant copies of data and distributes them intelligently across the cluster, recovering from a disk failure is quicker. The bigger the cluster, the quicker the recovery. Additionally, disk failure should not bring your storage fabric to a crawl. Recovery can happen in a non-disruptive way. Finally, disk failures are a part of life. With MapR, customers can meet SLAs and mitigate the risk arising from disk, server, or rack failures.