Data Tiering: A Capacity and Performance Juxtaposition

Data is the lifeblood of computing. As long as it flows seamlessly across a myriad of devices and software layers, life is good. To keep this data flowing, you need the ability to separate relevant data and make it available to applications.

Applications, such as those for ML/AI, are better off crunching numbers than sitting I/O bound, waiting for data to arrive. You can keep them busy by 1) using faster I/O media or 2) spending less time looking for relevant data. A smarter approach is to combine both. Enter storage tiering.

Traditional Storage Tiering

Storage tiering is not new. Cost vs. performance has always been the driving force behind the layering of data. Computer architectures were designed around three storage tiers. Primary storage is the fastest: the registers, which act like a scratchpad for executing program instructions, and the caches inside the CPU, which in turn access fast random-access memory (RAM) over a memory bus. Secondary storage (disk) forms the next tier, followed by tertiary storage devices, such as tape drives.

Computer Data Storage. Source: Wikipedia.

Scale-Out Modern-Day Data Platform

A modern-day data platform is an extension of the traditional model. As storage environments grow, an automated tiered platform is necessary, since 1) manual data movement is painstakingly slow, 2) the amount of data is always increasing, and 3) limited resources, such as network bandwidth, compute, and memory, leave administrators stretched too thin.

A tiered approach helps organizations optimize their infrastructure using a combination of storage solutions to lower costs, increase performance, and scale.

Performance vs. Capacity

Data tiering has evolved at both ends. At the performance end, non-volatile memory express (NVMe)-based flash storage offers faster options than conventional solid-state drives (SSDs). NVMe uses the PCI Express (PCIe) interface, as opposed to the SATA interface used by conventional SSDs, to drive parallelism and hence performance. At the capacity end, cloud providers offer a compelling option at 2 cents per GB. Cloud-based archival storage runs on inexpensive disks, and data from many users often resides on shared drives to push costs down. Storage tiers can be categorized as follows:

Performance Tier (Tier 1): This tier is highly available and delivers the best possible performance. Frequently accessed, business-critical data, the most valuable an organization holds, resides here. NVMe, SSDs, and high-performing disks can be used. Zero downtime with data redundancy is expected. Though the most expensive on a $/GB basis, this tier is the most cost-effective in terms of IOPS/$.

Capacity Tier (Tier 2): This tier supports a broad range of business applications, such as email, ERP systems, and backups. It is desirable to have storage optimization techniques, such as erasure coding (EC) and compression, enabled at this layer. This tier offers a balance between cost, performance, and availability. Tier 2 solutions must securely store active business data where a sub-second response is not necessarily required but a reasonably fast response is expected. Disk arrays are the storage devices of choice for a capacity tier.

Archive Tier (Tier 3): Historically, this tier was served by tape devices. Tapes are still around, but cloud-based capacity is growing rapidly. Pioneered by Amazon, the Simple Storage Service (S3) has become the de facto protocol for storing data as objects in the cloud. An archive tier retains data that is accessed infrequently yet must be protected and kept for prolonged periods of time.
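
The three tier descriptions above can be read as a simple placement decision. The following is a minimal sketch of that idea, not any vendor's algorithm: the `Dataset` fields, the `choose_tier` function, and both read-frequency thresholds are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    PERFORMANCE = 1  # Tier 1: NVMe, SSDs, high-performing disks
    CAPACITY = 2     # Tier 2: disk arrays with erasure coding and compression
    ARCHIVE = 3      # Tier 3: tape or cloud object storage


@dataclass
class Dataset:
    name: str
    reads_per_day: float     # observed access frequency
    business_critical: bool  # expects zero downtime and redundancy


# Hypothetical thresholds; real values depend on the workload and budget.
HOT_READS_PER_DAY = 100.0
COLD_READS_PER_DAY = 1.0


def choose_tier(ds: Dataset) -> Tier:
    """Place a dataset on a tier from its criticality and access frequency."""
    if ds.business_critical or ds.reads_per_day >= HOT_READS_PER_DAY:
        return Tier.PERFORMANCE
    if ds.reads_per_day >= COLD_READS_PER_DAY:
        return Tier.CAPACITY
    return Tier.ARCHIVE
```

In this sketch, a business-critical dataset always lands in Tier 1 regardless of how often it is read, which mirrors the expectation that the most valuable data gets the most available storage.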

SDS and Tiering

The advantage of software-defined storage (SDS) is the flexibility it offers on sizing the tiers to the application needs. Unlike appliance vendors, SDS vendors do not lock you in. You can pick and choose your data tier and its corresponding capacity. Moreover, you can fine-tune the provisioning to varying business needs. For example:

  1. Mission-critical databases have very high I/O performance needs. This data drives revenue generation and often includes important customer data, so flash-based storage provisioned with redundant copies is essential. More Tier 1 storage can be provisioned for such applications.
  2. Email archives can make use of erasure-coded, hard drive-based storage. Active emails can still be in Tier 1, which can be provisioned to be much smaller. Privacy policies permitting, some of the archive can be offloaded to the cloud.
  3. Long-term retention data, such as scientific, video surveillance, compliance, or medical data, can be stored in the cloud. For such applications, a much smaller Tier 1 and Tier 2 should suffice.

Examples of Application-Aware Data Tiering
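
To make the three provisioning examples concrete, here is a rough sketch of how an administrator might split a total capacity across tiers per application. The profile names and every percentage in `PROFILES` are hypothetical placeholders, not figures from any product or benchmark; only the relative shape (large Tier 1 for databases, large Tier 3 for retention data) follows the examples above.

```python
from typing import Dict

# Hypothetical share of each application's data placed on each tier.
PROFILES: Dict[str, Dict[str, float]] = {
    "mission_critical_db": {"tier1": 0.80, "tier2": 0.15, "tier3": 0.05},
    "email_archive":       {"tier1": 0.10, "tier2": 0.60, "tier3": 0.30},
    "long_term_retention": {"tier1": 0.05, "tier2": 0.15, "tier3": 0.80},
}


def provision(profile: str, total_tb: float) -> Dict[str, float]:
    """Split total_tb across tiers according to the named profile."""
    shares = PROFILES[profile]
    return {tier: round(total_tb * share, 2) for tier, share in shares.items()}
```

Calling `provision("email_archive", 100)` under these assumed shares would allocate most of the capacity to Tier 2, with a small Tier 1 for active mail.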

MapR Advantage

MapR XD is a cloud-scale distributed file and object store, built from the ground up. MapR XD makes it easy to store data at exabyte scale and supports trillions of files, provides enterprise-grade features to be the system of record for large global enterprises, and uniquely combines analytics and operations into a single platform, enabling intelligent application development.

MapR offers data tiering capabilities in line with the industry's leading solutions. It ingests file data into the "hot" (performance) tier. Then, depending on customer-defined rules and schedules, the file data is offloaded to the "warm" (capacity) or "cold" (archive) tier. MapR Automated Storage Tiering (MAST) handles all of the data movement between the tiers. Some of these capabilities are planned for the upcoming 6.1 release of the MapR Data Platform.

MAST capabilities:

  • Is distributed and can scale to handle varying volumes of data
  • Moves files across tiers, based on predefined rules and schedules
  • Is aware of security, compression, and performance needs
  • Can automatically provision for the cloud, for example by creating buckets and converting file data into objects
  • Understands erasure coding schemes and can stripe file data accordingly
  • Maintains statistics on data transfers for further insights
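
The idea of moving files across tiers based on predefined rules can be sketched generically as follows. This is not the MAST implementation or its API: `FileInfo`, `OffloadRule`, and `apply_rules` are hypothetical names, and the only rule type modeled is "idle for longer than a threshold".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class FileInfo:
    path: str
    tier: str              # "hot", "warm", or "cold"
    last_access: datetime


@dataclass
class OffloadRule:
    source: str            # tier the rule applies to
    target: str            # tier that matching files move to
    idle_for: timedelta    # minimum time since last access


def apply_rules(
    files: List[FileInfo], rules: List[OffloadRule], now: datetime
) -> List[Tuple[str, str, str]]:
    """Apply the first matching rule to each file and return the moves made.

    Each file moves at most one tier per run, as a scheduled job would do.
    """
    moves = []
    for f in files:
        for rule in rules:
            if f.tier == rule.source and now - f.last_access >= rule.idle_for:
                moves.append((f.path, rule.source, rule.target))
                f.tier = rule.target
                break
    return moves
```

A scheduler would call `apply_rules` periodically, for example with one rule that offloads hot files idle for 30 days to warm and another that offloads warm files idle for a year to cold. The returned list of moves doubles as a simple transfer log.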

Data in the MapR performance tier is highly available and resilient for faster, reliable access. MapR Erasure Coding for the warm tier offers cost-optimized capacity with data protection. Erasure coding can handle up to 3 failures. RAID, by comparison, needs expensive hardware, and even RAID 6 can handle only 2 failures. With MapR, customers can choose from multiple schemes of striping and parity.

Erasure Coding (EC) schemes: their fault tolerance, overhead, and minimum topology

Having multiple erasure coding schemes makes it possible to define multiple tiers for warm data. If you need high reliability, a 6+3 EC scheme may be better suited than a 4+2 scheme. If you are looking for less storage overhead, a 5+2 EC scheme may work better. With MapR, administrators have the flexibility to tune EC for applications, while keeping the costs low, without compromising performance SLAs.
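
The trade-off between these schemes follows from simple arithmetic. As a sketch, assuming a standard k+m erasure code in which k data fragments are protected by m parity fragments and any m fragment losses can be tolerated, the helper below (a hypothetical `ec_profile` function, not a MapR API) computes the fault tolerance and storage overhead of each scheme mentioned above.

```python
def ec_profile(data: int, parity: int) -> dict:
    """Summarize a data+parity erasure coding scheme."""
    total = data + parity
    return {
        "scheme": f"{data}+{parity}",
        "failures_tolerated": parity,                    # any `parity` fragments may be lost
        "overhead_pct": round(100 * parity / data, 1),   # extra raw capacity per unit of data
        "efficiency_pct": round(100 * data / total, 1),  # usable share of raw capacity
        "fragments_per_stripe": total,
    }


for k, m in [(4, 2), (5, 2), (6, 3)]:
    print(ec_profile(k, m))
```

Under this assumption, 6+3 tolerates three failures where 4+2 tolerates two, both at 50% overhead, while 5+2 lowers the overhead to 40% with the same two-failure tolerance as 4+2. For comparison, keeping three full replicas tolerates two failures at 200% overhead.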

MapR Storage Tiering

MapR allows customers to create the cold tier on a cloud vendor of their choice. This avoids vendor lock-in and eliminates configuration, setup, and metadata management overhead.

Summary

Tiering was, and still is, an intelligent solution for managing ever-increasing data growth. It is possible to design effective data tiering without locking into expensive appliances. With MapR, you can place data on the hardware of your choice by specifying simple rules and schedules. MapR Automated Storage Tiering eliminates the need to move data manually while intelligently scaling and translating it for the cloud. With MapR, capacity and performance can be achieved simultaneously. Erasure coding is a compelling choice for the capacity tier, due to its resilience to failures and storage efficiency.

This blog post was published June 25, 2018.