Data is the lifeblood of computing. As long as it flows seamlessly across a myriad of devices and software layers, life is good. To keep this data flowing, you need the ability to separate relevant data and make it available to applications.
Applications, such as those for ML/AI, are better off crunching numbers than waiting for data to arrive (being I/O bound). This can be achieved by 1) using faster I/O media and 2) spending less time looking for relevant data. A third, smarter approach is to combine both. Enter storage tiering.
Storage tiering is not new. Cost vs. performance has been the driving force behind the layering of data. Computer architectures were designed to have three storage tiers. Caches inside the CPU are the fastest, along with the registers that are like a scratchpad, used for executing program instructions. These caches access fast random-access memory (RAM) over a memory bus. Secondary storage (disk) forms the next tier, followed by tertiary storage devices, such as tape drives.
Computer Data Storage. Source: Wikipedia.
A modern-day data platform is an extension of the traditional model. As storage environments grow, an automated tiered platform is necessary, since 1) manual data movement is painstakingly slow, 2) the amount of data is always increasing, and 3) limited resources, such as network bandwidth, compute, and memory, leave administrators stretched too thin.
A tiered approach helps organizations optimize their infrastructure using a combination of storage solutions to lower costs, increase performance, and scale.
Performance vs. Capacity
Data tiering has evolved at both ends. Faster flash storage based on non-volatile memory express (NVMe) offers performance beyond traditional solid-state drives (SSDs). NVMe uses the PCIe interface, as opposed to the SATA interface used by earlier SSDs, to drive parallelism and hence performance. On the other hand, at roughly 2 cents per GB, cloud providers offer a compelling capacity option. Cloud-based archival storage is offered on inexpensive disks, and data from many users often resides on shared drives to push costs down. Storage tiers can be categorized as follows:
Performance Tier (Tier 1): This tier is highly available and delivers the best possible performance. Frequently accessed, business-critical data resides here. NVMe devices, SSDs, and high-performance disks can be used. Zero downtime with data redundancy is expected from this tier. Though the most expensive on a $/GB basis, this tier is the most cost-effective on an IOPS/$ basis.
Capacity Tier (Tier 2): This tier supports a broad range of business applications, such as email, ERP systems, and backups. It is desirable to have storage optimization techniques, such as erasure coding (EC) and compression, enabled at this layer. This tier offers a balance between cost, performance, and availability. Tier 2 solutions must securely store active business data when a sub-second response is not a hard requirement but a reasonably fast response is expected. Disk arrays are the storage devices of choice for a capacity tier.
Archive Tier (Tier 3): Historically, this tier was catered to by tape devices. Tapes are still around, but there is huge growth in cloud-based capacity. Pioneered by Amazon, the Simple Storage Service (S3) has evolved into a de facto protocol for storing data as objects in the cloud. An archive tier retains data that has infrequent access patterns, yet must be protected and kept for prolonged periods of time.
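The three tiers above map naturally onto how recently data was accessed. As a minimal sketch (the thresholds here are hypothetical, not from any vendor's policy), a tier classifier might look like this:

```python
from datetime import datetime, timedelta

# Hypothetical thresholds; real policies depend on workload and SLAs.
HOT_WINDOW = timedelta(days=7)     # accessed within a week  -> performance tier
WARM_WINDOW = timedelta(days=90)   # accessed within 90 days -> capacity tier

def pick_tier(last_access: datetime, now: datetime) -> str:
    """Classify data by recency of access into a storage tier."""
    age = now - last_access
    if age <= HOT_WINDOW:
        return "performance"   # Tier 1: NVMe/SSD
    if age <= WARM_WINDOW:
        return "capacity"      # Tier 2: disk arrays, erasure coded
    return "archive"           # Tier 3: tape or cloud object storage

now = datetime(2018, 6, 1)
print(pick_tier(datetime(2018, 5, 30), now))  # performance
print(pick_tier(datetime(2018, 4, 1), now))   # capacity
print(pick_tier(datetime(2016, 1, 1), now))   # archive
```

In practice, the decision would also weigh access frequency, data value, and compliance requirements, not recency alone.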
The advantage of software-defined storage (SDS) is the flexibility it offers in sizing the tiers to application needs. Unlike appliance vendors, SDS vendors do not lock you in. You can pick and choose your data tier and its corresponding capacity. Moreover, you can fine-tune the provisioning to varying business needs.
Examples of Application-Aware Data Tiering
MapR XD is a cloud-scale distributed file and object store, built from the ground up. MapR XD makes it easy to store data at exabyte scale and supports trillions of files, provides enterprise-grade features to be the system of record for large global enterprises, and uniquely combines analytics and operations into a single platform, enabling intelligent application development.
MapR offers data tiering capabilities in line with the industry's leading solutions. It ingests file data into the "hot" (performance) tier. Then, depending on customer-defined rules and schedules, the file data is offloaded to the "warm" (capacity) or "cold" (archive) tier. MapR Automated Storage Tiering (MAST) handles all of the data movement between the tiers. Some of these capabilities are planned for the upcoming 6.1 release of the MapR Data Platform.
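To make the rule-and-schedule idea concrete, here is a minimal sketch of how an offload pass might evaluate customer-defined rules against a set of files. The `OffloadRule` structure and rule semantics are hypothetical illustrations, not MAST's actual configuration syntax:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class OffloadRule:
    """A hypothetical customer-defined tiering rule (not MAST's actual syntax)."""
    min_age: timedelta       # only offload files untouched for at least this long
    target_tier: str         # "warm" or "cold"

def files_to_offload(files, rules, now):
    """Return (path, target_tier) pairs for files matching a rule.

    `files` maps path -> last-access time; rules are checked in order,
    so list the coldest rule (largest min_age) first.
    """
    plan = []
    for path, last_access in files.items():
        for rule in rules:
            if now - last_access >= rule.min_age:
                plan.append((path, rule.target_tier))
                break  # first matching rule wins
    return plan

rules = [
    OffloadRule(timedelta(days=365), "cold"),  # untouched for a year  -> archive
    OffloadRule(timedelta(days=30), "warm"),   # untouched for a month -> capacity
]
files = {
    "/data/logs/2016.log": datetime(2017, 1, 1),
    "/data/reports/q1.csv": datetime(2018, 4, 15),
    "/data/live/today.db": datetime(2018, 5, 31),
}
print(files_to_offload(files, rules, datetime(2018, 6, 1)))
```

A scheduler would run such a pass periodically; anything it selects is moved down a tier, while recently touched files stay on the hot tier.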
Data in the MapR performance tier is highly available and resilient for fast, reliable access. MapR Erasure Coding for the warm tier offers cost-optimized capacity with data protection. Erasure coding can tolerate up to three simultaneous failures, whereas RAID 6, which typically requires expensive dedicated hardware, tolerates only two. With MapR, customers can choose from multiple schemes of striping and parity.
Erasure coding (EC) schemes, their fault tolerance, overhead, and minimum topology
Having multiple erasure coding schemes makes it possible to define multiple tiers for warm data. If you need high reliability, a 6+3 EC scheme may be better suited than a 4+2 scheme. If you are looking for less storage overhead, a 5+2 EC scheme may work better. With MapR, administrators have the flexibility to tune EC for applications, while keeping the costs low, without compromising performance SLAs.
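The trade-offs between EC schemes follow directly from the arithmetic of a k+m layout: m parity stripes tolerate m failures, raw capacity per usable byte is (k+m)/k, and full fault-domain isolation needs one node per stripe. A small sketch makes the comparison between the schemes mentioned above explicit:

```python
def ec_profile(data: int, parity: int) -> dict:
    """Summarize a k+m erasure coding scheme.

    - fault_tolerance: up to `parity` stripes can be lost and data rebuilt
    - overhead: raw capacity consumed per byte of usable data, (k+m)/k
    - min_nodes: one node per stripe for full fault-domain isolation
    """
    return {
        "scheme": f"{data}+{parity}",
        "fault_tolerance": parity,
        "overhead": (data + parity) / data,
        "min_nodes": data + parity,
    }

for k, m in [(4, 2), (5, 2), (6, 3)]:
    p = ec_profile(k, m)
    print(f'{p["scheme"]}: tolerates {p["fault_tolerance"]} failures, '
          f'{p["overhead"]:.2f}x raw per usable byte, {p["min_nodes"]} nodes min')
```

Running this shows why the text's guidance holds: 6+3 tolerates three failures at 1.50x overhead, while 5+2 gets overhead down to 1.40x at the cost of tolerating only two.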
MapR Storage Tiering
MapR allows customers to create the cold tier on a cloud vendor of their choice. This avoids vendor lock-in and eliminates configuration, setup, and metadata management overhead.
Tiering was and still is an intelligent solution for managing ever-increasing data growth. It is possible to design effective data tiering without locking into expensive appliances. With MapR, you can organize data onto the hardware of your choice by specifying simple rules and schedules. MapR Automated Storage Tiering eliminates the need to move data manually while intelligently moving it to and from the cloud. With MapR, capacity and performance can be achieved simultaneously. Erasure coding makes the capacity tier a compelling choice, due to its resilience to failures and storage efficiency.