The ability to create and manage snapshots is an essential feature expected from enterprise-grade storage systems. This capability is increasingly seen as critical with big data systems as well. A snapshot captures the state of the storage system at an exact point in time and is used for recovering data that was lost or corrupted due to application error, user error, or a malicious event.
Snapshots within a big data context are useful in both storage and compute. The MapR Converged Enterprise Edition of the MapR Converged Data Platform provides consistent snapshots that offer the following benefits:
RECOVERY FROM ERRORS
Operational as well as analytical applications manipulate the data in the distributed file system on behalf of the user or the administrator. Application-level errors or even inadvertent user errors can mistakenly delete data or modify data in an unexpected way. In this case, snapshots can be used to restore data quickly and easily.
MANAGING REAL-TIME DATA ANALYSIS
By using snapshots, query engines like Apache Drill can produce precise SQL query results against data sources subject to constant updates, such as sensor data or social media streams. When new SQL queries are written, either to achieve more specific data results or to improve the performance of existing queries, it is difficult to assess the improvement if the underlying data is constantly changing. Snapshots allow new queries to run against a specific point-in-time image of the data to make "apples-to-apples" comparisons against existing queries. Snapshots ensure the data for both queries is constant, even in a scenario where data is continuously being ingested in real time.
MACHINE LEARNING MODEL TRAINING
Machine learning frameworks, such as TensorFlow, can use snapshots to get a known, unchanging, point-in-time view of an otherwise continuously changing data set. These snapshots allow for an auditable model training process, where ongoing updates to machine learning models can be run against a static baseline data set. This helps to identify actual improvements in the model, since the snapshot data remains unchanged.
VERIFYING DATA LINEAGE
Data lineage helps businesses track the data sources, transformations, and destinations as the data flows through MapR. It is often used to identify the cause of any irregularities in a data set or report. However, to investigate an observed irregularity, auditors have to view the data as it was when it was created. Snapshots create an immutable view of the data to ensure the lineage cannot be corrupted by later modifications.
Organizations often have to demonstrate they were in compliance with required standards over a certain time period. In such a scenario, snapshots can help businesses show an immutable view of data at a certain point in time to prove they were in compliance.
MULTIVARIATE AND A/B TESTING AND DATA VERSIONING
Snapshots can be used to create immutable views of the results of multivariate and A/B testing. This means the results can be compared and analyzed while being safeguarded against inadvertent or even malicious modifications.
With snapshots, you can create a baseline for consistent backups. Snapshots can be created while the system is running, thus allowing you to create consistent backups without having to halt data updates in the system. Snapshots also allow you to back up and restore online MapR Database tables (using the CopyTable functionality).
In MapR, a snapshot is a read-only image of a volume at a specific point in time. Snapshots are created at the volume level, not the entire cluster level, which means you can isolate them to specific segments of your data sets. You can create a snapshot in an ad hoc manner or automate the process with a schedule. A snapshot takes almost no time to create and uses very little incremental disk space because it is not a standard copy of the data. Snapshots are implemented with a method known as "redirect-on-write," which is very space-efficient. You can take a snapshot of a 1-petabyte cluster in seconds with no additional data storage required.
The redirect-on-write method provides protection without duplicating the data. It uses pointers to keep track of data blocks. If a block needs modification, MapR simply redirects the pointer for that block to another block and writes the new data there (i.e., it redirects on writes). MapR keeps track of all the blocks that comprise a given snapshot. If a process accesses data in a given snapshot, it simply uses these pointers to access those blocks.
Here’s a brief explanation of how redirect-on-write works:
Time T0. File F1 is written. It uses data blocks A, B, and C.
Time T1. We create snapshot S1. Data blocks A, B, and C are made immutable and S1 points to A, B, and C.
Time T2. We need to make changes to F1. Since MapR-FS supports full read-write capabilities, any part of the file can be updated. In this example, let’s say the data in block C needs to be updated. Since block C is immutable, we create a separate block C’ and add the changes there. Snapshot S1 still points to data blocks A, B, and C. We can restore or view the contents of file F1 as they were at time T1. Current contents of F1 are held in data blocks A, B, and C’.
Time T3. We create snapshot S2. Data block C’ is made immutable, and snapshot S2 points to A, B, and C’.
We can now restore or view the contents of file F1 at timestamps T1 and T3 using snapshots S1 and S2, respectively. Note that there is no data duplication of blocks A and B between file F1 and snapshots S1 and S2. Only the data blocks that are changed after a snapshot is taken will require incremental storage space.
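The T0 to T3 walkthrough above can be sketched as a small toy model. This is only an illustration of the redirect-on-write idea, not MapR's implementation: the `Volume` class, its method names, and the use of integer block ids are all assumptions made for the example.

```python
class Volume:
    """Toy redirect-on-write model: files and snapshots are lists of block ids."""

    def __init__(self):
        self.blocks = {}      # block id -> data
        self.files = {}       # file name -> list of block ids (live pointers)
        self.snapshots = {}   # snapshot name -> {file name: list of block ids}
        self.frozen = set()   # block ids referenced by at least one snapshot
        self._next_id = 0

    def _new_block(self, data):
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = data
        return block_id

    def write_file(self, name, chunks):
        self.files[name] = [self._new_block(chunk) for chunk in chunks]

    def update_block(self, name, index, data):
        block_id = self.files[name][index]
        if block_id in self.frozen:
            # Redirect: leave the immutable block untouched and point the
            # live file at a freshly allocated block instead.
            self.files[name][index] = self._new_block(data)
        else:
            self.blocks[block_id] = data

    def snapshot(self, snap_name):
        # Only pointers are copied; no block data is duplicated.
        self.snapshots[snap_name] = {f: list(ids) for f, ids in self.files.items()}
        for ids in self.files.values():
            self.frozen.update(ids)

    def read(self, name, snap_name=None):
        pointers = self.files if snap_name is None else self.snapshots[snap_name]
        return [self.blocks[i] for i in pointers[name]]


vol = Volume()
vol.write_file("F1", ["A", "B", "C"])  # T0: F1 uses blocks A, B, C
vol.snapshot("S1")                     # T1: A, B, C become immutable
vol.update_block("F1", 2, "C'")        # T2: C is frozen, so C' is written elsewhere
vol.snapshot("S2")                     # T3: C' becomes immutable
```

After these four steps the model holds four blocks in total (A, B, C, and C'), and both snapshots share the same block ids for A and B with the live file, which mirrors the "no duplication" point made above.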
Another important fact is that in MapR, snapshots are implemented directly in the storage layer in an efficient and fast way. Any application that saves data in MapR benefits from snapshots out of the box. Moreover, since MapR Snapshots are atomic and consistent, applications see exactly the view of the data at the time that the snapshot was taken. This is not true on HDFS-based data platforms, as will be explained in a section below.
The following sections describe procedures associated with snapshots:
Creating a snapshot (requires MapR Converged Enterprise Edition). You can create a snapshot in an ad hoc manner or use a schedule to automate snapshot creation. Each snapshot created by a schedule has an expiration date that determines how long the snapshot is retained. When you schedule snapshots, the expiration date is determined by the Retain parameter of the schedule.
Viewing the contents of a snapshot. At the top level of each volume is a directory called ".snapshot" containing all the snapshots for the volume. You can view the directory with Hadoop commands (e.g., "hadoop fs -ls") or by mounting the cluster with NFS to view the directory with operating system tools (like the Linux "ls" command).
Viewing a list of snapshots. You can view a list of snapshots for a volume with the volume snapshot list command or with the MapR Control System (MCS).
Removing a snapshot. Each snapshot has an expiration date and time, when it is deleted automatically. You can remove a snapshot before its expiration, or you can preserve a snapshot to prevent it from expiring. You can remove a snapshot with the volume snapshot remove command or with MCS.
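As an illustration of the "viewing the contents of a snapshot" step, the sketch below lists the entries of a volume's ".snapshot" directory through an NFS mount using ordinary file system calls. It is a minimal sketch under stated assumptions: the helper name `list_snapshots` and the example mount path are hypothetical, and the actual mount point depends on how the cluster is mounted.

```python
import os


def list_snapshots(volume_mount_path):
    """Return the sorted entry names in the volume's .snapshot directory.

    volume_mount_path is the path at which the volume is visible through
    the NFS mount (an assumption; adjust it to your environment).
    """
    snapshot_dir = os.path.join(volume_mount_path, ".snapshot")
    if not os.path.isdir(snapshot_dir):
        # No snapshot directory is visible at this path.
        return []
    return sorted(os.listdir(snapshot_dir))


if __name__ == "__main__":
    # Hypothetical mount path, used only for illustration.
    for name in list_snapshots("/mapr/my.cluster.com/projects"):
        print(name)
```

Because snapshots are read-only, the same approach can be used to read files inside a snapshot directory, but not to modify them.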
For more information about using the snapshots functionality through the MapR Control System, please refer to the official MapR documentation.
MapR Snapshots capture data in a precise and consistent state. HDFS and HBase snapshots, in contrast, do not provide consistency and lack many other important capabilities. This means that HDFS and HBase snapshots cannot be trusted for the scenarios described earlier in this document.
HDFS is an append-only file system, so intuitively it appears easy to implement snapshots. However, the separation of data and metadata in HDFS, combined with the NameNode being a bottleneck in the system, makes it difficult or impossible to implement consistent snapshots. As a result, HDFS snapshots took years to implement and are not consistent; hence, they do not work with applications that were not specifically designed to support HDFS snapshots and their limitations.
Applications must be made snapshot-aware by calling a new HDFS API that sends up-to-date file length information to the NameNode (SuperSync/SuperFlush). It is hard to design such an application to work correctly, since the use of SuperSync across a cluster can overwhelm the NameNode, causing the entire cluster to fail or causing other processes to come to a halt. Moreover, applications making use of snapshots must avoid modifying files during the creation of the snapshot.
HDFS snapshots apply only to the metadata (on the NameNode), so they do not work correctly while files are being written. The NameNode has difficulty handling even 1,000 metadata updates per second, so HDFS does not send the file length to the NameNode on every hsync/hflush, since doing so would overwhelm it. The effect of this implementation is that files that are being written continue to change inside the snapshot, meaning the snapshot is not actually a snapshot; it can contain data that was written after the snapshot was taken or can fail to contain data that was written and flushed before the snapshot was taken.
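The effect described above can be illustrated with a toy model in which a snapshot copies only the file lengths known to the metadata server, while flushed data does not update those lengths. This is not HDFS code: `ToyNameNode`, `ToyWriter`, and their methods are hypothetical names, and the model deliberately ignores everything except the recorded file length.

```python
class ToyNameNode:
    """Holds only metadata: the last file length reported for each path."""

    def __init__(self):
        self.lengths = {}
        self.snapshots = {}

    def report_length(self, path, length):
        self.lengths[path] = length

    def snapshot(self, name):
        # A metadata-only snapshot: it copies the recorded lengths,
        # whatever they happen to be at this moment.
        self.snapshots[name] = dict(self.lengths)


class ToyWriter:
    """Writes and flushes data without reporting the new length each time."""

    def __init__(self, namenode, path):
        self.namenode = namenode
        self.path = path
        self.flushed = 0
        namenode.report_length(path, 0)

    def write_and_flush(self, nbytes):
        # Data is durably flushed, but the metadata server is not told.
        self.flushed += nbytes

    def close(self):
        # The length is reported only when the file is closed.
        self.namenode.report_length(self.path, self.flushed)


nn = ToyNameNode()
writer = ToyWriter(nn, "/logs/events")
writer.write_and_flush(100)   # flushed before the snapshot
nn.snapshot("s1")             # records length 0 for /logs/events
writer.write_and_flush(50)
writer.close()                # live length is now 150
```

In this model, snapshot "s1" records a length of 0 even though 100 bytes had already been flushed when it was taken, which corresponds to the "missing data flushed before the snapshot" case.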
Because of the storage semantics, HBase snapshots cannot rely on the underlying HDFS snapshots and need to be built separately. This is unlike MapR Snapshots, where there is a common snapshot capability that applies to all data in the cluster. However, HBase snapshots are also not consistent, as each RegionServer snapshots its own data at different times. HBase snapshots exhibit the same problems as HDFS snapshots in terms of containing transactions committed after the snapshot and missing transactions committed before the snapshot.
The following table compares MapR Snapshots with HDFS and HBase snapshots:
Enterprise data storage solutions have offered the consistent snapshot capability for years, but only MapR offers the same for big data. MapR provides snapshots as first-class citizens of the platform, enabling any kind of application to benefit from them out of the box, with no special application modifications. MapR Snapshots are proven, having been used by customers in production across different verticals since 2011.