Append-only File System vs. Read-Write File System #WhiteboardWalkthrough

Editor's Note: In this week's Whiteboard Walkthrough, Jim Scott, Director of Enterprise Strategy and Architecture at MapR, talks about the implications of append-only file systems and the impact they have on downstream projects in the Hadoop ecosystem. He demonstrates the concept using HBase, showing how an append-only file system forced HBase to account for certain constraints in order to function as a real-time capable data store.

Here's the transcription:

Update: A couple of points have been added to the transcription to clarify and correct comments I made in the video. I hope these clear up any confusion my original statements may have created.

Hi – I'm Jim Scott, Director of Enterprise Strategy and Architecture at MapR. Welcome to this Whiteboard Walkthrough on the implications of append-only file systems and the downstream projects in the Hadoop ecosystem.

What I'd like to talk to you today about is how an append-only versus a random read-write capable file system impacts downstream projects. I'd like to demonstrate this concept to you using HBase.

HBase is one of the most well-known applications that runs on top of HDFS. HDFS, being an append-only file system, forced HBase to consider certain implications for the functionality of a real-time capable data store.

Now, what we see is when writing records in HBase, these records are going to land in a file in the file system, so as record one is written, record two follows and so on and so forth. Now, this is great when you're appending data. It has very minimal impact, but think about when you want to edit something like record two: I made a mistake, and I really wanted “A” as my value here.

What happens is I need to have this concept of a tombstone. Now, the creators of HBase had to think through this a little bit to make sure that they could account for updates, so the concept of a tombstone is to say that record two in the HBase file is no longer valid. What happens is record two is written in the tombstone file and record two in the HBase file is now invalidated. No longer will it be used as the value for record two. This is fine. From a real-time perspective, you now have the ability to look at the data in the tombstone record and in the HBase table. [Update]

  1. The example I use of updating a record doesn't actually create a tombstone. If the versioning on a column family is set to 1, then performing the update will in fact remove old data during the compaction process, but it doesn't create a tombstone record. Deletion of a record creates an actual tombstone entry.
  2. HBase does not create a physically separate file when a tombstone is created. Tombstones are structured inside the same file as the rest of the data. The compaction process reads the tombstoned record identifiers. I prefer to talk about tombstones as being separate from the data, but that is only because they are really metadata saying that other data no longer exists. Logically it is simpler to think of them this way, but physically they are not separated.
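
To make the two clarifications above concrete, here is a minimal sketch of an append-only store in Python. It is not HBase code: `AppendOnlyStore`, `Cell`, and the sequence counter are hypothetical names invented for illustration, and the in-memory list stands in for a file that can only be added to. An update simply appends a newer version, a delete appends a tombstone into the same log, and a read resolves to the most recent entry for the key.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    """One appended entry (hypothetical layout, not the HBase file format)."""
    key: str
    value: Optional[str]
    seq: int                  # write order; higher means newer
    tombstone: bool = False   # True marks a delete


class AppendOnlyStore:
    """Toy store whose log can only grow, mimicking append-only semantics."""

    def __init__(self) -> None:
        self._log = []   # stands in for the append-only file
        self._seq = 0

    def _append(self, key: str, value: Optional[str], tombstone: bool = False) -> None:
        self._seq += 1
        self._log.append(Cell(key, value, self._seq, tombstone))

    def put(self, key: str, value: str) -> None:
        # An update is just another append; the old version stays in the log.
        self._append(key, value)

    def delete(self, key: str) -> None:
        # A delete appends a tombstone alongside the data, not in a separate file.
        self._append(key, None, tombstone=True)

    def get(self, key: str) -> Optional[str]:
        # Scan newest to oldest; the first match for the key wins.
        for cell in reversed(self._log):
            if cell.key == key:
                return None if cell.tombstone else cell.value
        return None

    def __len__(self) -> int:
        return len(self._log)
```

Writing record two as "B" and then again as "A" leaves both versions in the log, and `get("record2")` returns "A". Calling `delete("record2")` afterwards adds a fourth entry rather than removing anything, which is why some later cleanup step is needed.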

Now, the real problem comes when this tombstone file grows very large. At some point, a compaction process that I'll just label CP will happen between these two files. This compaction process will yield a new HBase file. What will happen is the new HBase file will get created and then the old HBase file will go away. Now, the reason this must be done this way is because HBase, running on HDFS, is append-only. Those semantics can only ever add to the end of the file. It can never go back and edit this record inline.
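
The compaction step described above can be sketched roughly as follows. This is an illustration only, not the HBase algorithm: `compact` is a hypothetical function, records are plain `(key, value, is_tombstone)` tuples in write order, and the sketch assumes only one version per key is kept. The point it shows is that the old data is never edited; a new list (standing in for the new file) is built and the old one is then discarded.

```python
from typing import List, Optional, Tuple

Record = Tuple[str, Optional[str], bool]  # (key, value, is_tombstone)


def compact(records: List[Record]) -> List[Record]:
    """Build a new record list, dropping superseded and deleted entries.

    Assumes records are in write order and only the newest version of
    each key is retained (an assumption made for this sketch).
    """
    latest = {}
    for key, value, is_tombstone in records:
        # Later entries overwrite earlier ones for the same key.
        latest[key] = (value, is_tombstone)

    # Keys whose newest entry is a tombstone are left out entirely,
    # and the tombstone itself is not carried into the new file.
    return [
        (key, value, False)
        for key, (value, is_tombstone) in latest.items()
        if not is_tombstone
    ]


old_file = [
    ("record1", "X", False),
    ("record2", "B", False),
    ("record2", "A", False),   # newer version of record2
    ("record3", "C", False),
    ("record3", None, True),   # tombstone: record3 was deleted
]
new_file = compact(old_file)   # old_file is left untouched
```

Here `new_file` holds two records, `record1` with "X" and `record2` with "A", while `old_file` still holds all five entries until it is thrown away.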

From the perspective of implications for downstream projects, the creators of HBase had to consider all of these fundamental implementation features of HDFS. Now, HDFS was originally created so that website pages could be streamed into it for indexing. In this model, those pages never got updated.

Well, as the ecosystem grew, the need to do things closer to real time grew with it. What I like to use as an example here is that HBase had to implement a typewriter-style semantic, which is as you type on a typewriter, if you make a mistake, you have to start your whole page over again.

This is really the same concept that HBase had to implement. Another side effect is that this compaction process, because it needs to change the fundamental read-write capability of these files, has to lock the table out in some capacity, create the new file, and swap the new file in for the old, all of which is going to cause latency.

This is going to impact production systems, and so if you ever read the documentation for HBase, you'll notice that it has a recommendation that says, “Do not enable compactions in production automatically.” It generally suggests enabling manual compactions. What this means is that you then have the ability to go in and run the compaction process during off hours, when it will have minimal impact.

By comparison, MapR Database will bypass this whole process because it can go straight to this record and edit it inline, since the underlying file system is random read-write capable. All of the layers that are in place to make HBase work with HDFS are stripped out in the design and implementation of MapR Database. This enables lower latency, it enables a faster overall read-write time, and it reduces the total amount of administration the platform requires.
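
For contrast, here is a rough sketch of what "go straight to this record and edit it inline" means on a file system that allows random writes. It says nothing about how MapR Database is actually implemented: the fixed-width record layout, `RECORD_SIZE`, and the helper functions are all assumptions made for illustration, using an ordinary local file opened in Python's `r+b` mode.

```python
RECORD_SIZE = 16  # hypothetical fixed record width in bytes


def _encode(value: str) -> bytes:
    """Pad a value with NUL bytes to exactly RECORD_SIZE bytes."""
    data = value.encode("utf-8")
    if len(data) > RECORD_SIZE:
        raise ValueError("value too long for a fixed-width record")
    return data.ljust(RECORD_SIZE, b"\x00")


def write_records(path: str, values: list) -> None:
    """Create the file with one fixed-width record per value."""
    with open(path, "wb") as f:
        for value in values:
            f.write(_encode(value))


def update_in_place(path: str, index: int, value: str) -> None:
    """Overwrite record `index` without rewriting the rest of the file."""
    with open(path, "r+b") as f:       # read-write, no truncation
        f.seek(index * RECORD_SIZE)    # jump straight to the record
        f.write(_encode(value))


def read_record(path: str, index: int) -> str:
    """Read record `index` and strip the NUL padding."""
    with open(path, "rb") as f:
        f.seek(index * RECORD_SIZE)
        return f.read(RECORD_SIZE).rstrip(b"\x00").decode("utf-8")
```

After `write_records(path, ["X", "B", "C"])`, calling `update_in_place(path, 1, "A")` changes record two to "A" and the file stays the same size. There is no second version, no tombstone, and nothing left over for a compaction to clean up.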

That's all for this Whiteboard Walkthrough on the implications of an append-only file system on the downstream Hadoop ecosystem projects.

If you liked this talk, please feel free to leave some comments. If there are issues or topics you'd like us to consider, please let us know. Don't forget to follow us on Twitter @MapR #WhiteboardWalkthrough. Thank you.

This blog post was published February 04, 2015.
