Two blog posts came out recently that share some very interesting perspectives on the blurring lines between the architectures and implementations of different data services, ranging from file systems to databases to publish/subscribe streaming services. The timing couldn’t be better, because we at MapR recently announced our latest wave of convergence, and we’ve learned some key lessons along the way on how to get it right. Before we get to that, let’s review what the two posts had to say.
In the first blog post, Kafka and Confluent, Curt Monash reports on a conversation with Jay Kreps, co-founder and CEO at Confluent: “Jay also views Kafka as something like a file system. Kafka doesn’t actually have a file system-like interface for managing streams, but he acknowledges that as a need and presumably a roadmap item.” Of course, this was just a minor point of the post, but I’m sure it raised a few eyebrows. The second, Kudu as a More Flexible And Reliable Kafka, prototypes a publish/subscribe system implemented on a columnar database in order to achieve greater reliability (through strongly consistent writes) and more flexibility (through modifiable messages and support for many more topics).
So, to summarize, one blog post describes a file system built on a publish/subscribe system, and the other describes a publish/subscribe system built on a database. Interesting, right? If you think for a minute about the architecture of these different systems, it isn’t hard to understand why people are thinking this way: many of the things these systems need overlap, such as:
Once all of these problems are solved in one system, it is extremely tempting to build other systems on top of it to leverage the technology. There is just one problem: impedance mismatch. Typically, when a system is designed, its architecture is over-fit to the type of data being served. Some services get optimized for random read/write, others for sequential access. Some are designed for availability, and others for consistency. Volumes have been written about queues on databases alone.
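To make the queue-on-a-database mismatch concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `queue` and the helpers `make_queue`, `enqueue`, and `dequeue` are made up for this illustration; they are not taken from either blog post or from any product.

```python
import sqlite3


def make_queue(conn):
    # Hypothetical schema: an auto-incrementing id stands in for arrival order.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS queue ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
    )


def enqueue(conn, payload):
    # Each message is an ordinary row insert, committed on its own.
    with conn:
        conn.execute("INSERT INTO queue (payload) VALUES (?)", (payload,))


def dequeue(conn):
    """Pop the oldest message, or return None if the queue is empty."""
    with conn:
        row = conn.execute(
            "SELECT id, payload FROM queue ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # Consuming a message means deleting a row: a sequential workload
        # is being served by lookup-and-delete operations.
        conn.execute("DELETE FROM queue WHERE id = ?", (row[0],))
        return row[1]


conn = sqlite3.connect(":memory:")
make_queue(conn)
for msg in ("a", "b", "c"):
    enqueue(conn, msg)

# One more dequeue than there are messages, to show the empty case.
drained = [dequeue(conn) for _ in range(4)]
print(drained)
```

It works, but the shape of the workload fights the storage engine: consumers have to poll for new rows, every read is paired with a delete, and once one consumer pops a message no other subscriber can see it, so publish/subscribe fan-out would need extra bookkeeping on top.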
So how did we avoid this impedance mismatch when developing the MapR Data Platform? Rather than trying to stack data services on top of each other, we spent our first years as a company developing a robust, patented, extremely fast container-based architecture. Because of this architecture, we could build in purpose-optimized data structures and datapaths for files, database tables, and streams, achieving:
Note the difference here. We didn't build a file system and then build a queue on it, or build a database and then build a queue on it. We built a storage platform that supports the common core attributes and then built multiple different higher-level components on that common platform. The result is that you get the best of both worlds: one converged platform, without the painful tradeoffs.
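The layering idea can be sketched in a few lines of Python. This is a toy illustration only, not MapR's implementation: the class names `Core`, `FileView`, `TableView`, and `StreamView` are hypothetical, and the "core" here is just an in-memory dictionary standing in for a real storage layer.

```python
class Core:
    """Hypothetical shared storage layer: named containers, nothing more."""

    def __init__(self):
        self._containers = {}

    def container(self, name):
        # Every service gets its own container from the same core.
        return self._containers.setdefault(name, {})


class FileView:
    """Files: whole values addressed by path."""

    def __init__(self, core):
        self._files = core.container("files")

    def write(self, path, data):
        self._files[path] = bytes(data)

    def read(self, path):
        return self._files[path]


class TableView:
    """Tables: random reads and writes by row key and column."""

    def __init__(self, core):
        self._rows = core.container("tables")

    def put(self, row_key, column, value):
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column):
        return self._rows[row_key][column]


class StreamView:
    """Streams: append-only logs read sequentially by offset."""

    def __init__(self, core):
        self._topics = core.container("streams")

    def append(self, topic, message):
        log = self._topics.setdefault(topic, [])
        log.append(message)
        return len(log) - 1  # offset of the message just written

    def read(self, topic, offset):
        return self._topics[topic][offset:]


core = Core()
files, tables, streams = FileView(core), TableView(core), StreamView(core)

files.write("/logs/app.txt", b"hello")
tables.put("user1", "name", "Ada")
first = streams.append("events", "login")
second = streams.append("events", "click")
```

The point of the sketch is the dependency graph: each view talks to the core directly and keeps a data structure suited to its own access pattern, and none of the three is implemented in terms of another.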
Let’s wrap up with some advice on how to approach building data architectures: