Dataware for data-driven transformation

High-level view of how MapR enables multi-API access to files, tables, and streams


Editor's Note: Direct access to distributed data via multiple standard APIs is a huge advantage in working with data at scale – and that open, multi-API access is one of the things MapR dataware makes possible. In today's Whiteboard Walkthrough video, Ted Dunning, Chief Application Architect at MapR Technologies, describes how MapR's universal path name helps support multi-API access to files, tables, and data streams, all in the same system and the same data platform.

Transcription:

Hi. I'm Ted Dunning from MapR Technologies. And I'd like to talk a little bit about exactly how the universal path name and the multi-API access to files, tables, and streams work in the MapR system. And to that end, I'm going to talk a little bit about why it works that way. The basic idea is that you have a directory structure. It's a very familiar concept from file systems forever. But in this “file system” – the MapR Data Platform – you have directories, as you'd expect. We have another video about how some of those directories may actually be volumes, but we don't need to worry about that here. For this purpose, they all look like directories. But inside these directories, we could have objects, first-class objects as good as any other, that could be streams that exhibit the Kafka API; files that exhibit the HDFS API, or are accessible via NFS, or are accessible via the POSIX API via a FUSE file system; or tables.

Now, that's kind of bizarre, because normally all you find inside of directories are files. Then inside a database somewhere are tables. And then, off in some other kind of cluster, you'll have streams, but here they all live in one directory.

What that means is they all have path names: /data/click/stream (I'm just picking a name for the stream; I couldn't come up with a better one), /data/click/file, /data/click/table. The first part of that is the same. And it's because they inhabit the exact same directory. It's not that multiple... I mean, you can have separate computers that happen to have the same directory names on them, and you could have files in them, but that's not the same as the same directory. And this is the same directory; it has multiple kinds of things in it.
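That shared prefix is the whole point: once streams, files, and tables are all path names, any generic path tooling works on all of them alike. A minimal sketch, using plain Python path handling on the (illustrative) names from above:

```python
import os

# The three hypothetical objects from the example above: a stream, a
# file, and a table, all first-class entries in the same directory.
paths = [
    "/data/click/stream",
    "/data/click/file",
    "/data/click/table",
]

# Because they live in one directory, they share a common parent, and
# ordinary path operations treat them uniformly.
prefix = os.path.commonpath(paths)
print(prefix)  # /data/click

for p in paths:
    assert os.path.dirname(p) == "/data/click"
```

Nothing here is MapR-specific code; the point is precisely that path names need nothing special.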

When it comes time to access these things, there's two ways, two general ways, that that happens. One way is you have your program. (Again, my naming scheme's deficient today.) But your program has a library attached to it. It could be a Kafka library, it could be an HDFS library, it could be a MapR DB library, an OJAI library, but there's a library attached to your program, linked to it. And that library takes your calls from your program and converts those into the MapR RPC wire format.

Now, the conversion can be quite intricate. The reason for the intricacy is largely performance. But simply put, if you have a table, then the access looks like open, get, or put, fancy stuff like find, possibly fancy stuff like update, and then close. That's what the calls to the library look like. Then those are translated into much lower-level calls via RPC over to the MapR cluster.
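The translation-layer pattern itself can be sketched in a few lines. This is a toy model only: the class, the message shape, and the JSON encoding are all invented for illustration, standing in for the much more intricate MapR RPC wire format described above.

```python
import json

class TableClient:
    """Toy 'client library' that turns high-level table calls
    (get/put) into low-level request messages, the way a linked
    library translates calls into a wire format."""

    def __init__(self, path):
        self.path = path      # e.g. "/data/click/table"
        self.sent = []        # stands in for the RPC transport

    def _rpc(self, op, **args):
        # Encode the high-level call as a low-level wire message.
        msg = json.dumps({"op": op, "path": self.path, **args})
        self.sent.append(msg)
        return msg

    def put(self, key, value):
        return self._rpc("put", key=key, value=value)

    def get(self, key):
        return self._rpc("get", key=key)

client = TableClient("/data/click/table")
client.put("row1", {"clicks": 3})
client.get("row1")
print(len(client.sent))  # 2 low-level messages for 2 high-level calls
```

The real library does far more (batching, routing, retries), but the shape is the same: your program speaks get/put, the wire speaks something lower-level.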

Now, inside the MapR cluster, smeared all the way across all the disks, are pieces of all of the objects we have, and those pieces make up the directories, streams, files, and tables. The RPC layer knows exactly which machine it needs to talk to in order to accomplish any particular function. The functions can be directory lookups to try to find the objects, or direct calls to the pieces of the object that you're working on. I might be writing to a piece of a file; I might be reading a few rows of a table. I would know where those pieces are, and the RPCs would go there.
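That "knows exactly which machine" step is, at heart, a placement lookup. A minimal sketch, with an invented chunking scheme and invented node names (the real system's placement is more sophisticated):

```python
# Pretend objects are split into fixed-size pieces, and a placement
# table records which node holds which piece.
CHUNK = 256 * 1024 * 1024          # hypothetical 256 MB pieces
placement = {
    ("/data/click/file", 0): "node-a",   # first piece
    ("/data/click/file", 1): "node-c",   # second piece
}

def node_for(path, offset):
    """Return the node holding the byte at `offset` of `path`."""
    return placement[(path, offset // CHUNK)]

print(node_for("/data/click/file", 100))                # node-a
print(node_for("/data/click/file", 300 * 1024 * 1024))  # node-c
```

Given that lookup, the RPC for "write these bytes" or "read these rows" goes straight to the machine that owns the piece, with no central bottleneck in the data path.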

Now, the second major way that this can happen is I can have a program as before, but the program can be speaking a standard API – a standard file API that goes to the kernel, and then it comes out the other side. It can come out in two different ways. It can come out as an NFS client that goes to an NFS server that talks to the MapR RPC, and we could just hide that. The client and the server are essentially one thing then. Or it could go to something called a FUSE driver – file system in user space (and the authors of that apologize for the fact that the acronym doesn't really quite work). The idea here is that the standard API is accessible just like any kind of Linux file system.

That means this program has nothing special in it. It is completely a standard program. Unlike the program that has a special-purpose library attached to it, this program has nothing more than the standard Linux libraries to use. So it talks to this FUSE driver that, again, talks to the MapR RPC on the wire to the cluster.
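Here's what such a "nothing special" program looks like: standard library only, ordinary open/read/write. Against a real cluster the path would sit under an NFS or FUSE mount point; here a temporary directory stands in so the sketch is self-contained and runnable.

```python
import os
import tempfile

# A local temp directory stands in for a mounted cluster path
# (e.g. something like /mapr/... in a real deployment).
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "file")     # stand-in for /data/click/file

    with open(path, "w") as f:         # plain POSIX-style write
        f.write("click data\n")

    with open(path) as f:              # plain POSIX-style read
        data = f.read()

print(data, end="")  # click data
```

The kernel (plus the FUSE driver or NFS client underneath) does all the translation; the program never knows the bytes live on a distributed cluster.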

Now, these programs could be running in the cluster on machines that are running MapR software, or they could be running outside the cluster. That's the wonder of networking. They could be running in Kubernetes-managed containers or just basically anywhere you like – anywhere that there's access via the network to the cluster, and anytime that they've managed to authenticate themselves. There's a ticket that's involved in each of these steps that's cryptographically signed, and so on, and then cryptographically verified on the server side, in order to validate permissions, and so on. But the basic idea is that there's some sort of translation layer – a library, or a FUSE driver, or an NFS server/client pair – that translates the basic things we want to do into the MapR RPC. The things, of course, on files are 'read these bytes' or 'write those bytes.' That's the basic operation. That's how it works.

Now, the deal is, the reason that it's so important that this works this way is that everything shares a single namespace of these path names, and that means they can uniformly be managed and uniformly be controlled in terms of access. And you can group them and organize them in a consistent way. It's actually not news, but it is news. It's not news because the virtue of this was recognized 40 years ago – 50 years ago, almost – when the Unix hierarchical file system was invented, and even before that with the stuff that inspired it.

The part that is news is that you can apply it to many different kinds of objects. So say I have a process that reads its input from a stream, and a configuration file that teaches it how the topics are arranged, or how to interpret the data, or something like that. They can live together in a directory. This is Programming 101, but the fact that we can put streams and tables into this basic, basic structure, this organizational framework, is really, really big news, because it makes it possible to use very simple programming and management techniques to control and structure programs.

We can also set permissions on higher-level elements, even volume permissions, to control access to these in a completely uniform way, and that's much better than having multiple different ways of expressing permissions. And this is back to the fundamental idea of APIs: if there are common operations – common operations like changing the permission on something – those should be done in a common way. And with this kind of structure, all of the permissions are carried by the directory, not by the object. And so it's done in a completely consistent way.
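The one-place-to-manage idea can be shown with ordinary POSIX permission bits on a directory (MapR's own access-control expressions, such as volume permissions, are richer; this sketch only illustrates setting policy once at the containing level rather than per object).

```python
import os
import stat
import tempfile

with tempfile.TemporaryDirectory() as d:
    # Set the access policy once, on the directory itself. Every kind
    # of object inside -- file, table, stream in the MapR picture --
    # is governed by the same single setting.
    os.chmod(d, 0o750)

    mode = stat.S_IMODE(os.stat(d).st_mode)
    print(oct(mode))  # 0o750
```

One chmod-style call, one policy, regardless of how many different kinds of objects live under that path.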

So there you have it: the basic idea of a hierarchical structure with common path names, implemented by translation layers that go to a distributed data platform, which gives you these organizational powers. That's what we mean when we talk about multi-API access to a MapR cluster.

Thank you very much. I'm Ted Dunning from MapR Technologies.


This blog post was published February 08, 2019.