Speaker: Ted Dunning, Chief Application Architect at MapR
Editor’s Note: Fine-grained control over data access is a key aspect of data management, security and provenance. In this short Whiteboard Walkthrough video, Ted Dunning, Chief Application Architect at MapR Technologies, explains how MapR’s unique management feature – MapR Volumes – lets administrators set access control for files, tables and streams in just one step.
Hi there. I'm Ted Dunning from MapR Technologies. And today, I want to talk about access control using volumes. This is a very, very important thing about the MapR platform. What it lets you do is have an administrator set bounds on what can be done with a whole swath of the file system. When I say file system, I mean files, tables, and streams. It's a generalized data platform, of course. I'll typically slip and say file system just because that's where I come from, but I mean the data platform.
It's pretty common that we have some kind of system that you might draw like this. It's got goes-ins, and it's got goes-outs on the other side. That's the important thing. Here, we've got web, point of sale, and ET, meaning ‘phone home’ kind of data. And they go into some sort of data cleaning ... notice my hands are waving ... and get stored in an archive. I drew a database just to remind everybody that databases, files, and streams all exist in here. And then later, it goes through some sort of arrangement and extraction process – feature extraction is what I mean there – to produce training data. Training data, so you can do some kind of machine learning. And the arrangement: there will be points in time that apply to different people. So you'll want to say, "Oh, just before that moment when I had to make a decision or I could have made a decision, what did I know?" Of course, those moments in time will be different for every different person, and so you need to rearrange the data to align those moments in time for all the different people, with what you knew just before and what the outcome was just after. That's a very typical thing that you do for training data preparation.
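The alignment step described above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the pipeline from the talk: the `(person, timestamp, value)` event layout, the `decision_times` mapping, and the function name `align_to_decision` are all hypothetical.

```python
from collections import defaultdict

def align_to_decision(events, decision_times):
    """Split each person's events into what was known before their own
    decision moment and what happened after it.

    events: iterable of (person, timestamp, value) tuples (hypothetical layout)
    decision_times: dict mapping person -> that person's decision timestamp
    """
    aligned = defaultdict(lambda: {"before": [], "after": []})
    for person, ts, value in sorted(events, key=lambda e: e[1]):
        t0 = decision_times.get(person)
        if t0 is None:
            continue  # no decision moment recorded for this person
        side = "before" if ts < t0 else "after"
        # Store time relative to the decision so every person lines up at 0.
        aligned[person][side].append((ts - t0, value))
    return dict(aligned)

# Made-up sample data from the three kinds of sources in the diagram.
events = [
    ("alice", 10, "web_visit"),
    ("alice", 25, "purchase"),
    ("bob", 5, "pos_sale"),
    ("bob", 8, "phone_home"),
    ("bob", 30, "churn"),
]
decision_times = {"alice": 20, "bob": 9}
training_rows = align_to_decision(events, decision_times)
```

Each person ends up with a "before" list (the features) and an "after" list (the outcome), both expressed relative to that person's own decision moment.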
Now, that block diagram of how the data flows is only part of the story. The rest of the story has to do with how we arrange the data. We will, of course, use some sort of directory structure. That's what you do. Here's an example of a directory structure like that: we’ve got applications there sitting under some place, and we’ve got raw, meaning the raw input; we've got some archive directory, and we've got a training data directory. I haven't shown any details here under the archive directory. And then, we have here three little triangles called web, POS, and ET. Those correspond to the three different sources of data, but there's something different here. Instead of just showing directory structure, I've shown these black triangles with the red stuff on the inside, and I'm using that to express what's called a volume in MapR's data platform.
Now a volume, as far as users and programs are concerned, looks exactly like a directory. But to managers, volumes have extra management superpowers. And the one we're talking about today is how we control access to volumes. So we've got three different volumes there. Commonly, people use volumes for mirroring and snapshots. We've got a couple of volumes here under the training data thing. That's where our output is going to go. This is where the input is going to come from. Here's where the output is going to go. And I've just fleshed out a hypothetical structure underneath the marketing training data volume. Underneath that, there will be multiple directories. And underneath P1, there will be files, tables, and streams. Not exactly like this, I'm just illustrating the idea.
Now the cool thing here is that even though data scientists, data engineers are going to probably ... I'm casting some aspersions here ... they're going to take some liberties about doing stuff to make it easier for them to get their job done. Notably, they might set the permissions on some subdirectory or some file to be world readable. They might own this entire hierarchy as far as they can tell, and they might just open up the permissions on everything in that entire hierarchy. But the volume itself has a separate kind of permission called a volume ACE. That's an access control expression. Because it's a full Boolean expression, we can ‘AND’ the volume ACE with the expression that represents the permissions on each file or directory below it.
What that does: it allows the administrator to bound the permissions on everything below that volume. That bound can be restricted further by people setting tighter permissions down there, but it cannot be expanded. When you have a Boolean ‘AND,’ there's nothing that the second part of that expression can do to make it looser. The bounds expressed by that first part define the outer bounds. And so we have separated who controls what. The cool thing here, then, is we can have a separation of powers. Overall, the benefit that we get here from the system is that we have universal path names. Remember we have streams, files, directories, and tables in the same directory. That means they're in the same volume, and that means the volume ACE applies to everything in there.
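The bounding effect of the Boolean ‘AND’ can be illustrated with a tiny Python model. This is a sketch of the logic only, not MapR's implementation or API: the users, the `GROUPS` table, and the helper names are hypothetical, and each permission is modeled as a plain function from user to True or False.

```python
# Hypothetical users and group memberships, for illustration only.
GROUPS = {"alice": {"marketing"}, "bob": {"engineering"}}

def in_group(group):
    """Build a permission check that passes for members of `group`."""
    return lambda user: group in GROUPS.get(user, set())

def anyone(user):
    """A wide-open, world-readable style permission."""
    return True

def effective_access(volume_ace, file_permission, user):
    """Access requires BOTH the volume ACE and the file's own permission."""
    return volume_ace(user) and file_permission(user)

# The administrator bounds the volume to the marketing group.
volume_ace = in_group("marketing")

# A developer opens a file to everyone; the volume bound still holds.
open_for_alice = effective_access(volume_ace, anyone, "alice")
open_for_bob = effective_access(volume_ace, anyone, "bob")

# A developer can tighten access further inside the bound.
only_carol = lambda user: user == "carol"
tight_for_alice = effective_access(volume_ace, only_carol, "alice")
```

In this model `bob` is denied even though the file is wide open, and `alice` is denied once the file permission is tightened. Whatever the second operand does, the result can only be the same as or narrower than what the volume ACE allows.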
With one step, one configuration control, the administrator has set the allowable maximum permissions on everything in there. The path names refer to all different kinds of things. And so, instead of having to say, "I'm going to set the permissions in HDFS this way, and I'm going to set the permissions in Kafka that way, and I'm going to set the permissions in some kind of database a different way, and I've got different user spaces and different kinds of permissions"... you know you're going to screw up one of those places. It's just not humanly feasible to always get it right. But here, where you have the volume ACE on the volume, you can guarantee that you get it right because you do it in one place, and it's inspectable and controllable. So the universal path names are a big deal here. The volume ACEs are clearly a big deal. They're basically the function that we're talking about. The result is a separation of powers. The administrator sets the limits of what can be done, and then the details of what can be done within those limits are set by the developers, the data scientists. And furthermore, this is an invisible control for the most part. People can go inspect it if they look in the GUI or use the special command line, but for the most part, everything works like an ordinary directory, an ordinary file, and so on. And so the limits on the permissions are not visible to people. They don't have to be set. They don't have to be considered in their day-to-day lives.
And we can furthermore set the volume limits on inputs to be only the things that are allowed to read them, and the volume limits on the output area to be only those things that should write them. What this lets us do is essentially manage the provenance of data: we set it, we control it, and we sort it out now instead of decoding it later. This is a big deal to be able to manage access using volume access controls.
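The idea of separate read limits on input volumes and write limits on output volumes can be modeled roughly as follows. This is a sketch under assumptions, not MapR's API: the mount paths, the service names such as `cleaner` and `feature_extractor`, and the `Volume` class are hypothetical, and only the volume-level bound is modeled (file-level permissions could restrict access further).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Volume:
    mount_path: str
    readers: frozenset   # principals the volume-level limit lets read
    writers: frozenset   # principals the volume-level limit lets write

# Hypothetical layout and service names, for illustration only.
VOLUMES = [
    Volume("/apps/raw/web", frozenset({"cleaner"}), frozenset({"web_ingest"})),
    Volume("/apps/raw/pos", frozenset({"cleaner"}), frozenset({"pos_ingest"})),
    Volume("/apps/training/marketing",
           frozenset({"data_scientist"}), frozenset({"feature_extractor"})),
]

def enclosing_volume(path):
    """Return the volume with the longest mount path containing `path`."""
    matches = [v for v in VOLUMES
               if path == v.mount_path or path.startswith(v.mount_path + "/")]
    return max(matches, key=lambda v: len(v.mount_path), default=None)

def within_volume_limit(principal, action, path):
    """Check only the volume-level bound for a 'read' or 'write' action."""
    volume = enclosing_volume(path)
    if volume is None:
        return False  # this sketch denies paths outside any modeled volume
    allowed = volume.readers if action == "read" else volume.writers
    return principal in allowed
```

Because the same path lookup covers files, tables, and streams alike, one entry per volume is enough to state who may feed data in and who may take it out.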
Thank you very much. This is Ted Dunning from MapR Technologies.