Keeping Big Data Containers Lightweight

March 08, 2017 | BY Ted Dunning

In this week’s Whiteboard Walkthrough, Ted Dunning, Chief Application Architect at MapR, explains how to keep big data Docker containers light and agile by moving state into the MapR Converged Data Platform for large-scale data persistence that goes beyond files: you can persist streams and tables, too. That way even Kafka applications can run in stateless containers without having to proliferate Kafka clusters all over.

Here is the full video transcription:

Hi, I'm Ted Dunning, and I'd like to do a quick whiteboard walkthrough about strategies for large-scale container deployments. In particular, large-scale container deployments in the presence of large-scale data. Well, we're going to have two situations. One in which you wind up in difficulties, one in which things work better.

What happens here? Let's ignore the upper part, here. So we have applications running in these circles. These circles are containers. "A" stands for application. Now, what we've done ... what always happens, and this is a very good thing to do ... is we don't want to have data in the application containers. The reason we don't want that is, the life cycle of the data is very different than the life cycle of the application container. The application container needs to come and go, needs to scale up, meaning get more containers, and it needs to change over when we change versions. The data needs to survive all of these transitions. It has a different life cycle. So the conventional answer is to store files in a filer, like a NetApp. And by putting all of the state into files, we leave the state out of the application containers, and we're all happy because the applications are all sprightly and agile, and things like that.
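As a rough illustration of that separation, here is a minimal Python sketch of an application that keeps no state of its own. Everything in it is an assumption made for illustration: the `VisitCounter` class, the `visits.json` file name, and the idea that `state_dir` is a directory mounted from shared storage are hypothetical and do not come from the talk. A temporary directory stands in for the filer.

```python
import json
import os
import tempfile
from pathlib import Path


class VisitCounter:
    """Hypothetical stateless worker: all state lives under state_dir, none in the process."""

    def __init__(self, state_dir):
        # state_dir is assumed to be shared storage mounted into the container.
        self.path = Path(state_dir) / "visits.json"

    def _load(self):
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {"visits": 0}

    def record_visit(self):
        state = self._load()
        state["visits"] += 1
        # Write to a temporary file, then rename, so a crash cannot leave a half-written file.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(state))
        os.replace(tmp, self.path)
        return state["visits"]


def simulate_container_replacement(state_dir):
    """Run two 'containers' one after the other against the same shared directory."""
    first = VisitCounter(state_dir)
    first.record_visit()
    first.record_visit()
    del first  # the first container goes away (new version, rescheduling, ...)
    second = VisitCounter(state_dir)
    return second.record_visit()  # the replacement continues from the stored count


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as shared:
        print(simulate_container_replacement(shared))  # prints 3
```

The second instance picks up where the first left off because the count never lived in the process. That is the property that lets containers come and go while the data survives.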

Well, we're happy for about two weeks. And what happens then is, one of the application DevOps teams says, "We need to connect our microservices with Kafka." Which is a good move, you know? I'm not saying that's a bad thing. Moving things through streams is a good way to connect these large services. So the immediate response is, "Let's just go do it." We can start containers, we can put software in containers, so let's start a little swarm of containers there, and let's stick Kafka in there. Now, immediately what happens is, they store the data, the Kafka data, inside those containers. Well, before we were able to push the state of the applications down into the NetApp, where it lives in files, but you can't do that with Kafka because the bandwidth is too high. So you wind up leaving the state in the containers.

Shortly thereafter, somebody says they want a NoSQL database of some kind, and somebody else wants something else. So you wind up with a bunch of these little swarms with little clusters. They don't glue together well because these solutions typically don't do large-scale multi-tenancy, and you wind up with lots and lots of stateful containers. These stateful containers are problematic because they're heavy. They don't move around; they're not agile. And that mixes the mission of your container farm between applications and data-heavy services, like Kafka.

Now, you could say, "Oh, well, let's just not run these in containers. Let's just run them down here parallel to the NetApp." But that's not a very good solution, either, because the infrastructure team is typically small and very careful, and if you suddenly give them all of these separate clusters, each with their own maintenance and management tools, each with their own management protocols, each with their own care and feeding requirements, it will completely overwhelm them. Not to mention the fact that they're conservative; they don't like changing services, they don't like rolling new versions all the time. So they prefer to have a solid, trusted, single platform.

So those two reasons lead us to having these stateful containers, and that's a very frustrating sort of situation. We can do better; we can do much better. What we can do is have the same orchestration layer, we can have the same applications. They can push their state down into the MapR Converged Data Platform. What that lets them do ... it lets them put streams, files, and tables into the data platform. It satisfies all of the major current needs for persistence. It also scales in bandwidth, and it scales very nicely in terms of management effort. It's one platform, it's one authentication scheme, it's one authorization scheme. So you have one set of users, you have one kind of permissions, you have one universal namespace. So it's very, very easy to manage.
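To make the stream case concrete, here is a toy Python sketch of a stream whose messages and consumer offsets both live in shared storage, so a replacement consumer resumes where the previous one stopped. This is not the MapR or Kafka API. The `SharedLog` class, its file layout, and the `orders` and `billing` names are hypothetical, and the sketch assumes a single process with no locking. A real platform handles concurrency, partitioning, and bandwidth, none of which is modeled here.

```python
import json
import os
import tempfile
from pathlib import Path


class SharedLog:
    """Hypothetical append-only log plus consumer offsets, all under one shared directory."""

    def __init__(self, root, topic):
        self.log = Path(root) / f"{topic}.log"
        self.offsets = Path(root) / f"{topic}.offsets.json"

    def append(self, message):
        # One JSON document per line; the line number acts as the offset.
        with self.log.open("a", encoding="utf-8") as f:
            f.write(json.dumps(message) + "\n")

    def poll(self, group):
        """Return messages this consumer group has not seen yet and commit the new offset."""
        offsets = json.loads(self.offsets.read_text()) if self.offsets.exists() else {}
        start = offsets.get(group, 0)
        lines = self.log.read_text(encoding="utf-8").splitlines() if self.log.exists() else []
        batch = [json.loads(line) for line in lines[start:]]
        offsets[group] = len(lines)
        tmp = self.offsets.with_suffix(".tmp")
        tmp.write_text(json.dumps(offsets))
        os.replace(tmp, self.offsets)  # the committed offset lives outside the consumer
        return batch


def demo(root):
    producer = SharedLog(root, "orders")
    producer.append({"id": 1})
    producer.append({"id": 2})
    first_consumer = SharedLog(root, "orders")
    seen = first_consumer.poll("billing")
    producer.append({"id": 3})
    # A brand-new consumer object stands in for a replacement container.
    replacement = SharedLog(root, "orders")
    return seen, replacement.poll("billing")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as shared:
        print(demo(shared))
```

Because neither the messages nor the offsets live in the consumer, the replacement only sees the message appended after the first poll. The container itself stays disposable.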

Something that infrastructure people like a lot is something that's consistent like that. The dev teams actually like that a lot, too, because it relieves them of the burden of most of the tasks associated with governance, because that can be handled at a platform level in a very, very simple, platform-defined, and consistent way. That lets them focus the DevOps teams on developing applications that drive business value.

Now, you can compare. This is simple. It's simple below, where you have a single system to manage, which was designed to scale. And it's simple above, where you have one kind of thing to develop. The alternative, where you wind up pushing a lot of persistence up into the container farm, is difficult to manage, it's difficult to run, and it's frustrating. It's frustrating for all of the teams involved because of these heavy, heavy containers with lots of data in them. You can guess which I'd prefer. Not this one. I like simple; I'm kind of lazy that way. But this really works; stable as a rock, easy to develop for.

So that's my view on how large container farms should be managed. My name's Ted Dunning, thank you very much.