In this week’s Whiteboard Walkthrough, Dale Kim, Sr. Director of Industry Solutions at MapR, describes how MapR addresses the challenge of providing a persistence tier to containers in big data settings. Dale describes a new technology to support Docker containers, the MapR Persistent Application Client Container, or PACC. This lets you deploy containers anywhere, with security enabled, while also providing access to the MapR Data Platform, which includes a NoSQL database and streaming message transport as well as files for the persistence layer.
Here is the full video transcription:
Hi, I'm Dale Kim of MapR Technologies. Welcome to my Whiteboard Walkthrough. In this episode I'd like to talk about how MapR can help you in your containerized infrastructure. Now you might be familiar with hypervisor-based virtualization technologies, and containers work very much the same way. The big advantage with containers is that you don't have to run a guest operating system on top of a server host operating system. They are much more lightweight and much more efficient.
Now some of the benefits of containers in a virtualized environment include the fact that you can isolate system dependencies within the container so that you have your application and container control those. Another advantage is better hardware utilization. You can package your application within a container, deploy that anywhere based on your available resources. Another big advantage is the predictability of deploying applications where you might build it in a test and dev environment but you want to make sure that it runs about the same way in a production server even if they're completely different servers from what you tested on.
Those advantages are pretty clear. One of the big challenges about using containers today is how to deal with persistence, so saving data, creating state and how you get it within a container. That's a problem that many container users have faced. Let's talk about some of the options that are available today.
In this diagram the rectangles represent servers. The circles represent containers. Triangles are applications. The cylinders are storage. In this first model you have applications write to storage within the container. That's a limited approach because if a container goes down you lose all the data that you saved. You lose the state, and that's not a great long-term solution. In this second option, you might have your application write to the local host's disks. That doesn't work very well either, particularly at scale because if you deploy containers across many different nodes, you have your IT ops folks tracking down that state and moving it to the new locations of your containers. A lot of overhead there, and that too does not really work very well.
Another option is to have your hosts attach to a NAS so that your applications within your containers are accessing the NAS and you have that centralized repository, but this tends to add a lot of complexity. You have a lot of these hosts that are attaching to NASes. It kills performance and also, NASes are only good for files. You definitely want other types of persistent stores, not only files, when you're building out your infrastructure. Finally there's an option where some emerging technologies are appearing that help solve, or at least try to solve, the persistence within containers. A lot of complexity, a lot of administration, very early stage, so it remains to be seen how well they will do.
You see all the benefits of containers but you don't want to necessarily go through a lot of compromises to deploy them within your environment. Let's take a look at some of the things that you absolutely want. When deploying your existing applications, you don't want to have to make a lot of code changes. You want to make sure that your environment is consistent with what the application expected when you first wrote it. A lot of times you have applications that are pre-packaged or that someone else wrote so you don't have the opportunity to change them. You want your environment to be cost-effective. You don't want to be spending a lot of money just for the sake of deploying your containers, thus losing out on a lot of the other advantages.
You want the system to be scalable. In many of today's architectures, you'll see growing volumes of data so you want a system that can grow with the growth of data, particularly when these applications are operational in nature and create data. You want the system to be manageable. A lot of times the notion of containerized infrastructure is shot down by the dev ops folks because there's a lot of work involved. You want to make sure that the system is as easy as possible to manage. Finally, you want to be able to have access to other types of stores in addition to files, things like NoSQL databases and streams, because after all, when you're building out a modern data architecture to get more types of insights from your data, you need to be able to store data in the right type of format. This is one area where MapR is particularly good. In fact, all of these things that you want as part of your containerized infrastructure, MapR can help you with.
Let's look over here. As part of our container story, let's talk about two parts. First is the MapR Data Platform. This acts as your persistent store. Not only do we have file access, but we also provide MapR-DB, which represents the JSON NoSQL document database, as well as MapR Streams, for your pub/sub framework. All those capabilities together provide the different types of data, the different types of formats and compute engines that you need as part of your overall infrastructure.
That's on your infrastructure side. On the application side we provide what's known as the MapR Persistent Application Client Container. That allows you to deploy your applications within these containers that easily and automatically have access to persistent store within MapR. The advantage of using the PACC is that you can deploy these anywhere. You can deploy them on your on-premises node. You can deploy them again on a different host, on a different cluster, in a different data center, even across the cloud. You have the flexibility because everything you need in terms of access to your persistence tier is within the PACC.
In addition, some of the advantages of the PACC include the fact that security is enabled so that the transmission to the MapR Data Platform is encrypted, so the communications are protected. In addition to that, you can authenticate at a container level so that you can ensure that your applications are only accessing the data that they're authorized to access. Scalability is a very important trait within MapR, so that if you want to expand your cluster because your volumes of data are growing, you simply provision a server with MapR, add it to the cluster, no extra sharding, no load balancing, you just add that node and you're ready to go.
Of course, finally I had said that you need some of these extra persistent stores as part of your architecture. Think about a microservices architecture where you're simplifying applications, moving from a monolithic type of architecture to more task-specific applications that work together. That's where MapR-DB and MapR Streams are particularly important, where you have files for logging, MapR-DB for storing state. MapR-DB is especially useful when you have evolving, hierarchical and nested data stored in JSON format. Then MapR Streams as a means of communicating between each of those microservices in a very lightweight, reliable way.
Hopefully this Whiteboard Walkthrough gives you a good overview of how we can help with your containerized infrastructure. If you have any questions please leave a comment below. Thanks for watching.