In my previous blog, End-to-End Machine Learning Using Containerization, I covered the advantages of doing machine learning with microservices and how containerization can improve every step of the workflow.
Today, I'd like to talk about an example open source framework called KubeFlow. The KubeFlow infrastructure provides the means to deploy best-of-breed open source systems for machine learning to any cluster running Kubernetes, whether on-premises or in the cloud.
While the early playing field was rife with competitors, like Mesosphere Marathon, Google Kubernetes, Docker Swarm, OpenStack Magnum, and VMware Photon, it's become clear that Kubernetes is now the industry's de facto standard. As a result, an ecosystem of tools has begun to emerge around Kubernetes, much as one did when Hadoop first emerged from Apache.
Like the Hadoop ecosystem in its early days (and, largely, still today), the Kubernetes ecosystem is starting out as a conglomerate of occasionally integrated and interrelated tools intended for use by data scientists and data engineers. The advantage so far for Kubernetes, in this regard, has been the ability to deploy pre-built offerings from container registries, allowing tools to be easily downloaded ('pulled') and deployed on systems, without the traditional install pain of compiling from source that was frequently present in Hadoop ecosystem projects.
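For instance, pulling a pre-built tool from a public registry and launching it on a cluster is only a couple of commands. This is just a sketch: the image tag and deployment name below are illustrative, and it assumes Docker and a kubectl context configured for a running cluster.

```shell
# Pull a pre-built TensorFlow image from a public container registry
docker pull tensorflow/tensorflow:1.10.0

# Launch it on a Kubernetes cluster (name and port are placeholders)
kubectl run tf-notebook --image=tensorflow/tensorflow:1.10.0 --port=8888
```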
And this is sufficient for simple deployments of single containers running isolated processes. But, in most cases, users want to scale workflows up and down, using multiple containers to run parallel processes. Doing this requires templatized offerings and the ability to easily deploy them. The most common way this is managed in Kubernetes is with Helm Charts, Operators, or ksonnet applications: collections of template files (YAML for Helm, Jsonnet for ksonnet) that describe a deployment so that it's reproducible and can be used to generate interconnected pods of containers on demand.
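To make the Helm approach concrete, a chart separates a reusable deployment template from the values that parameterize it. The sketch below is hypothetical (the chart, image, and label names are invented for illustration), but it shows how one value can scale a whole set of pods:

```yaml
# values.yaml (hypothetical chart for a model-serving service)
replicaCount: 3
image: my-registry/model-server:1.0   # placeholder image name

# templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-model-server
spec:
  replicas: {{ .Values.replicaCount }}   # scale up or down by changing one value
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
        - name: model-server
          image: {{ .Values.image }}
```

Because the template is declarative and versioned, the same chart reproduces the same interconnected pods on any cluster.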
What KubeFlow does is make all of this functionality a bit more user-friendly by providing some of the commonly used machine learning projects as pre-built templatized offerings (ksonnet packages) that are pretested to integrate together in one Kubernetes namespace, not unlike our MapR Ecosystem Packs. The initial list is based on a common TensorFlow deployment pattern and has been opened up, or 'democratized,' to support other engines and modules.
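A typical KubeFlow install with ksonnet looks roughly like the following. Take this as a sketch rather than a recipe: it assumes a running Kubernetes cluster and the ks CLI, and the exact registry, package, and prototype names vary between KubeFlow releases.

```shell
# Create a ksonnet application and point it at the KubeFlow registry
ks init my-kubeflow
cd my-kubeflow
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow

# Install the core KubeFlow package and generate its components
ks pkg install kubeflow/core
ks generate kubeflow-core kubeflow-core --namespace=kubeflow

# Deploy everything into the cluster's 'kubeflow' namespace
ks apply default -c kubeflow-core
```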
A few of the offerings available in KubeFlow, though the list is always growing, include JupyterHub for interactive notebooks, TFJob for managing TensorFlow training jobs, and TensorFlow Serving for deploying trained models.
MapR and KubeFlow are a very natural fit. Both are modeled on the concept of a namespace but use it to manage separate and complementary functions.
In MapR, the global namespace is the key to unified data access and allows the joining of data across any divide, whether it be geographical or architectural. The MapR Global Namespace allows read/write access to any dataset to which the user has access, as if it were a local resource. This enables data security and isolation at the user, team, and tenant levels, and MapR-SASL tickets are used to securely authenticate users.
In KubeFlow, a Kubernetes namespace is used to manage cluster compute resources, Kubernetes objects (e.g., pods), and application/job deployments. Namespaces are logical entities that are used to isolate and represent cluster compute resources and jobs at the user and tenant level. Kubernetes Secrets are used to authenticate users and can be set up to synchronize with MapR-SASL tickets for seamless integration with platform security.
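As a rough sketch of what that synchronization can look like, a MapR ticket generated with maprlogin can be stored in a Kubernetes Secret within the tenant's namespace and mounted into pods. The Secret name and key below are illustrative, not a documented contract:

```yaml
# Hypothetical Secret carrying a MapR-SASL ticket for one tenant;
# the ticket file would be generated with maprlogin and base64-encoded.
apiVersion: v1
kind: Secret
metadata:
  name: mapr-ticket-secret
  namespace: kubeflow            # the tenant's Kubernetes namespace
type: Opaque
data:
  CONTAINER_TICKET: BASE64_ENCODED_TICKET   # placeholder, not a real ticket
```

Pods in the namespace can then mount this Secret, so processes authenticate to the MapR platform as the ticketed user without credentials ever appearing in the pod spec.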
In both cases, namespaces are used for access control and to logically isolate tenant processes and data, which is ideal for multi-tenant organizations looking to easily manage security and performance. These namespaces complement and integrate with each other very nicely, leaving the end user with a seamless experience and the DataOps teams with a simple architecture to manage.
Containerization and a microservices architecture are critical across the entire data science workflow from prototyping to monitoring models in production. KubeFlow is a possible solution that does a really nice job of solving administrative and infrastructure problems while still allowing users to select their own tools. And, with MapR, these workflows can benefit from a best-of-breed data platform to speed the time from sandbox to production.
Stay tuned for the next iteration of this blog, where we'll go technical and describe how to get up and running with KubeFlow on MapR and run an experimental workflow from end-to-end. And, stop by the Global Big Data Conference in Santa Clara on August 30, 2018, to see Rachel Silver present this material in her session titled: "A Containerized Approach To Data Science" (4:10PM - 4:50PM, Room 203).