Distributed Deep Learning on the MapR Converged Data Platform

This is the third installment in our blog series about deep learning. In this series, we discuss deep learning technology, the available frameworks and tools, and how to scale deep learning using big data architecture. Read Part 1 and Part 2.

Introduction

Deep learning is a class of machine learning algorithms that learn multiple levels of representation of the data through message passing and derivation, cascading many layers of nonlinear processing units. Recently, the deep learning field has gained a lot of traction, thanks to research breakthroughs made by commercial entities in the tech field and to the general advancement of parallel computing performance. Quite a few deep learning applications have surpassed human performance; famous use cases in the field include AlphaGo, image recognition, and autonomous driving.

In most cases, deep learning applications are developed on a single DevBox with multiple GPU cards installed. In some larger organizations, dedicated High Performance Computing (HPC) clusters are used to develop and train deep learning applications. While these setups are more likely to achieve better computation performance, they lack fault tolerance and create issues when moving data across different DevBoxes or clusters.

Distributed Deep Learning Quick Start Solution

The MapR Converged Data Platform provides the only state-of-the-art distributed file system in the world. With MapR File System (MapR-FS), you gain a unique opportunity to put your deep learning development, training, and deployment closer to your data. MapR leverages open source container technology, such as Docker, and orchestration technology, such as Kubernetes, to deploy deep learning tools, like TensorFlow, in a distributed fashion. Meanwhile, since MapR-DB and MapR Streams are also tied closely to our file system, if you develop a deep learning application on MapR, it is convenient to deploy your model by extending our MapR Persistent Application Client Container (PACC). This lets the model harness the distributed key-value storage of MapR-DB and the streaming technology of MapR Streams for different use cases. Click here if you want to learn more.

Picture1

The distributed deep learning Quick Start Solution we propose has three layers (see Figure 1 above). The bottom layer is the data layer, which is managed by the MapR File System (MapR-FS) service. You can create dedicated volumes for your training data. We also support many enterprise features like security, snapshots, and mirroring to keep your data secure and highly manageable in an enterprise setting.

The middle layer is the orchestration layer. In this example, we propose using Kubernetes to manage the GPU/CPU resources and to launch the parameter servers and training workers for deep learning tasks as pods. Starting with Kubernetes 1.6, you can manage cluster nodes with multiple GPU cards; you can also manage a heterogeneous cluster, using CPU nodes to serve the model and GPU nodes to train it. You can even go a step further and label nodes by GPU card type, so that lower-priority tasks run on older GPU cards and higher-priority tasks run on newer ones.
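
To make the node-labeling idea concrete, here is a minimal sketch that builds a pod manifest as a plain Python dictionary. The label key `gpu-generation`, its values, and the priority mapping are hypothetical names chosen for illustration. The GPU resource name is the one we believe was used around Kubernetes 1.6; newer clusters typically expose a different name, so treat it as an assumption to verify against your cluster.

```python
import json

# Resource name for NVIDIA GPUs in the Kubernetes 1.6 era (assumption).
# Clusters running the NVIDIA device plugin usually expose
# "nvidia.com/gpu" instead, so adjust this for your cluster.
GPU_RESOURCE = "alpha.kubernetes.io/nvidia-gpu"

# Hypothetical mapping from task priority to a node label value.
PRIORITY_TO_GPU_LABEL = {"high": "newer", "low": "older"}


def training_pod(name, image, priority, gpus=1):
    """Build a pod manifest pinned to nodes labeled for the given priority."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # "gpu-generation" is an illustrative label key, applied with
            # `kubectl label nodes <node> gpu-generation=<value>`.
            "nodeSelector": {"gpu-generation": PRIORITY_TO_GPU_LABEL[priority]},
            "restartPolicy": "Never",
            "containers": [{
                "name": name,
                "image": image,
                "resources": {"limits": {GPU_RESOURCE: gpus}},
            }],
        },
    }


if __name__ == "__main__":
    pod = training_pod("tf-worker0", "tensorflow/tensorflow:latest-gpu", "low")
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(pod, indent=2))
```

A node selector is the simplest way to express this placement; node affinity rules would be an alternative if you need softer preferences rather than hard constraints.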

The top layer is the application layer, where we use TensorFlow as the deep learning tool. With the high performance NFS features of MapR-FS, it is easy to have TensorFlow checkpoint the deep learning variables and models so that they persist in the MapR file system. This makes it easy for you to inspect the TensorFlow training process, collect the trained models, and put them into deployment. The advantage of using container technology in the application layer is that we can control the versions of the deep learning model by controlling the metadata of the container images. We can package the trained model into a Docker image and use image tags as metadata to keep the version information; all the dependencies and libraries are already installed in the container image. When deploying a deep learning model, we just have to specify which version we want to deploy, and there is no need to worry about dependencies.
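
As an illustration of tag-based versioning, the sketch below composes an image reference from a model version and a framework version. The naming scheme (`<model version>-tf<framework version>`) and the registry path are hypothetical; the only rule taken from Docker itself is the tag grammar checked by the regular expression.

```python
import re

# Docker tag grammar: at most 128 characters from [A-Za-z0-9_.-],
# and the first character may not be '.' or '-'.
_TAG_RE = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}")


def model_image_ref(repository, model_version, framework_version):
    """Return "<repository>:<tag>" for a packaged model (naming is illustrative)."""
    tag = "{}-tf{}".format(model_version, framework_version)
    if not _TAG_RE.fullmatch(tag):
        raise ValueError("invalid image tag: {!r}".format(tag))
    return "{}:{}".format(repository, tag)


if __name__ == "__main__":
    # Hypothetical registry and model name.
    print(model_image_ref("registry.example.com/dl/mnist-cnn", "v3", "1.1.0"))
```

Deploying a specific model version then comes down to referencing the matching image tag in the pod specification.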

Picture2

There are typically 5 steps to get your deep learning application running on our proposed Quick Start Solution.

  1. Modify the TensorFlow application to add the distributed server. There are a number of ways to enable data parallelism in TensorFlow; synchronous training with between-graph replication is generally the more practical approach (click here for more information). We can, for example, add a code snippet like:

    import tensorflow as tf

    # Each job name maps to a list of "host:port" addresses.
    cluster = tf.train.ClusterSpec({"ps": ["tf-ps0:2222", "tf-ps1:2222"],
      "worker": ["tf-worker0:2222", "tf-worker2:2222"]})
    server = tf.train.Server(cluster, job_name='ps', task_index=0)
    

    In this example, the ps/worker hostnames, job_name, and task_index could be passed in through the YAML file used to launch the Kubernetes pods. You can also put the code on MapR-FS and mount it to multiple pods when launching the Kubernetes job.

  2. Prepare the training data and load it onto MapR-FS. We recommend creating dedicated MapR volumes for your deep learning applications, so that they can be better managed. Meanwhile, the persistent volume design in Kubernetes makes it possible to share a MapR volume between multiple applications.
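
For step 1, one way to pass the host lists, job name, and task index into the application is through environment variables set in the pod specification. The sketch below assumes four hypothetical variables (`PS_HOSTS`, `WORKER_HOSTS`, `JOB_NAME`, `TASK_INDEX`); none of these names come from TensorFlow or Kubernetes. It only builds and validates the cluster dictionary, so it runs without TensorFlow installed.

```python
import os


def parse_hosts(value):
    """Split "host:port, host:port" into a clean list of addresses."""
    return [h.strip() for h in value.split(",") if h.strip()]


def cluster_from_env(env=None):
    """Read the cluster layout and this task's role from the environment.

    The variable names are illustrative assumptions, not a convention
    defined by TensorFlow or Kubernetes.
    """
    env = os.environ if env is None else env
    cluster = {
        "ps": parse_hosts(env["PS_HOSTS"]),
        "worker": parse_hosts(env["WORKER_HOSTS"]),
    }
    job_name = env["JOB_NAME"]
    task_index = int(env["TASK_INDEX"])
    if job_name not in cluster:
        raise ValueError("unknown job name: {!r}".format(job_name))
    if not 0 <= task_index < len(cluster[job_name]):
        raise ValueError("task index {} out of range for job {!r}".format(
            task_index, job_name))
    return cluster, job_name, task_index


# With TensorFlow 1.x installed, the result could be used like this:
#   cluster, job_name, task_index = cluster_from_env()
#   server = tf.train.Server(tf.train.ClusterSpec(cluster),
#                            job_name=job_name, task_index=task_index)
#   if job_name == "ps":
#       server.join()  # parameter servers block here and serve variables
```

Validating the task index up front gives a clear error when a pod is launched with a role that does not match the host lists.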

Picture3
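
To illustrate how a MapR volume might be shared between pods in step 2, the sketch below generates a persistent volume and a matching claim as Python dictionaries. It assumes the MapR volume is reachable over NFS; the server name, export path, volume name `pv001`, and size are hypothetical, while `pvc001` matches the claim name used in the job's YAML file. Your cluster may expose MapR-FS through a different volume type, in which case the `nfs` section would change.

```python
import json


def shared_volume_manifests(nfs_server, nfs_path, size="100Gi",
                            pv_name="pv001", pvc_name="pvc001"):
    """Return (PersistentVolume, PersistentVolumeClaim) manifests as dicts."""
    access_modes = ["ReadWriteMany"]  # lets several pods mount the volume
    pv = {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": pv_name},
        "spec": {
            "capacity": {"storage": size},
            "accessModes": access_modes,
            "persistentVolumeReclaimPolicy": "Retain",
            # Assumption: the MapR volume is exported through an NFS gateway.
            "nfs": {"server": nfs_server, "path": nfs_path},
        },
    }
    pvc = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": pvc_name},
        "spec": {
            "accessModes": list(access_modes),
            "resources": {"requests": {"storage": size}},
            "volumeName": pv_name,  # bind to the volume defined above
        },
    }
    return pv, pvc


if __name__ == "__main__":
    # Hypothetical NFS gateway host and MapR-FS path.
    for manifest in shared_volume_manifests("mapr-nfs.example.com",
                                            "/mapr/my.cluster.com/tfdata"):
        print(json.dumps(manifest, indent=2))
```

On clusters with a default storage class, the claim may also need an explicit empty `storageClassName` so that it binds to this pre-created volume instead of triggering dynamic provisioning.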

  3. Choose the container image to use. For example, we use the latest TensorFlow GPU images, but to fully leverage MapR-FS, we recommend building your deep learning image on top of our MapR client container so that it can use MapR-DB and MapR Streams.

  4. Write a YAML file to create a Kubernetes job. We need to mount the required NVIDIA devices and libraries, the TensorFlow application, the destination folder for checkpoints, and the training data location. Here, we can easily create a persistent volume backed by a MapR-FS volume and give multiple pods access to it through the attached persistent volume claim.

         volumeMounts:
         - mountPath: /dev/nvidia0
           name: nvidia0
         - mountPath: /dev/nvidiactl
           name: nvidiactl
         - mountPath: /dev/nvidia-uvm
           name: nvidia-uvm
         - mountPath: /usr/local/nvidia/lib64
           name: lib
         - mountPath: /tfdata
           name: tfstorage001
       volumes:
       - name: tfstorage001
         persistentVolumeClaim:
           claimName: pvc001
       - hostPath:
           path: /dev/nvidia0
         name: nvidia0
       - hostPath:
           path: /dev/nvidiactl
         name: nvidiactl
       - hostPath:
           path: /dev/nvidia-uvm
         name: nvidia-uvm
       - hostPath:
           path: /usr/local/nvidia/lib64
         name: lib
    

    Picture4

  5. Check that the results were persisted to MapR-FS, and deploy the model if the results look satisfactory.
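
For the final check, a small script can confirm that checkpoints actually landed in the mounted volume. The sketch below assumes TensorFlow's default checkpoint file layout, where each checkpoint has an index file named like `model.ckpt-<step>.index`, and that the checkpoint folder sits under the `/tfdata` mount from the YAML file; the subfolder name is hypothetical. TensorFlow's own `tf.train.latest_checkpoint` serves the same purpose when the library is available.

```python
import os
import re

# Matches index files such as "model.ckpt-2000.index" (assumed layout).
_CKPT_RE = re.compile(r"(?P<prefix>.+)-(?P<step>\d+)\.index")


def latest_checkpoint_step(checkpoint_dir):
    """Return (step, checkpoint path prefix) for the newest checkpoint, or None."""
    best = None
    for name in os.listdir(checkpoint_dir):
        match = _CKPT_RE.fullmatch(name)
        if match is None:
            continue
        step = int(match.group("step"))
        if best is None or step > best[0]:
            prefix = "{}-{}".format(match.group("prefix"), step)
            best = (step, os.path.join(checkpoint_dir, prefix))
    return best


if __name__ == "__main__":
    # Hypothetical checkpoint folder inside the mounted MapR volume.
    result = latest_checkpoint_step("/tfdata/checkpoints")
    if result is None:
        print("no checkpoints found")
    else:
        print("latest checkpoint at step {}: {}".format(*result))
```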

Summary

Summing up, we want to note that both Kubernetes and TensorFlow are young projects, and there are definitely ongoing issues in different deployment scenarios. But our experience shows that the MapR Converged Data Platform, with its advanced file system features, makes running distributed deep learning tasks easier and better suited to enterprise environments. With lightweight container technology in place, we believe this is the right approach for various deep learning R&D tasks. There is a lot of potential going forward. And it is truly the thriving open source communities that make such technology available. We want to thank the Kubernetes and TensorFlow communities and encourage more users to contribute.

The distributed deep learning Quick Start Solution that MapR offers gives users the flexibility to choose their own deep learning tool, such as MXNet, Caffe, or PyTorch. By utilizing a parameter server, we can launch the training task in a truly distributed fashion. From a deployment perspective, since the deep learning models are all built into containers, we can easily move a model from the dev environment to the production environment. We can further manage the model versions and dependencies by attaching metadata tags to the container images.


This blog post was published May 23, 2017.