In the MapR + NVIDIA Reference Architecture paper, I present an overview of deep learning architectures, both containerized and on bare metal, for training models on data stored in a MapR distributed file system. For that paper, the tf_cnn_benchmarks suite ran on an NVIDIA DGX-1 server with the TensorFlow framework, CUDA runtime, NVIDIA libraries, and the MapR FUSE-Based POSIX Client all installed directly on the host operating system. In this blog post, I'll show how to set up and run the TensorFlow framework in an NVIDIA GPU Cloud (NGC) Docker container with a secure MapR client, eliminating the need to install TensorFlow, NVIDIA tools, and MapR components on the GPU server's operating system.
NGC provides a catalog of deep learning framework containers with all necessary dependencies including NVIDIA libraries and CUDA runtime. These containers make it easy to start using frameworks like TensorFlow, Caffe, and many others, without worrying about installing the correct versions of all libraries and framework components.
And MapR provides a Persistent Application Client Container (PACC) with a MapR FUSE-based client for accessing a MapR file system. This Docker container lets you securely access a MapR cluster without the need to install and configure the MapR client on the host operating system.
But I want to combine these two types of containers, so I can run deep learning frameworks with data stored in MapR on a GPU server without the need to install the correct framework dependencies and MapR client. The containers in the diagram below are what I want – NGC TensorFlow containers with a MapR client.
MapR provides a script (mapr-setup.sh) that can be used to build your own PACC, based on an existing container image. It also generates a script to set up the required environment variables to run the container.
Before you start, you'll need to have a GPU server with NVIDIA drivers and NVIDIA-Docker 2 installed. NVIDIA provides documentation here: https://github.com/NVIDIA/nvidia-docker. I'm using a BM.GPU3.8 instance on Oracle Cloud Infrastructure (OCI), where I installed Docker and the NVIDIA components.
And you'll need a MapR cluster with a user account. I'm running MapR 6.0.1 on five OCI BM.DenseIO1.36 instances. Oracle provides a Terraform template to deploy MapR on OCI: https://github.com/cloud-partners/oci-mapr.
If you don't already have an NGC account, create one at https://ngc.nvidia.com/. Select the "Generate API Key" button at https://ngc.nvidia.com/configuration/api-key. You'll need this key to log in to the NVIDIA Cloud Registry to pull the TensorFlow Docker image.
From the GPU server, log in to the NVIDIA Cloud Registry and pull the container you want to use. You must select a container image that is compatible with your installed NVIDIA driver. I use the nvidia-smi command to check the installed driver version.
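As a rough sketch, the login and pull steps can be written in Python as the argument lists one would hand to `subprocess.run`. The registry host `nvcr.io`, the literal user name `$oauthtoken`, and the `nvidia/tensorflow` repository path follow NGC's published conventions. The helper names are illustrative only, and nothing here actually contacts the registry.

```python
import shlex

NGC_REGISTRY = "nvcr.io"


def ngc_image(repository, tag, registry=NGC_REGISTRY):
    """Build a full image reference from registry, repository, and tag."""
    return "{}/{}:{}".format(registry, repository, tag)


def login_command(registry=NGC_REGISTRY):
    # NGC expects the literal user name $oauthtoken; the API key is the
    # password, supplied on stdin so it does not appear on the command line.
    return ["docker", "login", registry,
            "--username", "$oauthtoken", "--password-stdin"]


def pull_command(image):
    return ["docker", "pull", image]


image = ngc_image("nvidia/tensorflow", "18.09-py3")
for cmd in (login_command(), pull_command(image)):
    # shlex.quote keeps $oauthtoken from being expanded if pasted into a shell
    print(" ".join(shlex.quote(part) for part in cmd))
```

Printing the commands rather than running them makes it easy to review them first; to execute one, pass the list to `subprocess.run` and feed the API key on standard input for the login step.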
The tensorflow:18.09-py3 image is the latest available as of this writing. Checking the Driver Requirements section of the release notes for this image, I see that it requires NVIDIA driver release 410.xx. That matches my installed driver, so this image will work on my system.
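The driver check can also be scripted. The sketch below reads the driver version through `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and compares its major release number with the one named in the release notes. Treating 410 as a minimum (so that newer drivers also pass) is an assumption, and the function names are illustrative.

```python
import subprocess


def installed_driver_version():
    """Return the driver version reported by nvidia-smi, or None if unavailable."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # nvidia-smi is missing or failed, e.g. on a machine without a GPU
        return None
    lines = result.stdout.strip().splitlines()
    # One line is printed per GPU; they all share the same driver
    return lines[0].strip() if lines else None


def driver_major(version):
    """Extract the major release number, e.g. 410 from a 410.xx version string."""
    return int(version.split(".")[0])


def meets_requirement(version, required_major):
    # Assumption: the release notes' 410.xx is read as a minimum major release
    return driver_major(version) >= required_major


version = installed_driver_version()
if version is None:
    print("nvidia-smi not available on this machine")
else:
    print(version, "ok" if meets_requirement(version, 410) else "too old")
```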
Your Linux user will need to be a member of the docker group to run the Docker commands below.
Now, download and run mapr-setup.sh to create a MapR client container, based on the TensorFlow container we just pulled. When prompted for the "Docker FROM base image name:tag," specify the TensorFlow image we just pulled from NGC. For clarity, I'll give the new image a name containing TF to indicate that it's a TensorFlow image. I'll accept defaults for all other entries. You'll see a lot more output than I'm showing, but after successfully building the Docker image, you'll see the message to edit mapr-docker-client.sh.
From a MapR client or server machine, use maprlogin to get your user ticket and copy it to the GPU server. Be sure to set permissions so that only you can read the ticket. If you happen to have the MapR client installed on your GPU server, you can run maprlogin to create the ticket and place it where you want with the -out option. I'll create my user ticket in my home directory at /home/andy/mapr_ticket.
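A minimal sketch of the permission step, using only the Python standard library: it applies the equivalent of `chmod 600` and then verifies that group and others have no access. It runs against a throwaway temporary file here; on a real system you would pass the ticket path instead. The function names are illustrative.

```python
import os
import stat
import tempfile


def restrict_to_owner(path):
    """Equivalent of `chmod 600 path`: owner may read and write, nobody else."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


def is_owner_only(path):
    """True when neither group nor others have any permission bits set."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


# Demonstrated on a temporary stand-in for the ticket file
with tempfile.NamedTemporaryFile(delete=False) as handle:
    demo_ticket = handle.name
os.chmod(demo_ticket, 0o644)       # start out world-readable
restrict_to_owner(demo_ticket)
print(is_owner_only(demo_ticket))
os.remove(demo_ticket)
```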
Modify the environment variables in the mapr-docker-client.sh script that was just generated by mapr-setup.sh, so the container will be able to access the MapR cluster. I'm showing the settings for my cluster. Yours will be different.
If your Linux user identity on the GPU server is the same as your user identity on the MapR cluster, mapr-docker-client.sh will set your Linux user identity in the container. However, if your user identity on the GPU server differs from your identity on the MapR cluster (and in your MapR ticket), you'll need to specify those to the container. I set up my user identity on the GPU server differently from that on the MapR cluster, so I need to set these environment variables to match my identity on the MapR cluster.
Finally, specify the NVIDIA-Docker runtime. If desired, you can also specify other parameters to the Docker run command with MAPR_DOCKER_ARGS. Make sure you edit the first instance of this variable in the mapr-docker-client.sh script.
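To keep the edits in one place, the values can be collected and rendered as shell assignments, as in the sketch below. Only `MAPR_DOCKER_ARGS`, the `--runtime=nvidia` flag from NVIDIA-Docker 2, and the ticket path come from the steps above. The other variable names are assumptions about what the generated script contains and should be checked against your own copy, and every cluster, host, user, and ID value is a hypothetical placeholder.

```python
# (name, value) pairs; a list keeps the order stable on every Python version
SETTINGS = [
    ("MAPR_CLUSTER", "my.cluster.com"),             # hypothetical cluster name
    ("MAPR_CLDB_HOSTS", "cldb1.example.com"),       # hypothetical CLDB host
    ("MAPR_MOUNT_PATH", "/mapr"),                   # assumed mount path
    ("MAPR_TICKET_FILE", "/home/andy/mapr_ticket"), # ticket created earlier
    # Identity on the MapR cluster, needed only when it differs from the
    # identity on the GPU server; all four values are hypothetical
    ("MAPR_CONTAINER_USER", "clusteruser"),
    ("MAPR_CONTAINER_UID", "5000"),
    ("MAPR_CONTAINER_GROUP", "clustergroup"),
    ("MAPR_CONTAINER_GID", "5000"),
    # Ask Docker to use the NVIDIA runtime so the container can see the GPUs
    ("MAPR_DOCKER_ARGS", "--runtime=nvidia"),
]


def render(settings):
    """Format the settings as double-quoted shell variable assignments."""
    return "\n".join('{}="{}"'.format(name, value) for name, value in settings)


print(render(SETTINGS))
```

The output is meant to be compared by hand against the variables in mapr-docker-client.sh, not pasted in blindly, since the generated script may name or order them differently.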
Now I can start up the container with the mapr-docker-client.sh script. If I don't pass a command to the script, I get a shell prompt within the container. The prompt shows my user name and hostname. Inside the container, the hostname is the Docker container ID, as shown by the docker ps command.
The MapR file system can now be accessed using the mount path and cluster name specified previously.
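Because the FUSE client exposes the cluster as an ordinary directory tree, a quick check needs nothing beyond the standard library. This sketch assumes the cluster appears at the mount path followed by the cluster name. The values `/mapr` and `my.cluster.com` are hypothetical placeholders, and the function returns `None` instead of failing when the path is not mounted.

```python
import os


def cluster_root(mount_path, cluster_name):
    """Path at which the cluster's top-level directory is expected to appear."""
    return os.path.join(mount_path, cluster_name)


def list_cluster(mount_path, cluster_name):
    """Return the sorted top-level entries, or None if the path is not mounted."""
    root = cluster_root(mount_path, cluster_name)
    if not os.path.isdir(root):
        return None
    return sorted(os.listdir(root))


# Hypothetical mount path and cluster name; use the values from your own setup
print(list_cluster("/mapr", "my.cluster.com"))
```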
From Python, confirm that a TensorFlow session can be created.
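A minimal sketch of such a check: it builds a one-constant graph and evaluates it in a session. Inside the container (TensorFlow 1.10, per the release notes) `tf.Session` is used directly. The fallback to `tf.compat.v1.Session` and the `ImportError` guard are additions so the snippet also behaves sensibly outside the container, and the function name is illustrative.

```python
def tensorflow_session_check():
    """Create a session and evaluate one constant; None if TensorFlow is absent."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # TensorFlow 1.x exposes tf.Session; newer releases keep it under compat.v1
    session_cls = getattr(tf, "Session", None) or tf.compat.v1.Session
    # Build the constant in an explicit graph so the session can evaluate it
    with tf.Graph().as_default():
        greeting = tf.constant("Hello, TensorFlow")
        with session_cls() as sess:
            return sess.run(greeting)


print(tensorflow_session_check())
```

If session creation succeeds, the function returns the constant's value as a byte string; TensorFlow's own startup log should also list the GPUs it found.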
You can also pass an existing Python program and its parameters directly when starting the container. Here's a command to run the TensorFlow CNN benchmark, which I have installed on my cluster, against ImageNet data that is also stored on the cluster. This will run the Docker container and invoke the benchmark to train the resnet50 network. I'm using the TensorFlow 1.10 compatible version of the benchmark because the release notes for the NGC container show that it contains that version of TensorFlow.
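The shape of such a command can be sketched as follows. The flags `--model`, `--data_name`, `--data_dir`, `--num_gpus`, and `--batch_size` are standard tf_cnn_benchmarks options. The script and data paths are hypothetical placeholders under the cluster mount, the GPU count and batch size are illustrative values, and prefixing the script with `python` is an assumption about how the launcher forwards its arguments.

```python
import shlex


def benchmark_command(launcher, script, data_dir,
                      model="resnet50", num_gpus=8, batch_size=64):
    """Assemble the launcher invocation that runs the benchmark in the container."""
    return [
        launcher, "python", script,
        "--model={}".format(model),
        "--data_name=imagenet",
        "--data_dir={}".format(data_dir),
        "--num_gpus={}".format(num_gpus),      # illustrative value
        "--batch_size={}".format(batch_size),  # illustrative value
    ]


cmd = benchmark_command(
    "./mapr-docker-client.sh",
    # Hypothetical locations of the benchmark script and ImageNet data
    "/mapr/my.cluster.com/benchmarks/tf_cnn_benchmarks.py",
    "/mapr/my.cluster.com/imagenet",
)
print(" ".join(shlex.quote(part) for part in cmd))
```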
Now you can build containers to securely access MapR from any NGC container. I showed how to build a TensorFlow container, but you can use mapr-setup.sh to build a secure MapR client container for Caffe, MXNet, PyTorch, or any of the framework containers provided by the NVIDIA GPU Cloud.