Deploy Distributed Deep Learning QSS on MapR GPU Cluster, Part 1


A Step-By-Step Guide with Kubernetes 1.7 and MapR 5.2.1

Editor's Note: This is the fifth installment in our blog series about deep learning. In this series, we discuss deep learning technology, the available frameworks and tools, and how to scale deep learning using a big data architecture. Read Part 1, Part 2, Part 3, Part 4, and Part 5b.

Preface: This installment is a two-part companion blog to the MapR Distributed Deep Learning Quick Start Solution (QSS). This first half contains the technical instructions for creating the deep learning environment, which is one component of the full offering (which also includes consultation, training, solution design, and execution). In this post, we present the instructions for installing the components and creating the environment; the second half (Deep Learning QSS, Part 2) covers launching and monitoring jobs.

Requirements: You need access to cloud GPU instances to create the deep learning environment. For this blog, AWS was used as an example: we set up three g2.2xlarge nodes as GPU nodes and one m4.2xlarge node as the master node. We used Ubuntu 16.04 here, but the setup works just as well on Red Hat and CentOS. A g2 (or g3) instance costs 0.65 USD per hour, while an m4.2xlarge costs 0.40 USD per hour. For better performance, users can upgrade to p2 or g3 instances with more capable GPUs, or use in-house GPU machines. Azure and Google Cloud offer equivalent GPU instances; users can choose based on their preference.
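
For reference, instances like these can be launched from the AWS CLI. The sketch below is illustrative only: the AMI ID, subnet, and security group are placeholders you would replace with your own (the key pair name mapr-dm matches the key used later in this post).

# illustrative only: launch 3 GPU workers and 1 master (replace the placeholder IDs)
aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type g2.2xlarge --count 3 \
    --key-name mapr-dm --subnet-id subnet-xxxxxxxx --security-group-ids sg-xxxxxxxx
aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type m4.2xlarge --count 1 \
    --key-name mapr-dm --subnet-id subnet-xxxxxxxx --security-group-ids sg-xxxxxxxx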

Introduction

The MapR Distributed Deep Learning QSS combines an enterprise-ready distributed file system with Kubernetes to train and deploy deep learning models at scale on a heterogeneous GPU cluster. Deep learning algorithms are generating truly revolutionary results on historically difficult problems; the challenge for businesses has been reliably training these very complex models and deploying them to realize those results.

Due to the complicated nature of deep learning architectures, a checkpoint-reload strategy is usually used instead of more elaborate fault-tolerance strategies (such as those in Apache Spark). Leveraging the MapR distributed file system, a user can easily provide persistent storage and checkpointing for Kubernetes and TensorFlow. MapR provides an enterprise-grade NFS interface to the MapR Distributed File and Object Store, which works easily with Kubernetes; more information can be found here. We will demonstrate the steps to set up distributed deep learning on the MapR Data Platform.
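
As a small illustration of the checkpoint idea (a sketch only, using the cluster name DLcluster that we configure below and a volume name of our own choosing), a training job simply writes its checkpoints to a directory under the NFS mount, where every node and pod can reach them:

# sketch: create a MapR volume for checkpoints once the cluster below is running
maprcli volume create -name tf-checkpoints -path /tf-checkpoints
# any node with the NFS mount (and any pod that mounts it) sees the same files
ls /mapr/DLcluster/tf-checkpoints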

Install MapR

First, you need to install MapR on the cluster. For simplicity, we put all MapR services on the master node, leaving the GPU nodes free for computation.

#set up clustershell and passwordless ssh

apt-get install -y clustershell screen
vi /etc/clustershell/groups
all: ip-10-0-0-[226,75,189,121].ec2.internal
cldb: ip-10-0-0-226.ec2.internal
zk: ip-10-0-0-226.ec2.internal
web: ip-10-0-0-226.ec2.internal
nfs: ip-10-0-0-[226,75,189,121].ec2.internal
gpu: ip-10-0-0-[75,189,121].ec2.internal
ssh-keygen -t rsa
# ssh into each node once with the AWS key (e.g., to accept host keys)
for i in ip-10-0-0-226.ec2.internal ip-10-0-0-75.ec2.internal ip-10-0-0-189.ec2.internal ip-10-0-0-121.ec2.internal; do ssh -i /home/ubuntu/mapr-dm.pem $i; done
# append the new public key to authorized_keys on every node for passwordless ssh
cat ~/.ssh/id_rsa.pub | ssh -i /home/ubuntu/mapr-dm.pem root@ip-10-0-0-226.ec2.internal 'cat >> .ssh/authorized_keys'
cat ~/.ssh/id_rsa.pub | ssh -i /home/ubuntu/mapr-dm.pem root@ip-10-0-0-75.ec2.internal 'cat >> .ssh/authorized_keys'
cat ~/.ssh/id_rsa.pub | ssh -i /home/ubuntu/mapr-dm.pem root@ip-10-0-0-189.ec2.internal 'cat >> .ssh/authorized_keys'
cat ~/.ssh/id_rsa.pub | ssh -i /home/ubuntu/mapr-dm.pem root@ip-10-0-0-121.ec2.internal 'cat >> .ssh/authorized_keys'
clush -a 'apt-get update -y'

#start to install MapR
clush -a 'apt-get install -y  openjdk-8-jdk'
clush -a "echo never > /sys/kernel/mm/transparent_hugepage/defrag"
clush -a "cat >> /etc/security/limits.conf <<EOL
mapr soft nofile 64000
mapr hard nofile 64000
mapr soft nproc 64000
mapr hard nproc 64000
EOL"
clush -a "groupadd -g 5000 mapr"
clush -a "useradd -g 5000 -u 5000 mapr"
passwd mapr

clush -a " wget -O - https://package.mapr.com/releases/pub/maprgpg.key | sudo apt-key add -"
clush -a "cat >>  /etc/apt/sources.list << EOL
deb https://package.mapr.com/releases/v5.2.1/ubuntu binary trusty
deb https://package.mapr.com/releases/MEP/MEP-3.0/ubuntu binary trusty
EOL"

clush -a 'fdisk -l'
clush -a "cat >> /root/disks.txt << EOL
/dev/xvde
/dev/xvdc
/dev/xvdd
EOL"

clush -a apt-get update -y
clush -g zk apt-get install -y mapr-cldb mapr-zookeeper mapr-webserver
clush -a apt-get install -y mapr-core mapr-fileserver mapr-nfs
clush -a /opt/mapr/server/configure.sh -C `nodeset -S, -e @cldb` -Z `nodeset -S, -e @zk` -N DLcluster -M7 -no-autostart
clush -a "ls /root/disks.txt && /opt/mapr/server/disksetup -F /root/disks.txt"
# make sure the JDK folder (/usr/lib/jvm/java-1.8.0-openjdk-amd64) exists on every node
clush -a "sed -i 's/#export JAVA_HOME=/export JAVA_HOME=\/usr\/lib\/jvm\/java-1.8.0-openjdk-amd64\/jre/g' /opt/mapr/conf/env.sh"
clush -a mkdir -p /mapr
clush -a 'echo "localhost:/mapr  /mapr  hard,nolock" > /opt/mapr/conf/mapr_fstab'

clush -a systemctl start rpcbind
sleep 2
clush -g zk systemctl start mapr-zookeeper
sleep 10
clush -g zk systemctl status mapr-zookeeper
clush -a systemctl start mapr-warden
maprcli node cldbmaster
# register the cluster and apply the license before mounting (see the note below)
clush -a 'mount -o hard,nolock localhost:/mapr /mapr'

Before you mount /mapr, you might want to register your cluster and apply the enterprise trial license; then restart NFS on each node, which you can do through the MCS web interface. To register the cluster, see: https://community.mapr.com/docs/DOC-1679.

At this point, you should have a running MapR cluster. Since we didn't install any ecosystem components, it should be fairly simple and basic. If the /mapr folder is not mounted to the MapR Distributed File and Object Store, run "clush -a 'mount -o hard,nolock localhost:/mapr /mapr'". You should also set the MapR subnet in /opt/mapr/conf/env.sh by adding "export MAPR_SUBNETS=10.0.0.0/24".
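
To sanity-check the installation (a quick verification, not part of the original install steps), list the services running on each node and confirm the NFS mount:

# list the services running on each node
maprcli node list -columns hostname,svc
# confirm that the cluster file system is reachable over NFS
df -h /mapr
ls /mapr/DLcluster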

Install Kubernetes

Next, you need to install the Kubernetes master on the CPU node and the workers on the GPU nodes. With Kubernetes 1.5.2 and earlier, this was a manual procedure. For Kubernetes 1.6 and later, we use kubeadm to configure and spin up the cluster.

clush -a 'apt-get update && apt-get install -qy docker.io'
clush -a 'apt-get update && apt-get install -y apt-transport-https'
clush -a 'curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -'
clush -a "cat > /etc/apt/sources.list.d/kubernetes.list << EOL
deb http://apt.kubernetes.io/ kubernetes-xenial main
EOL"
clush -a apt-get -y update
clush -a "apt-get install -y kubelet kubeadm kubectl kubernetes-cni"
clush -a 'cat >> /etc/systemd/system/kubelet.service.d/10-kubeadm.conf << EOL
Environment="KUBELET_EXTRA_ARGS=--feature-gates=Accelerators=true"
EOL'

clush -a "systemctl enable docker && systemctl start docker"
clush -a "systemctl enable kubelet && systemctl start kubelet"

kubeadm init --pod-network-cidr=10.244.0.0/16  --apiserver-advertise-address=10.0.0.226
cp /etc/kubernetes/admin.conf $HOME/
sudo chown $(id -u):$(id -g) $HOME/admin.conf
export KUBECONFIG=$HOME/admin.conf
echo "export KUBECONFIG=$HOME/admin.conf" | tee -a ~/.bashrc

kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel-rbac.yml
kubectl create -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
kubectl taint nodes --all node-role.kubernetes.io/master-
# run the join command (with the token printed by kubeadm init) on each GPU worker node
kubeadm join --token c44f75.d6a7a3d68d638b50 10.0.0.226:6443
export KUBECONFIG=/etc/kubernetes/kubelet.conf
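
Once the workers have joined, a quick check from the master (using the admin kubeconfig exported above) should show all four nodes:

kubectl get nodes -o wide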

kubectl create -f https://git.io/kube-dashboard
kubectl proxy --port=8005 --accept-hosts='^*$'

Then, on your local machine, use an SSH tunnel to access the Kubernetes dashboard: ssh -N -L 8005:127.0.0.1:8005 UbuntuK, where UbuntuK is defined in your ~/.ssh/config as:

Host UbuntuK
    HostName ip-10-0-0-226.ec2.internal
    User ubuntu
    Port 22
    IdentityFile ~/Documents/AWS/mapr-dm.pem

Then, go to http://localhost:8005/ui to access the dashboard.

Install Nvidia Libraries

Finally, to enable deep learning applications, we need to install the NVIDIA driver, along with CUDA and cuDNN, on all the GPU nodes. The driver version will differ depending on the GPU cards in use.
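
To confirm which GPU card each node carries (and therefore which driver version to download), a quick check with lspci helps:

# list the NVIDIA devices on every GPU node
clush -g gpu 'lspci | grep -i nvidia'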

clush -g gpu 'apt-get -y install build-essential cmake g++'
clush -g gpu "cat >> /etc/modprobe.d/blacklist-nouveau.conf<<EOL
blacklist nouveau
options nouveau modeset=0
EOL"
clush -g gpu update-initramfs -u
# a reboot of the GPU nodes may be needed for the nouveau blacklist to take effect

# on each GPU node, download the NVIDIA driver, CUDA, and cuDNN
wget https://developer.nvidia.com/compute/machine-learning/cudnn/secure/v5.1/prod_20161129/8.0/cudnn-8.0-linux-x64-v5.1.tgz
wget https://developer.nvidia.com/compute/cuda/8.0/Prod2/local_installers/cuda_8.0.61_375.26_linux-run
wget http://us.download.nvidia.com/XFree86/Linux-x86_64/367.57/NVIDIA-Linux-x86_64-367.57.run
# run the two .run installers (NVIDIA driver, then the CUDA toolkit)
bash NVIDIA-Linux-x86_64-367.57.run
bash cuda_8.0.61_375.26_linux-run
# unpack cuDNN into /usr/local and copy its libraries next to the CUDA install
tar -xvf cudnn-8.0-linux-x64-v5.1.tgz -C /usr/local
cp /usr/local/cuda/lib64/libcudnn* /usr/local/cuda-8.0/lib64/.

clush -a "cat >> /root/nvidia_startup.sh <<EOL
#!/bin/bash
/sbin/modprobe nvidia
if [ "$?" -eq 0 ]; then
# Count the number of NVIDIA controllers found.
   NVDEVS=`lspci | grep -i NVIDIA`
   N3D=`echo "$NVDEVS" | grep "3D controller" | wc -l`
   NVGA=`echo "$NVDEVS" | grep "VGA compatible controller" | wc -l`
   N=`expr $N3D + $NVGA - 1`
   for i in `seq 0 $N`; do
        mknod -m 666 /dev/nvidia$i c 195 $i
   done
   mknod -m 666 /dev/nvidiactl c 195 255
else
    exit 1
fi

/sbin/modprobe nvidia-uvm
if [ "$?" -eq 0 ]; then
    D=`grep nvidia-uvm /proc/devices | awk '{print $1}'`
    mknod -m 666 /dev/nvidia-uvm c $D 0
else
    exit 1
fi
EOL"
Execute this bash script on the GPU nodes to set up the NVIDIA devices.
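
For example, using the clustershell group defined earlier, you can run it on all GPU nodes at once:

clush -g gpu 'bash /root/nvidia_startup.sh'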

# on each GPU node, add CUDA to PATH and LD_LIBRARY_PATH
export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

Running nvidia-smi should now give you the GPU info, and we use that info to label the Kubernetes nodes:
kubectl label nodes ip-10-0-0-189 alpha.kubernetes.io/nvidia-gpu-name=GRID_K520
kubectl label nodes ip-10-0-0-75 alpha.kubernetes.io/nvidia-gpu-name=GRID_K520
kubectl label nodes ip-10-0-0-121 alpha.kubernetes.io/nvidia-gpu-name=GRID_K520
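
A quick way to verify the labels is to list the nodes with that label shown as a column:

kubectl get nodes -L alpha.kubernetes.io/nvidia-gpu-name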

At this point, we have a running GPU cluster with MapR and Kubernetes 1.7.

GPU Cluster

If we use kubectl to describe the nodes, we should be able to see the GPU capacity of each node.

For the CPU (master) node, kubectl describe node ip-10-0-0-226 shows:

Capacity:
 alpha.kubernetes.io/nvidia-gpu:        0
 cpu:                                   8
 memory:                                32946584Ki
 pods:                                  110
Allocatable:
 alpha.kubernetes.io/nvidia-gpu:        0
 cpu:                                   8
 memory:                                32844184Ki
 pods:                                  110

For a GPU node, kubectl describe node ip-10-0-0-75 shows:

Capacity:
 alpha.kubernetes.io/nvidia-gpu:        1
 cpu:                                   8
 memory:                                15399284Ki
 pods:                                  110
Allocatable:
 alpha.kubernetes.io/nvidia-gpu:        1
 cpu:                                   8
 memory:                                15296884Ki
 pods:                                  110
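
To confirm that the scheduler will actually place GPU work on these nodes, a pod can request a GPU through the alpha.kubernetes.io/nvidia-gpu resource. The sketch below is hypothetical (the pod name gpu-test and the ubuntu image are our placeholders) and only demonstrates scheduling; running CUDA code inside a container additionally needs access to the host's NVIDIA driver libraries.

# hypothetical test pod requesting one GPU
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: gpu-test
    image: ubuntu:16.04
    command: ["sleep", "300"]
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
EOF
# the pod should land on one of the GPU nodes
kubectl get pod gpu-test -o wide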

Conclusion

To summarize, we have installed an MFS-only MapR cluster (no ecosystem components) to provide the distributed data layer, and we have installed Kubernetes 1.7 as the orchestration layer. We enabled Kubernetes to manage the GPU, CPU, and memory resources on each node in the cluster. In the second half of this blog, we will configure persistent storage to link the MapR file system with Kubernetes pods and demonstrate distributed deep learning through training examples.

If you'd like to understand more about deep learning and its applications with MapR, or to receive a consultation, please send inquiries to sales@mapr.com.

This blog post was published July 27, 2017.