Deep Learning with NVIDIA GPUs + Oracle Cloud Infrastructure + MapR


Using GPUs to train neural networks for deep learning is becoming commonplace. But the cost of GPU servers and the storage infrastructure required to feed GPUs as fast as they can consume data is significant. I wanted to see if I could use a highly reliable, low-cost, easy-to-use Oracle Cloud Infrastructure (OCI) environment to reproduce deep learning benchmark results published by some of the big storage vendors. I also wanted to see if a MapR distributed file system in this cloud environment could deliver data to the GPUs as fast as those GPUs could consume data residing in memory on the GPU server.

For my deep learning job:

  • I trained the ResNet-50 and ResNet-152 networks with the TensorFlow CNN benchmarks (tf_cnn_benchmarks) from tensorflow.org, using a batch size of 256 for ResNet-50 and 128 for ResNet-152.
  • I used an OCI Volta Bare Metal GPU BM.GPU.3.8 instance, with the ImageNet data stored on a five-node MapR cluster running on OCI Dense I/O BM.DenseIO1.36 instances. The 143 GB ImageNet dataset was preprocessed into TFRecord files of around 140 MB each.
  • To simplify my testing, I installed NVIDIA Docker 2 on the GPU server and ran tests from a Docker container.
  • I used MapR's mapr-setup.sh script to build a MapR Persistent Application Client Container (PACC) from the NVIDIA GPU Cloud (NGC) TensorFlow container, so my container had NVIDIA's optimized build of TensorFlow with the required libraries and drivers as well as MapR's container-optimized POSIX client for file access. A sketch of the container launch follows this list.
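
Here's a minimal sketch of how such a container launch looks with NVIDIA Docker 2. The image tag, cluster name, CLDB hosts, and mount path below are placeholders rather than my exact values, and the MapR environment variables are the commonly documented PACC settings; the actual run command comes from the mapr-setup.sh output.

    # Launch the PACC-built NGC TensorFlow image with the NVIDIA runtime.
    # FUSE access (--device /dev/fuse plus SYS_ADMIN) is what lets the
    # container-optimized MapR POSIX client mount the cluster under /mapr.
    docker run --runtime=nvidia -it --rm \
      --cap-add SYS_ADMIN --cap-add SYS_RESOURCE \
      --device /dev/fuse \
      -e MAPR_CLUSTER=my.cluster.com \
      -e MAPR_CLDB_HOSTS=cldb1,cldb2,cldb3 \
      -e MAPR_MOUNT_PATH=/mapr \
      mapr-pacc-tf:latest bash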

Benchmark Execution

First, I ran one benchmark using data in the local file system. This loaded up the Linux buffer cache with all 143 GB of data.

Next, I ran the benchmarks through one epoch against this data with 1, 2, 4, and all 8 GPUs on the server. In the charts below, that's the "Buffer Cache" number.
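
The runs were driven by the tf_cnn_benchmarks script inside the container. A representative ResNet-50 invocation looks something like the sketch below; the exact flags and data path here are illustrative, not a copy of my command line. For ResNet-152, the model flag changes to resnet152 and the batch size to 128, and the command is repeated with --num_gpus set to 1, 2, 4, and 8.

    # One epoch of ResNet-50 at batch size 256 across 8 GPUs, reading
    # ImageNet TFRecords. Point --data_dir at the local copy for the
    # buffer-cache runs or at the MapR mount for the MapR runs.
    python tf_cnn_benchmarks.py \
      --model=resnet50 \
      --batch_size=256 \
      --num_gpus=8 \
      --num_epochs=1 \
      --data_name=imagenet \
      --data_dir=/mapr/my.cluster.com/imagenet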

Then I cleared the buffer cache and re-ran the benchmarks, pulling the data from MapR. Between runs, I also cleared the MapR Distributed File and Object Store caches on each of the MapR servers to make sure I was reading from the physical storage media.
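
Dropping the Linux page cache is the standard way to force the next run back to physical storage; a typical invocation is below. The MapR-FS caches on the storage nodes are cleared separately with MapR's own tooling, which isn't shown here.

    # Flush dirty pages, then drop the page cache, dentries, and inodes
    # so the next run cannot be served from server memory.
    sync
    sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'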

I got some of the best performance numbers I've seen for training these models, and the MapR performance was almost identical to reading the data from memory on the local GPU server.

ResNet-50 Results

I used nvidia-smi, provided in the NGC container, to collect GPU utilization metrics on the 8 GPUs in the server and confirm that the GPUs were working at full speed to process the data. These graphs show the GPU utilization for the 1 GPU and 8 GPU runs, pulling data from MapR.
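
nvidia-smi can log per-GPU utilization to CSV at a fixed interval, which is an easy way to capture this kind of trace; the query flags below are standard nvidia-smi options, though the exact sampling setup behind the charts may have differed.

    # Sample GPU and memory-controller utilization for all GPUs once per
    # second and write it to a CSV file for plotting.
    nvidia-smi --query-gpu=timestamp,index,utilization.gpu,utilization.memory \
               --format=csv -l 1 > gpu_util.csv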

ResNet-152 Results

And the 1 and 8 GPU utilization numbers from nvidia-smi for ResNet-152 were as follows:

For just a few dollars per hour, Oracle Cloud Infrastructure gives you some of the highest-performing NVIDIA GPU-enabled servers, paired with highly available, reliable, and massively scalable MapR storage. Together they let you run machine learning tasks faster and more effectively than comparable storage infrastructure solutions priced orders of magnitude higher.

This blog post was published October 12, 2018.