It was raining, and my dog was bugging me to go for a walk. So I asked, "What kind of dog wants to walk in the rain?" And then decided: I could use image classification to find out! Seems simple enough, right? At the time, I figured it would take long enough to complete that the rain would have passed.
Well, it started easily enough with Google's TensorFlow for Poets tutorial, which walks you through a transfer learning example: it takes a model trained on the massive ImageNet dataset and retrains it on a more specific collection of images. The easiest way to get started is with the TensorFlow Flowers dataset, since many tutorials and examples use this combination, so that is where I began.
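To make the idea of retraining concrete: tutorials of this kind typically leave the pretrained network frozen and fit only a new final layer on top of its cached outputs. The sketch below shows that pattern on toy data in plain Python. Everything in it is a stand-in chosen for illustration (frozen_features plays the role of the pretrained model, and the two-number inputs play the role of images), so treat it as a picture of the technique rather than the tutorial's actual code.

```python
import math


def frozen_features(pixel_pair):
    """Stand-in for the pretrained network: a fixed mapping that is never trained."""
    a, b = pixel_pair
    return [a, b, a * b, 1.0]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_logits(weights, features):
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def train_final_layer(samples, labels, num_classes, steps=500, lr=0.5):
    """Fit only the new output layer with full-batch gradient descent."""
    # Features are computed once and reused, because the base model never changes.
    cached = [frozen_features(s) for s in samples]
    dim = len(cached[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    for _ in range(steps):
        grads = [[0.0] * dim for _ in range(num_classes)]
        for features, label in zip(cached, labels):
            probs = softmax(class_logits(weights, features))
            for c in range(num_classes):
                error = probs[c] - (1.0 if c == label else 0.0)
                for j in range(dim):
                    grads[c][j] += error * features[j] / len(cached)
        for c in range(num_classes):
            for j in range(dim):
                weights[c][j] -= lr * grads[c][j]
    return weights


def predict(weights, sample):
    """Return the index of the highest-scoring class for one input."""
    scores = class_logits(weights, frozen_features(sample))
    return max(range(len(scores)), key=scores.__getitem__)
```

Because the frozen part is only ever run forward, its outputs can be cached once per image. That caching is a large part of why this style of retraining is so much cheaper than training a full network from scratch.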
Deep learning libraries like TensorFlow, Keras, and MXNet have no concept of Hadoop or the HDFS API. The platform advantage of using these libraries against data stored in MapR XD is that they can access the data in place, by mounting the storage into a container with the FUSE/NFS capability. This lets the libraries treat distributed data as if it were stored locally.
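In practice, "as if it were stored locally" means that ordinary file APIs work against the mount point. As a minimal sketch, the hypothetical helper below counts the images in each class subfolder using only the Python standard library. It assumes the one-folder-per-class layout that image retraining scripts commonly expect, and the set of file extensions is also an assumption.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match the dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images_per_class(image_dir):
    """Count image files in each class subfolder using plain file-system calls."""
    counts = {}
    for class_dir in sorted(Path(image_dir).iterdir()):
        if not class_dir.is_dir():
            continue  # ignore stray files at the top level
        counts[class_dir.name] = sum(
            1
            for path in class_dir.iterdir()
            if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts
```

Pointed at a directory under a cluster mount such as /mapr/my.cluster.com, the function needs no changes, since nothing in it knows or cares that the files live in distributed storage.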
This capability is not as widespread as some might guess, and many Data Science Workbenches actually require that you move the data into a containerization framework in the cloud before you can work with it. There are multiple problems with this approach from an IT overhead perspective, but an often overlooked one is that as model complexity increases, the amount of data you need for training goes up multiplicatively. That data can become prohibitively expensive to keep in duplicate silos.
I'm going to jump ahead to the dog image classification example, but if you want to start with building this example on the Flowers dataset, I've provided steps for doing so here. Below is the output from a picture I took on my phone of some dying Valentine's Day flowers, so that I could validate the retrained model with an image that wasn't in the testing data.
The MapR Data Science Refinery provides the ability to use a container as a vehicle for client-side tooling while persisting data to the cluster. This is very useful for somebody like me who wants to get up and running quickly, as I can spin this up with full cluster connectivity in under 2 minutes. And, if I kill the container off or do something silly, it's okay because all of my data and notebooks are safely stored in my global namespace. Personally, I use an AWS edge node for this, but there are many ways to deploy this service:
As I started thinking about how to carry what I'd learned from retraining the ImageNet model on flowers over to dogs, the first thing I noticed was that the Stanford Dogs dataset is much larger than the Flowers dataset: it contains 20,580 images of 120 breeds. This isn't gigantic as far as training datasets are concerned, but it's pretty large (~1GB unzipped), and processing these images would take a really long time using the defaults on my laptop or even on a typical edge node instance.
Luckily, the MapR Data Science Refinery is designed to use FUSE to give the container seamless access to data in MapR XD. This allows the user to meaningfully separate compute from storage while using their global namespace to persist data:
I talk about this a lot, but I finally had a personal experience that really highlighted the value:
- The data is all in my cluster, not my containerized notebook instance.
- The model is in the cluster, not my containerized notebook instance.
- My notebooks are in the cluster, not my containerized notebook instance.
So, why not spin up another container with a GPU to do the (re)training? Sure, in an ideal world, the training would be distributed, but many of the Python libraries coming out don't have a distributed mode yet, and we work with what we have. GPU instances are expensive, but I won't need one for long.
So, I spun up a G3.XLarge instance in Amazon, started my Data Science Refinery container on top of it as an ephemeral training environment, and installed TensorFlow:
sudo -u root pip install tensorflow
Using the following command, I kicked off my training job, leveraging the model and the data stored in my global namespace:
python /mapr/my.cluster.com/user/mapr/retrain.py --image_dir /mapr/my.cluster.com/user/mapr/dogs/dog_photos --output_graph /mapr/my.cluster.com/user/mapr/dogs/ --output_labels /mapr/my.cluster.com/user/mapr/dogs/
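The same invocation can also be assembled programmatically, for example from a notebook cell. This is a minimal sketch: build_retrain_command is a hypothetical helper, and the output file names retrained_graph.pb and retrained_labels.txt are assumptions that do not appear in the command above. They are included on the assumption that --output_graph and --output_labels name files rather than directories.

```python
import posixpath
import shlex


def build_retrain_command(
    user_dir,
    graph_name="retrained_graph.pb",     # assumed file name, not from the post
    labels_name="retrained_labels.txt",  # assumed file name, not from the post
):
    """Build the argument list for the retraining script under a mounted user directory."""
    dogs_dir = posixpath.join(user_dir, "dogs")
    return [
        "python",
        posixpath.join(user_dir, "retrain.py"),
        "--image_dir", posixpath.join(dogs_dir, "dog_photos"),
        "--output_graph", posixpath.join(dogs_dir, graph_name),
        "--output_labels", posixpath.join(dogs_dir, labels_name),
    ]


def as_shell_string(command):
    """Render an argument list as a single, safely quoted shell line."""
    return " ".join(shlex.quote(part) for part in command)
```

Calling as_shell_string(build_retrain_command("/mapr/my.cluster.com/user/mapr")) yields a single line that can be pasted into a terminal, and the unrendered list can be handed to subprocess.run directly. Keeping every path under the mounted namespace is what lets a throwaway GPU container write a model that outlives it.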
On the GPU, the training time dropped from the 4+ hours it was taking on the M4.XLarge instance to a much more reasonable 32 minutes. Since the retrained model has been persisted to my global namespace, I'm free to kill off this container and GPU instance and return to my much more affordable edge node to process images using the model and view the results in my notebook:
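For readers who want to reproduce the last step, turning a model's raw output into a readable ranking takes only a few lines. The helper below is a hypothetical sketch. It assumes the retrained graph has already produced one score per class, and that the labels file lists class names in the same order. Running the graph itself is left out, and no real model output is shown.

```python
def top_predictions(scores, labels, k=5):
    """Pair each class label with its score and return the k highest-scoring pairs."""
    if len(scores) != len(labels):
        raise ValueError("expected exactly one score per label")
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def read_labels(labels_path):
    """Read one class name per line from a labels file, skipping blank lines."""
    with open(labels_path) as handle:
        return [line.strip() for line in handle if line.strip()]
```

In a notebook, the list returned by top_predictions can be printed next to the image to see which breeds the model considers most likely.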
Interesting results, but I wouldn't trust a robotic groomer using this model to tell the difference between her and a poodle. But the thing is, I don't have to. I'm not a data scientist trying to solve a mission-critical use case. This runs well enough for my purposes out of the box, but with a little more effort and time, it could be tuned to improve the results. And that's what Data Scientists do.
But I'm a Product Manager. And the goal for a Data Science-focused product team should be to solve the logistical problems so that Data Scientists can instead spend their time improving model accuracy and generating business insights.
Time to walk the dog.
Here's a video to summarize what we've covered: