Even if you haven't had a chance to check out TensorFlow in detail, it's clear that your choice of platform has a big impact just as it does for other machine learning frameworks. The adventure from trial to production involves many intermediate destinations, from feature engineering to model-building to execution and real-time evaluation. Even a model with the most spectacular F1-score is only as good as how effectively you can put it to use helping customers. Questions arise such as: do you need to evaluate against data for offline or online analysis (or both)? Where does the preprocessed (or feature-engineered) data live on its way to TensorFlow? Is there a way to preserve data lineage as it moves through the various stages to support both security concerns as well as easy debugging?
In this post we'll look at how to get a TensorFlow environment running on the MapR sandbox, which, as you'll see in the tutorial, just might be the "ultimate" starting point.
This combination is useful if you are early in the process and want to try out a few examples. You can run the sandbox on a well-equipped laptop and it will expose all of the MapR features so it's easy to envision how your application can evolve from concept to production use.
Follow these steps to build the single-node environment:
Download and run the MapR sandbox.
After starting the sandbox, you should see a banner screen on the console that says something like 'MapR-Sandbox ...'. Press Alt-F2 to select a new virtual terminal. You can run this tutorial from the console directly or via 'ssh' from another machine. This blog post has some good pointers for configuring networking on the sandbox to support ssh from another machine (usually the hosting machine).
Log in as 'root', with password 'mapr', to add the 'mapr' user to the 'sudoers' group as follows:
# echo "mapr ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
After logging out, log in again as the user 'mapr' with password 'mapr'.
Download the Tensorflow-on-MapR installation script.
wget https://git.io/vD8g5 -O tensorflow-sandbox-install.sh
You should now have a file called tensorflow-sandbox-install.sh in the current directory. Run the script as follows:
bash tensorflow-sandbox-install.sh
This script downloads prerequisite packages, including the 'bazel' build tool from Google, which is required to build TensorFlow from source. While the packages build, MapR services may be temporarily disabled and re-enabled to ensure there are enough resources on the virtual machine. If you have more than 8GB of RAM this is not strictly necessary, but it is done as a precaution so the procedure fits all environments. It may take a few minutes to complete on slower systems, or if you have a lot of other things running on the machine.
Congratulations! You should now have a fully functional TensorFlow setup running on MapR. Let's dive into an example.
The tflearn interface to TensorFlow is a convenient API structured similarly to the well-known scikit-learn high-level API. This API is both expressive on its own and helpful in porting existing ML applications to TensorFlow. Additionally, many of the lower-level TensorFlow facilities remain usable from this interface, making tflearn a good way to transition from scikit-learn to developing models of other shapes and sizes.
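To make the comparison concrete, here is a toy (non-TensorFlow) estimator illustrating the construct-then-fit/predict shape that both scikit-learn and tflearn estimators share. The class and its trivial "majority vote" logic are purely illustrative and not part of either library.

```python
from collections import Counter

class MajorityClassifier(object):
    """Toy estimator with the scikit-learn-style fit/predict shape
    that tflearn estimators mirror (illustration only)."""

    def fit(self, x, y):
        # "Training" just memorizes the most common label.
        self._majority = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, x):
        return [self._majority for _ in x]

clf = MajorityClassifier().fit([[0.1], [0.2], [0.3]], [1, 1, 0])
print(clf.predict([[0.4], [0.5]]))  # -> [1, 1]
```

Swapping this toy class for a tflearn DNNClassifier changes the constructor, but the fit/predict/evaluate workflow stays the same.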
We installed tflearn in the above preparation steps, so the sandbox is ready to run our example code.
First, let's enable the TensorFlow environment and download the rest of the files for the tutorial.
scl enable python27 bash
git clone https://github.com/mapr-demos/tensorflow
cd tensorflow
In this Python example we will use a recently released dataset from the US Department of Homeland Security relating to lost baggage claims in United States airports. We will develop a small application that can make predictions based on past yearly data. We will use it to answer the question, "Will this claim be accepted?" You can envision scenarios where this will be valuable: implementing automated claims, predicting the impact of future claims, or even as the basis of a mobile application where customers could get their claims instantly processed. Have you ever had to wait in line to make a damage claim? Wouldn't it be cool to have the whole thing done on your phone in the taxi ride back to the hotel?
DHS has several data sets for prior years. The raw data file for our example, 'claims_2002_2006.csv', is pulled directly from the public DHS site and converted from XLS to CSV in the repo for easy handling.
First we need to preprocess the rows, by handling categorical variables and doing the usual cleaning. Before we do that, however, let's copy the files to MapR-FS so they are replicated, and take a snapshot so we can always reference the original data set. This is a unique capability of MapR in that we can make copy-on-write, application consistent snapshots of any volume.
Let's create a new volume for landing the raw data, pull it from the repo, and take a snapshot of it.
$ maprcli volume create -name claimdata -path /cd
$ cp claims_2002_2006.csv /mapr/demo.mapr.com/cd/
$ maprcli volume snapshot create -snapshotname origdata -volume claimdata
We now have an application-consistent snapshot of the original data so we can refer back to it if needed -- notice we also used a local NFS mount here and didn't use any 'hadoop' commands to ingest the data.
Next, preprocess the data and create the test and training sets by running the included script:
python preprocess.py
You should see something like the following output:
reading input file: /mapr/demo.mapr.com/cd/claims_2002_2006.csv
raw data len: 97231
dropped 11286 rows with invalid values
dropped 1302 very large claims
dropped 0 remaining rows with invalid values
classes: accepted/total: train: 38150/67706 test: 9559/16937
writing training file: /mapr/demo.mapr.com/cd/claims_train.csv
writing test file: /mapr/demo.mapr.com/cd/claims_test.csv
Note that the two classes (after our mapping in preprocess.py) are fairly balanced out of the box, with roughly half of each data set representing one of the two classes. The two classes are 0 and 1, meaning "accepted" or "other" respectively. We did this to simplify the example -- if a claim was in any other case than 'accepted' (for example, still in progress, referred to a contractor claim, etc.) it was considered "not accepted". There are probably some nuances not completely captured here and it's an area for further exploration.
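A hypothetical sketch of the class mapping described above (the real logic lives in preprocess.py and may differ, and the disposition strings here are placeholders): anything other than a fully accepted claim collapses into class 1, "other".

```python
ACCEPTED = "Approve in Full"  # placeholder disposition string

def map_disposition(disposition):
    """Return 0 for an accepted claim, 1 for everything else."""
    return 0 if disposition == ACCEPTED else 1

labels = [map_disposition(d)
          for d in ["Approve in Full", "Deny", "In Review"]]
print(labels)  # -> [0, 1, 1]
```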
Now let's look at the first section of predict.py:
#!/usr/bin/env python
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf
import numpy as np

# Similarly to the example in: https://www.tensorflow.org/tutorials/tflearn/
# we create a model and test on our own TSA Baggage Claims data.

# separated train and test files from MapR-FS
TRAIN = "/mapr/demo.mapr.com/user/mapr/claims_train.csv"
TEST = "/mapr/demo.mapr.com/user/mapr/claims_test.csv"
MODEL_DIR = "/mapr/demo.mapr.com/user/mapr/model"

# load the data sets
training_set = tf.contrib.learn.datasets.base.load_csv_with_header(
    filename=TRAIN, target_dtype=np.int, features_dtype=np.float32)
test_set = tf.contrib.learn.datasets.base.load_csv_with_header(
    filename=TEST, target_dtype=np.int, features_dtype=np.float32)
TensorFlow provides two of its own functions for reading CSV files directly (as if there weren't enough already in Python). These have their own resulting data structures and even their own header format. Refer to the output in preprocess.py for how the header is constructed. In the above code we use these CSV functions to load our train and test sets.
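To show what that header format looks like, here is a tiny CSV written in the expected layout together with a simplified, pure-Python re-implementation of the loading logic (for illustration only; the real function lives in tf.contrib.learn.datasets.base). The first line carries "<n_samples>,<n_features>", and each data row has the target label in the last column, just like the iris file in the TensorFlow tflearn tutorial.

```python
import csv
import numpy as np

# Write a three-row CSV in the expected header format.
rows = [[1.0, 2.0, 0], [3.0, 4.0, 1], [5.0, 6.0, 0]]
with open("/tmp/tiny_claims.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow([len(rows), len(rows[0]) - 1])  # n_samples, n_features
    writer.writerows(rows)

def load_csv_with_header(filename, target_dtype, features_dtype):
    """Simplified sketch of the tf.contrib.learn CSV loader."""
    with open(filename) as f:
        reader = csv.reader(f)
        header = next(reader)
        n_samples, n_features = int(header[0]), int(header[1])
        data = np.zeros((n_samples, n_features), dtype=features_dtype)
        target = np.zeros((n_samples,), dtype=target_dtype)
        for i, row in enumerate(reader):
            target[i] = int(row.pop())  # label is the last column
            data[i] = np.asarray(row, dtype=features_dtype)
    return data, target

data, target = load_csv_with_header("/tmp/tiny_claims.csv",
                                    np.int64, np.float32)
print(data.shape, target.tolist())  # -> (3, 2) [0, 1, 0]
```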
Let's move on to the fun part... the model-building:
dim = training_set.data.shape[1]  # number of feature columns
feature_columns = [tf.contrib.layers.real_valued_column("", dimension=dim)]

# make a 3-layer DNN
classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                            hidden_units=[512, 256, 128],
                                            n_classes=2,
                                            model_dir=MODEL_DIR)

# fit the model on the training data
classifier.fit(x=training_set.data, y=training_set.target, steps=100)
In the above code we create a 3-layer Deep Neural Net (DNN) classifier with a very similar API to what we would do with a scikit-learn classifier. Note that we set model_dir to a path in MapR-FS. One of the benefits of TensorFlow is that the model can be easily saved to a file which you can load later and perform more iterations. This is another spot where MapR-FS snapshots can come in handy, and the ability to use the filesystem as random read-write capable, while fully replicating the data across the cluster, saves a lot of time.
Now that we have our DNN and training data consumed, the last step is to predict 0/1 (accepted/not accepted) values for our held-out test data. We've used an 80/20 split here, with 80% going to training and 20% going to test, which is a typical split.
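As a minimal sketch of how such a shuffled 80/20 split can be produced (the actual logic in preprocess.py may differ):

```python
import numpy as np

rng = np.random.RandomState(42)  # fixed seed for reproducibility
n_rows = 1000
indices = rng.permutation(n_rows)
cut = int(n_rows * 0.8)
train_idx, test_idx = indices[:cut], indices[cut:]
print(len(train_idx), len(test_idx))  # -> 800 200
```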
# print an accuracy report
acc = classifier.evaluate(x=test_set.data, y=test_set.target)['accuracy']
print('accuracy: %2.2f' % acc)
This final piece of code prints the overall accuracy. There are other metrics in the dictionary returned by evaluate(), such as loss, that are worth inspecting as well.
Ready to try it? Run the script:
python predict.py
You should see output similar to the following (you may also see intermediate output from the TensorFlow code as it trains the model).
This can be interpreted as, "for approximately 64% of the data set the claim status was accurately predicted." Not bad for a few minutes' work -- the model is correct nearly two thirds of the time. Can we do better? Some possibilities come to mind: more feature engineering? Building separate models for each airport? Throwing more hardware at the problem? TensorFlow lets you do them all, but we'll leave those for another blog post.
A handful of the core API functions take parameters that specify where to save certain information in files, such as logs for individual processes and checkpointed models. Many of the provided examples point to /tmp directories for this on the local filesystem, but this can lead to a sprawling mess of files when using a cluster. With MapR-FS mounted locally over NFS at each node, you can simply give these files a unique name and write them to the shared filesystem with almost zero performance impact. If a directory is needed, you can just point to a directory starting from /mapr to use the local NFS mount for the cluster. For example, when creating a Supervisor:
# Create a "supervisor", which oversees the training process.
sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                         logdir="/mapr/demo.mapr.com/cd/train_logs",
                         init_op=init_op,
                         summary_op=summary_op,
                         saver=saver,
                         global_step=global_step,
                         save_model_secs=600)
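A minimal sketch of building such a unique per-worker path on the shared mount (the task_index value is assumed to come from something like FLAGS.task_index in a real distributed job):

```python
import os

task_index = 3  # would normally come from FLAGS.task_index
logdir = os.path.join("/mapr/demo.mapr.com/cd/train_logs",
                      "worker-%d" % task_index)
print(logdir)  # -> /mapr/demo.mapr.com/cd/train_logs/worker-3
```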
The same applies to logs that would ordinarily be written on every machine:
logwriter = tf.train.SummaryWriter(logs_path, graph=tf.get_default_graph())
Even though we only scratched the surface of TensorFlow and MapR, you can see from this example that the MapR Data Platform provides an easy place to get started with machine learning, and the platform will be there with you as your application grows into production. All of the code in this tutorial can be found in the mapr-demos/tensorflow repository on GitHub. Have questions? Leave a comment below or talk to us in the community forum.