Editor's note: This is the fifth in a series of blog posts on how to build effective AI and machine learning systems. The previous blog post is titled "Data Science vs Computer Science."
AI and machine learning have huge potential value, but in order to deliver on this promise, it’s important to lower the cost of building and operating these systems, including the effort required to build and maintain them. One area where your choices can make a big difference is the cost and effort of managing the data used for machine learning, particularly version control for that data. This blog post takes a look at why data versioning is needed for machine learning and AI systems and how to do it easily and efficiently.
Data processing and data management form a big part of machine learning logistics. As explained in the previous blog post in this series, data is involved early in the process of building a model. Remember, a trained model isn’t just human-designed code: it involves parameter values that are discovered through exposure to training data. For this reason, and because we may need to replicate how a model was trained, it is particularly important to know exactly what data was used to train our models - not just what types of data, but exactly which bits were used in training.
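One lightweight way to pin down "exactly which bits" is to record a content fingerprint of the training data alongside the model. The following is a minimal sketch, assuming the training data lives as files under one directory; the function names `file_digest` and `dataset_fingerprint` are made up for this illustration and are not part of any platform mentioned in this post.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of one file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def dataset_fingerprint(root: Path) -> str:
    """Combine relative paths and per-file digests into a single digest.

    Hypothetical helper: files are visited in sorted order so the result
    does not depend on directory listing order.
    """
    h = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        h.update(path.relative_to(root).as_posix().encode("utf-8"))
        h.update(b"\0")
        h.update(file_digest(path).encode("ascii"))
        h.update(b"\n")
    return h.hexdigest()
```

A fingerprint like this only tells you whether the data changed; it does not preserve the old data. Keeping the old version available is the job of the data platform, as discussed later in this post.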
This need to archive exact versions of data also applies to so-called "held-out" data that is used for evaluation. This test data is usually a specified subset of the original training data, set aside before the model is trained and then used as a way to evaluate model performance including accuracy, as illustrated in the following figure. In this way, the test data is a random part of the actual training data rather than a benchmark data set. That random selection of test data from a training data set adds yet another level of difficulty to replicating our model build process and is yet another reason why data versioning is needed.
A trained model is the result of a learning process in which code interacts with training data. Some data is held out from the training data to be used later to test a trained model. It’s important to know exactly what data was used to train and evaluate each model in order to compare and improve model performance.
It’s immediately obvious from the above figure that there are at least a couple of different data versions during the development of a trained model that should be recorded in some way - the 90% selection of training data used to build the model and the 10% held-out so that it can be used to evaluate the model. But in best practice, this is only part of the picture. This random split of training data into data used to train the model versus data held-out for evaluation is done multiple times to achieve cross validation. Think of it this way: If the 10% selected for testing happened, accidentally, to be very different from most of the data set, your evaluation results would be misleading. You would get a skewed evaluation and not realize it. And this type of skewing can happen very easily. Only completely evenly distributed (homogeneous) data would be guaranteed to avoid that problem, and in the real world, that’s not what you see. Another way to say this is, some rare events must be clustered, or they would not be random.
In order to cope with this potential problem, you can simply repeat the split-train-test process multiple times such that you are using a different subset to train and to test each time. If you choose to hold out 10%, then typically you would repeat the process 10 times, such that all the data is eventually used either as test or training data. This cross-validation process is a way to understand how much variability there is across the data set, and including cross validation in the training phase can substantially improve results.
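The repeated split-train-test process can be sketched in a few lines. This is an illustrative sketch only, using the Python standard library rather than any particular machine learning tool; `k_fold_splits` and its parameters are names chosen for this example. With `k=10`, each fold holds out roughly 10% of the records, and every record is used for testing exactly once.

```python
import random


def k_fold_splits(n_records: int, k: int = 10, seed: int = 0):
    """Return k (train_indices, test_indices) pairs for cross-validation.

    The seed makes the shuffle repeatable, so the same call yields the
    same folds on the same Python version.
    """
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds, like dealing cards.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        splits.append((train, test))
    return splits
```

Recording the seed helps, but it is safer to store the resulting index lists themselves (or a snapshot of the split data), since a seed only reproduces the split if the records and their order are also unchanged.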
But with the cross-validation process, you now have even more information that needs to be versioned. And that’s just what is needed to produce each approach to a trained model. You will likely do this many times, with slight adjustments to the algorithm or different machine learning tools. As you can see, getting models into production calls for a lot of data management and data versioning.
As mentioned in the previous blog post in this series, machine learning and AI are iterative processes. This iterative nature is true not only in the development phase, when multiple models are trained, evaluated, adjusted and retrained, but also after models are deployed in production. Evaluation is an ongoing process: even a well-performing model may not have a long lifespan, because the world will change and, with it, the model’s results may begin to vary. With ongoing evaluation, you’ll be able to tell when to roll a new model into production and possibly retrain the previous model(s).
Machine learning is an iterative process: model programs are trained, evaluated, adjusted and re-trained, re-evaluated until they appear to be fit for production. Even after deployment in production, running models should be monitored and re-evaluated in an ongoing way because as the world changes, even a previously good model may no longer perform at desired levels.
All those iterations also need to be documented, both in terms of code and data. Management can become a huge task unless much of the logistics are pushed down to the data platform.
There are two aspects to versioning relative to machine learning models: versioning code and versioning data. Versioning code is a familiar concept for any software development and with machine learning, it works pretty much as you would expect. Likely you use something like Git to version code and keep a record of when and by whom it was modified:
Code versioning via Git is a familiar way to track code modifications including branches by other developers. Code versioning is an important aspect of model development and management for machine learning.
In addition, you’ll want to keep a record of what you’ve done and where it is stored, especially so that you can repeat what you’ve done or share it with collaborators. Some of this can be done via a notebook tool such as Apache Zeppelin or Jupyter Notebook (both of which use Git under the covers!).
When developing machine learning models, you also need to do the analogous thing with data. The output of the training process can, usually, be versioned via Git, but the training data itself often is too large to make this practical. That’s where having an efficient dataware layer can make a big difference. Versioning training and testing data should be a capability built into your data platform.
An example of technology engineered to handle data versioning is the MapR Data Platform. MapR is a highly scalable, highly available and reliable distributed file system with advanced data management capabilities. Files, tables and a messaging system (event streams) are built into the software as first-class objects. Management of these different data structures is handled conveniently via volumes. A MapR volume is a logical unit that acts as a directory with "super powers". It allows you to apply policies to files, tables, and streams all together as well as serving as the basis for efficient mirroring, within a cluster or across data centers, on premises, and in the cloud.
What do volumes have to do with data version control? The volume is the basis for MapR snapshots, true point-in-time versions of data. A MapR snapshot is not "leaky"; once made, it points to an exact, unchanging version of data.
Data versioning works easily with MapR snapshots. A snapshot is a true point-in-time version of data in a management tool known as a MapR volume. Volumes are essentially directories with "super powers", containing files, tables, and streams.
Because MapR snapshots point to original data rather than being a copy, they aren’t expensive in terms of storage space or time. (If the original data is changed, then the snapshot retains a copy.)
Snapshots in the MapR Data Platform capture incremental file changes by referring to old versions of blocks.
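To make the idea concrete, here is a toy model of a snapshot that refers to old versions of blocks instead of copying data. It is a sketch for intuition only and is not how the MapR Data Platform is implemented; the class `ToyVolume` and all of its methods are invented for this example.

```python
class ToyVolume:
    """Toy copy-on-write store (illustration only, not MapR's design)."""

    def __init__(self):
        self._blocks = {}     # block id -> immutable bytes
        self._live = {}       # file name -> list of block ids
        self._snapshots = {}  # snapshot name -> {file name: tuple of block ids}
        self._next_id = 0

    def _store(self, data: bytes) -> int:
        """Store data under a fresh block id and return that id."""
        block_id = self._next_id
        self._next_id += 1
        self._blocks[block_id] = data
        return block_id

    def write_file(self, name: str, blocks: list) -> None:
        self._live[name] = [self._store(b) for b in blocks]

    def overwrite_block(self, name: str, index: int, data: bytes) -> None:
        # A new block is allocated; the old block stays in place for any
        # snapshot that still refers to it.
        self._live[name][index] = self._store(data)

    def snapshot(self, snap_name: str) -> None:
        # Only the lists of block ids are copied, never the block contents.
        self._snapshots[snap_name] = {
            name: tuple(ids) for name, ids in self._live.items()
        }

    def read(self, name: str, snap_name: str = None) -> bytes:
        if snap_name is None:
            ids = self._live[name]
        else:
            ids = self._snapshots[snap_name][name]
        return b"".join(self._blocks[i] for i in ids)
```

In this toy, taking a snapshot costs only a small amount of bookkeeping, and extra storage is consumed only for blocks that change afterward. Reading through the snapshot name returns the data exactly as it was when the snapshot was taken.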
The combination of these capabilities is important for a data platform that supports machine learning and AI systems: (1) providing direct data access to a wide variety of machine learning tools, without having to copy data out of the distributed storage to another platform; (2) doing data versioning easily via snapshots; and (3) persisting data from containerized applications, such as those orchestrated by Kubernetes. The MapR Data Platform does all of these things, in one system, on-premises or in cloud deployments.
In a recent webinar, "A Guide to Version Control for Machine Learning," Eero Laaksonen and Juha Kiili of the Finnish startup Valohai explained how they build data pipelines that support repeatable machine learning build processes and how they handle data versioning as a key part of the task of managing the build processes.
Using three different real-world examples in the webinar, they showed how each model is associated with metadata that includes how the model was built and, importantly, exactly which data was used to build it. When Valohai’s system runs on the MapR Data Platform, it uses MapR snapshots to keep track of all the versions of the training data that have been used.
In the Valohai system, the entire build process for machine learning is managed, including version control for the training data. Image from the webinar.
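As a rough illustration of what such per-model metadata might contain, the sketch below ties a model to a code version and a data version. It is not Valohai's actual format; the `ModelRecord` class, its field names, and the example values are all hypothetical.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class ModelRecord:
    """Hypothetical record linking a trained model to its code and data."""

    model_name: str
    code_commit: str    # e.g. a Git commit hash (placeholder value below)
    data_snapshot: str  # name of the data snapshot used for training
    split_seed: int     # seed used for the train/test split
    metrics: dict = field(default_factory=dict)


def to_json(record: ModelRecord) -> str:
    """Serialize a record with sorted keys so the output is stable."""
    return json.dumps(asdict(record), sort_keys=True)


# Example values are made up for illustration.
record = ModelRecord(
    model_name="churn-model",
    code_commit="abc1234",
    data_snapshot="training-data-v1",
    split_seed=42,
    metrics={"accuracy": 0.9},
)
```

Storing a record like this next to each trained model means that anyone comparing two models can see whether they differ in code, in data, or in both.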
The webinar also includes a presentation by Ian Downard explaining how MapR snapshots work and providing a demonstration of data version control via snapshots.
To learn more about best practices for AI and machine learning, try these free resources:
Whiteboard Walkthrough video: "How MapR Enables Multi-API Access to Files, Tables, and Streams"
Free on-demand training: Introduction to Artificial Intelligence and Machine Learning