A Guide to Version Control for Machine Learning


Speakers:

Eero Laaksonen

CEO, Valohai

Juha Kiili

Senior Software Developer, Valohai

Ian Downard

Senior Developer Evangelist, MapR Technologies


It's a well-known fact that Machine Learning (ML) requires a lot of trial and error. Experimentation is key. The procedures that people use to prepare training data and tune training parameters are highly iterative. To facilitate this kind of software development, you have to track the code, configurations, and data used for ML experiments so you can always answer the question of how a model was trained. However, large training datasets often preclude traditional version control software from being used for this purpose. In these cases, MapR Snapshots provide a highly attractive solution for data versioning.

Watch this on-demand webinar to see how MapR Snapshots work and how they can be used with Valohai, software designed to manage ML experiments with robust version control for models, model configs, and training data.

In this webinar you will learn:

  • How to perform data versioning in files, tables, and/or streams with MapR Snapshots.
  • How to perform ML experiments where you want to vary training parameters but work from a known and unchanging version of training data.
  • How to achieve ML audit compliance, so you’re better able to explain how production models were derived and which data versions you used to test against important things like gender and racial bias.

Transcript

Eero Laaksonen: 00:09 Thank you. Thank you, but yeah, today, we'll be talking about version control for machine learning, and we'll start off with a few case examples to see how version control is needed in the real world when dealing with production machine learning systems. We're going to go through three different examples. First, we'll be talking about a bias example, where we'll look at a problem with potential bias detection and audits on bias testing. Let's say a loan decision has been made by a machine learning system, and then 12 months later, a person requests an audit to see if you did the required minimum steps to test for racial and gender bias in your model before putting it into production.

Eero Laaksonen: 01:17 Then, the second example will be privacy related. Let's say somebody wants to completely opt out of all of your data related to GDPR, which requires you to retrain all of your models that are aggregated on top of private data, so you have to go back and review which models are using that private data. Then, lastly, a correlation problem: let's say we have made an insurance pricing decision, and there are already certain requirements to be able to show correlation between the factors affecting the price and the actual price of the insurance, if that pricing is done on top of a machine learning model. We'll be talking a little bit about how these kinds of things require version control for machine learning.

Eero Laaksonen: 02:31 Firstly, to be able to tackle this, we have to look a little bit at the pipeline of how machine learning is actually done in production today. First of all, you have the data, and then you have some sort of scripts that usually do data transformation. That data is pulled into the feature extraction or data transformation scripts, and the result is saved in some sort of format to be used by some sort of training. There are often multiple different versions of scripts or different types of models that are tested and trained on top of the data. Then, usually, that trained model is tested in some way, and finally, when the model is ready and tested, it is put into production. Then, there are users making requests and getting predictions out of the deployed model.

Eero Laaksonen: 03:46 Now, if we look at this kind of an environment where the data flows first through different scripts and then through testing and goes into production, and we think about the three examples that we had, we can start seeing a pattern in what we should save from a version control perspective to tackle these different examples. If we look at the bias example first, we had a problem where 12 months ago, we made some sort of machine-based decision with a machine learning model about somebody's loan. First of all, we received some sort of request on the production side and made some sort of prediction. We need to be able to go back all the way to the actual model training to see what data we were using and what kind of racial or gender bias testing we did during that step, so we have to be able to go from the model to the actual testing scripts that were run on top of the model, then go back from that to the actual training of the model, what script we used, whether there was some sort of prevention for any kind of bias, and then all the way back to the feature extraction and the raw dataset that we were using at the time, to be able to show in an audit that we actually did the minimum requirements to tackle these issues.

Eero Laaksonen: 05:22 You will need to be able to go from the actual production model all the way back, flowing through this pipeline to your data. Then, the second example is privacy. Now, a user that has given you data wants their data removed from your systems, so you have to be able to go to the actual datasets that contain that person's private data, then flow all the way to the in-production models, retrain all of those models and pipelines, and remove all of the data aggregated on top of that user. That's a different type of example, where we have to be able to go from the data all the way to the in-production models and say, okay, this volume of data that we have from this particular user needs to be deleted, and it's part of this kind of production pipeline for this particular model. So again, we need to be able to go from any dataset all the way to the in-production models and show which models were using that data.

Eero Laaksonen: 06:39 Then, lastly, the correlation example, where again, we made some sort of production decision on pricing in real time for a user, and we have to be able to tell that, okay, 12 months ago or six months ago, this particular version of a model was in production. The story is very similar to the first one: you have to be able to flow back to testing and training, and then in training, be able to show that these were the selected features and that there actually is correlation between the data and all of the features and the actual output of the model.

Eero Laaksonen: 07:20 These are very complicated examples, and there is already regulation out there that you have to be able to tackle. The only way to really do that is to have your pipeline set up very robustly. The problem that you obviously have today is that most of the regulation that's going to be out there in the next two to five years doesn't exist yet, but if you look at these three example cases, we can see that your best bet for an audit in the future is good version control and repeatability of your pipeline. You have to be able to flow back and forth in your pipeline to comply with these kinds of regulatory requirements in the future.

Eero Laaksonen: 08:12 Then, if you look at an individual step of the pipeline, there's a bunch of stuff that needs to be saved, and I have gathered on this slide a few main topics on what's needed when you're running any kind of step of the pipeline. First of all, there are the obvious things like the training code and the actual model, but as we already talked about earlier, the dataset and the versioning of the data are equally important, because the model is actually a combination of code and data. You're also usually running that code over multiple parameters, doing hyperparameter optimization, so on the input side, you should always be able to tell that this particular model we generated was a combination of this training code, these parameters, and this particular dataset.

Eero Laaksonen: 09:16 Then, if we start looking at what happens during execution time, there's a bunch of important stuff there too. First of all, the hardware used. Let's say you get new people on your team and they want to run some training script that was run by someone else, possibly a person who has already left the company. It can be really complicated to see what kind of hardware is required, especially if you're doing more complicated tasks, for instance deep learning, where you might have five different nodes with some number of GPUs required. It can be really hard to set up these clusters later on for repeated training, so you should always have some sort of information on the actual hardware used for any step in order to reproduce it.

Eero Laaksonen: 10:06 Then, package management or environment; for instance, Docker and containers provide a great tool for this. Then, thirdly, experiment cost. Especially if you're working in the cloud, it might seem like, yeah, you get all the costs, but what cloud providers really give you is the overall price for, let's say, a month of computation. To be able to really tell that this project and this particular experiment cost us this much money, that's a big deal for the traceability of your work.

Eero Laaksonen: 10:47 Then, on the output side, there is the actual model obviously, but also all the log output during execution time, to be able to do debugging. Then, the results for your model; this can be very complicated. In many, many cases, it's done manually using something like Google Sheets or Excel, which can be really hard, especially when the number of experiments goes up. Then, lastly, the hardware statistics. We often see cases where customers are utilizing some large computation hardware, but they're not able to transparently see how much GPU memory or CPU power they're actually utilizing, so it's good to have an understanding of how much machine performance you were actually using. Maybe you should increase the batch size and so on.
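For illustration, here is a minimal sketch of recording that per-run metadata: the inputs (code commit, data version, parameters) and outputs (model, logs) written to a small manifest committed alongside the code. Every file name, field, parameter value, and command below is a hypothetical placeholder, not taken from any specific tool discussed in the webinar.

```bash
#!/bin/bash
# Hypothetical sketch: capture the inputs and outputs of one training run
# in a small JSON manifest so the run can be traced later.

GIT_COMMIT=$(git rev-parse HEAD)              # version of the training code
DATA_SNAPSHOT="training_volume_snap_001"      # identifier of the data version used
PARAMS='{"learning_rate": 0.001, "steps": 300, "dropout": 0.9}'

python train.py --params "$PARAMS" > train.log 2>&1   # assumed training entry point

cat > run_manifest.json <<EOF
{
  "code_commit":   "$GIT_COMMIT",
  "data_snapshot": "$DATA_SNAPSHOT",
  "parameters":    $PARAMS,
  "docker_image":  "tensorflow/tensorflow:1.12.0",
  "instance_type": "c5.xlarge",
  "outputs":       { "model": "model.h5", "log": "train.log" }
}
EOF

git add run_manifest.json && git commit -m "Record training run metadata"
```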

Eero Laaksonen: 11:47 That's the sort of information and metadata that relates to machine learning, and all of this should be somehow tracked during and throughout the whole pipeline. Obviously, one of the most complicated parts of this is the actual data system and managing the versioning of the data. For that part, I'm going to hand it over to the MapR side.

Ian Downard: 12:12 Thanks, Eero. This is Ian Downard. I'm just going to share my desktop, and can somebody confirm that you can see me?

David: 12:25 Yeah we can see it, Ian.

Ian Downard: 12:26 Okay, and you can see my slides?

David: 12:32 Yep.

Ian Downard: 12:33 Okay, great. As I said, my name is Ian Downard. I'm a developer evangelist for MapR, and just going to continue on that thread. We're going to kind of double click on what it means to do version control for machine learning, why it's important, and what the crux is or the most difficult part is.

Ian Downard: 12:52 Machine learning is something that involves lots of trial and error. It's not something that's often talked about. Most of the hype around machine learning is about algorithms, and most of the publications that you read that come out of academia are about algorithms, but whenever you try to deploy these things to production, the very first decisions you're going to be making are not qualitative decisions. They're kind of experimental decisions. I'll just double click on a machine learning scenario, and you'll see what I mean.

Ian Downard: 13:37 This past year, I've been doing a lot of work with time series forecasting. You could think of a stock ticker symbol, trying to forecast the next price for a stock, or in my case, it was predictive maintenance. I was trying to predict when machines would fail based on IoT sensors that were providing any number of data points every second or every couple of seconds. It's a time series dataset, and in order to forecast the next value of the time series, or forecast anything about it, a lot of times you'll use recurrent neural networks, and when I first started learning about this, I immediately had questions like, okay, there are lots of different recurrent neural networks. Which one should I use?

Ian Downard: 14:31 I ultimately went with something called LSTM, which stands for long short-term memory networks, but there was nothing, no guidance that said LSTM is the best, it's going to give you the most accurate results, the fastest inferences. There was nothing about LSTM that I could see that would prove it would be the best case for my scenario. This is common for almost every situation where you're trying to do machine learning. You really have to experiment, and maybe hopefully somebody else has gone down a path and given you some data that shows that a certain algorithm approach works for them, and their scenario's close to yours, but because we're all in different businesses and verticals, it's really hard to do this especially if you're trying to be competitive and come up with unique ways of doing it.

Ian Downard: 15:23 Even once you decide what algorithms to use, you've got a lot of questions, such as feature selection. Features are those signals; they're basically variables in the data, and you're trying to create a model that generalizes, like correlating certain sensor values to machine failures or correlating certain market derivatives to a particular stock. Selecting which variables correlate to the events or values that you're trying to predict is called feature selection, and it's a process of trial and error to figure out what variables are going to correlate strongly with the events that you're trying to predict.

Ian Downard: 16:13 In addition, for time series data, you feed the model with chunks of prior windows in the time series, so how far back do you go? You can't feed it weeks' worth of time series data; the model doesn't work that well, so there's a sweet spot somewhere between maybe 10 minutes and 10 hours, but that's really broad. The only way you can really measure how well it's going to perform is to try things out. In addition, what I showed in the previous slide was an image of a recurrent neural network. In that image, you've got a chain of what they call hidden layers. You chain together these different components, but you can also stack them, like we've got in this window. How many you chain together and how many you stack is, again, another part that you discover through trial and error.

Ian Downard: 17:18 Then, there are parameters to each of these networks, so trial and error is just pervasive throughout the whole process. All those configurations, all those different models, model configs, the neural network code that you write, that's easy to version control. We're used to using things like GitHub to do that. That source code fits nicely into the traditional version control technology that we're used to using for any kind of source code, but when it comes to data, it's much different. For example, GitHub has a limit on the maximum file size of 100 megs. That clearly isn't going to work for training data.

Ian Downard: 18:02 In addition, you can't just copy data. If you're trying to back up training data or create versions of it by copying it, you're really going down the wrong path, because copies are slow and copies take up too much space. They easily won't fit on a single machine, and the more versions you create, the more that data is going to grow, and you can never really delete it. Furthermore, you can't check massive quantities of files or large files into traditional version control, so snapshots are really much, much better.

Ian Downard: 18:43 Snapshots are a capability that's been part of enterprise-grade data storage systems for a long time, and they're generally really fast. You can create a snapshot extremely fast for a large data volume. They initially don't take up any space because they're just creating pointers to data blocks; they're not actually copying any data, and I'll describe this further. Furthermore, these snapshots can be identified with unique identifiers, and you can check those into version control along with all your models and model configurations.

Ian Downard: 19:27 Another nice thing, depending on the storage system, is that snapshots can preserve file ownership properties, so if you have a particular user that's got permission to see particular datasets, you can have file access control defined to limit who can see that data. When you create file copies, the ACLs, or access controls, are changed depending on which user is actually copying the file. With snapshots, all those ownership properties are preserved, so there are security benefits with snapshots as well.

Ian Downard: 20:03 MapR has built snapshots as a first-class citizen in the platform, so that anything that runs on MapR can benefit from them. Before we go much further, I should just introduce what MapR is. MapR is an enterprise-grade data storage system. It scales extremely well, and it unifies data storage for files, tables, and streams. Because of its ease of management and because of its scale and performance, it's commonly used for massively large datasets used for machine learning and big data analytics. It's also a POSIX-compliant file system, so MapR uses standard APIs for accessing data, and because it's POSIX-compliant, you can pretty much run any application on top of MapR.

Ian Downard: 21:07 MapR snapshots, as I said, are built into the platform from the ground up, and any application that runs on MapR can benefit from them. Like snapshots on other enterprise data storage systems, they are immutable. The data that's in the snapshot cannot be changed; it's read-only, and they store only the incremental changes you need to roll back to a point in time.

Ian Downard: 21:32 One of the amazing things about this is you can take a snapshot of a one-petabyte cluster in seconds with no additional data storage required, and that's just amazing to see. It really is a reality. As for the way these are implemented: in a distributed data storage system like MapR, files are written to disk using blocks of disk storage, so a file can be made up of multiple blocks, and whenever you copy a file, you've got to copy every single block. That processing complexity, the time it's going to take to copy every block, is linear in the file size and the number of blocks, and the storage complexity is also linear. I'm referring to Big-O complexity, and we can do much better than linear complexity for the processing and storage of snapshots.

Ian Downard: 22:38 Here's how snapshots work. When you take a snapshot of data, it just creates a pointer to each of the blocks, and it saves that as the snapshot. Note that it's not copying any data; it's just reusing the existing storage blocks. Whenever the fully read-writable file changes one of the blocks, it will create new block storage, but the snapshot will keep pointing to the old version, so in this case, you can see Block C now has two places in storage.

Ian Downard: 23:19 Whenever another snapshot comes along and it's captured, it then points to that latest version of the Block C that was in the fully read-writable file, and so the thing to note here is there is no data duplication between the file and the snapshots. The complexity of snapshots is really based on the number of changed blocks. This is going to be much faster, require much less storage space than file copies, and the processing speed is roughly logarithmic, so it takes about a second to snapshot a small volume, and it will take like several seconds to snapshot a large one. That's kind of the result of having this logarithmic processing speed.

Ian Downard: 24:17 Creating snapshots is really easy, and there are a lot of options for doing it. You can go into the MapR console, and I'll show what this looks like; you just click create snapshot. There's also a command-line utility, maprcli, which is a command that people can use to administer and interact with the MapR platform, and there's a REST API for it as well.
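As a rough sketch of the two programmatic options just mentioned, assuming placeholder volume and snapshot names (exact flags, ports, and credentials may vary by release and cluster setup):

```bash
# 1. maprcli command-line utility
maprcli volume snapshot create -volume ml.training.data -snapshotname exp-042

# 2. REST API (typically served by the MapR webserver on port 8443)
curl -k -u mapr:mapr \
  "https://mapr-node:8443/rest/volume/snapshot/create?volume=ml.training.data&snapshotname=exp-042"
```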

Ian Downard: 24:44 Also, MapR snapshots are not just for files. This is a big differentiator from every other storage system out there. MapR can be used to store files, tables, and streams, and when you take a snapshot of a volume, it includes all of the tables and streams that are in that volume. Once you have a snapshot, you can recover those tables and streams with the mapr copytable and mapr copystream commands.
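A hedged sketch of that recovery step, copying a table and a stream out of a read-only snapshot back into the live volume (paths and names are placeholders; check the exact options of mapr copytable and mapr copystream for your release):

```bash
# Table: copy from the snapshot path to a new live table
mapr copytable -src /my_volume/.snapshot/exp-042/yelp_table \
               -dst /my_volume/yelp_table_restored

# Stream: same idea for a stream, including its topics and cursors
mapr copystream -src /my_volume/.snapshot/exp-042/my_stream \
                -dst /my_volume/my_stream_restored
```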

Ian Downard: 25:18 Also, snapshots are really useful for cases where you have a rapidly changing dataset. If you've got real-time data sources that are constantly ingesting data into MapR, but you want to tune SQL queries, maybe optimize them for speed, or maybe you're doing some kind of analytics with another analytical tool like Tableau, you want to be able to control what it is you're analyzing. For A/B testing on SQL queries, you want that underlying data to be unchanging, so you can do an apples-to-apples comparison between the different SQL queries or analytics that you're running.

Ian Downard: 26:04 One of the nice things about snapshots is that you can access them with Apache Drill or with any other SQL engine. Apache Drill is part of the MapR platform, and as you can see here, you can query the actual snapshot directly. Snapshots are contained inside a hidden directory called .snapshot, and the name of the snapshot is something you can specify when you create it. I'm going to do a quick demo in my MapR cluster, which I have here. This is the administrative utility that I have. I have just a couple of volumes created here, so what I'm going to do is create a new volume, add some data to it, and create a snapshot, just to show you what it looks like.
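For illustration, querying data inside a snapshot from Drill's sqlline shell might look roughly like this; the Drill install path, ZooKeeper address, volume, snapshot name, and JSON file are all placeholders, not the exact ones used in the demo:

```bash
/opt/mapr/drill/drill-1.14.0/bin/sqlline -u "jdbc:drill:zk=mapr-node:5181" \
  -e "SELECT name, stars FROM dfs.\`/demo_volume/.snapshot/exp-042/yelp_business.json\` LIMIT 10;"
```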

Ian Downard: 27:04 The commands that are going to run are in this script called snapshot demo. I'm going to play these really quickly, but all these commands are actually going to run on the cluster. I'm using this utility called do-it-live, which is pretty cool; it just allows me to play a shell script really quickly. When I create a volume, I'm going to use the maprcli volume create command, and that'll just take a second. Then, I'm going to copy a Yelp dataset into the cluster and use that Yelp dataset to import into a MapR Database table. You'll notice I'm using standard Linux commands to do this. This is only possible because MapR is a POSIX-compliant file system. Here, I'm just copying that JSON file, which is about one and a half gigs, to the path of the volume that I just created, and I use this importJSON command to import that JSON into a MapR Database table.
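Approximate versions of the commands described in that step are sketched below; the volume, cluster, dataset, and ID field names are placeholders, and flags may differ slightly between releases:

```bash
# Create a volume mounted at /demo_volume
maprcli volume create -name demo_volume -path /demo_volume

# Copy the Yelp JSON file in with ordinary Linux tools (possible because the
# MapR file system is POSIX-compliant and mounted under /mapr)
cp yelp_academic_dataset_business.json /mapr/my.cluster.com/demo_volume/

# Import the JSON documents into a MapR Database (JSON) table
mapr importJSON -idField business_id \
  -src /demo_volume/yelp_academic_dataset_business.json \
  -dst /demo_volume/yelp_business_table
```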

Ian Downard: 28:12 To create a stream, I can use the maprcli stream create command, and again, I'm specifying my volume as the destination for that stream. Then, to write some data to the stream, I'm just going to write five messages: one, two, three, four, five. Then, to consume them, I've written a Python script that uses the Kafka API to consume messages from my stream. Here, you can see the messages that were in there. Now, the important thing about streams is that it's not just message data. The other important thing about streams is that you have cursors, so when applications are reading from a stream, if for some reason the application crashes and restarts, it can use its cursor to continue reading from where it last left off.

Ian Downard: 29:01 You can see my window go up to [inaudible 00:29:04] here, but the committed offset is five, and the producer offset is four. It's basically saying that the consumer, the cursor group, which I've called "my group" and defined in my Python script, is up to date. It's read all of the messages. If I add more messages to my stream and then look at the cursor again, I can see that the cursor is no longer up to date, that there are more messages in the stream, so the stream is now five messages ahead of it.
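A rough sketch of the stream side of that demo is below. Stream, topic, and permission values are placeholders, and the exact flags should be verified against your MapR version:

```bash
# Create a stream (and a topic) inside the volume
maprcli stream create -path /demo_volume/demo_stream -produceperm p -consumeperm p
maprcli stream topic create -path /demo_volume/demo_stream -topic numbers

# Inspect consumer cursors to compare committed vs. produced offsets,
# as described above
maprcli stream cursor list -path /demo_volume/demo_stream -json
```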

Ian Downard: 29:36 I'm just going to take a snapshot now, and we're going to see that all of this data is contained in the snapshot. When we create snapshots, we can use the volume snapshot create command, and once I've created it, I should be able to see it here on the volume that I created. If I go to snapshots, I can see the snapshot that I just created, and I can create another one. Now, I've got two snapshots. When I list the snapshots, you can see there are these two properties, owned size and shared size. The first snapshot has 153 megabytes in it, and it's not sharing any data with any prior snapshots because there were none. For the second one, I made no changes to any of the files, so all of the data in that snapshot is shared with the first snapshot. This is illustrating the fact that we're not duplicating data with snapshots.
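Listing snapshots from the command line might look roughly like this; the owned-size and shared-size fields discussed above appear in the output, though field names and filtering options can differ between releases:

```bash
# List snapshots (the JSON output includes owned and shared size per snapshot)
maprcli volume snapshot list -json
```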

Ian Downard: 30:47 If I go into my volume and list the hidden directory, I can see both of my snapshots, and then I can list the files in there, and I can copy them back into the volume like this. I just created another version of the file, which was the Yelp dataset. I can copy the table over from what was my table to my table two, and the same with the stream. Then, just to verify the ACLs are the same: here are all the files that I have in my volume, and these were the ones restored from the snapshot. We can see the ACLs for these files are all identical, same file owner, same group owner, same access control, and you can use the stat command. Again, because we're working with a POSIX-compliant file system, we have access to all the file metadata, so you can see the file size is going to be the same, and all those ACLs are the same.
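Those are plain POSIX commands against the hidden .snapshot directory; a sketch with placeholder cluster, volume, and snapshot names:

```bash
ls /mapr/my.cluster.com/demo_volume/.snapshot/                # list snapshots
ls /mapr/my.cluster.com/demo_volume/.snapshot/exp-042/        # files inside one

# Restore a file version simply by copying it back into the live volume
cp /mapr/my.cluster.com/demo_volume/.snapshot/exp-042/yelp_academic_dataset_business.json \
   /mapr/my.cluster.com/demo_volume/yelp_business_v2.json

# Ownership and permissions are preserved; compare with stat
stat /mapr/my.cluster.com/demo_volume/.snapshot/exp-042/yelp_academic_dataset_business.json
stat /mapr/my.cluster.com/demo_volume/yelp_business_v2.json
```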

Ian Downard: 32:00 If we do a diff, there are no differences. We can run difftables and see that there are no differences between the tables; the metadata of the tables is the same. If we look at the actual output, any differences would show up here, but there are none, and we can run diffstreams as well. There's nothing in the output of diffstreams, so we know that they're the same. Now, let's look at the cursors. We'll just consume from our original stream, where we've got our five messages, and now, if I consume from the stream that I recovered from the snapshot, we've got the same messages there, because the cursor was persisted.
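A heavily hedged sketch of that comparison step, assuming the mapr difftables and mapr diffstreams utilities referenced above with placeholder paths (confirm the exact flags, including any required output directory, for your release):

```bash
mapr difftables  -src /demo_volume/.snapshot/exp-042/yelp_business_table \
                 -dst /demo_volume/yelp_business_table_restored \
                 -outdir /tmp/difftables_out

mapr diffstreams -src /demo_volume/.snapshot/exp-042/demo_stream \
                 -dst /demo_volume/demo_stream_restored \
                 -outdir /tmp/diffstreams_out
```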

Ian Downard: 32:56 That's the end of the demo. I would like to conclude by saying one of the best use cases I've seen for MapR snapshots is with Valohai, because through their orchestration and version control for keeping track of all your experiments, they can really take advantage of MapR snapshots to make the steps where you back up your data and version the data really quick. If you want to see a demo of how Valohai uses MapR snapshots, we're going to show that now, and we've also recorded it and put it on YouTube, so there's the URL for that.

Ian Downard: 33:41 Here are some references to learn more about MapR snapshots. Just to wrap up: MapR provides really first-class support for snapshots. You can use them out of the box, you don't need any special application modifications to use them, and they've been proven; they've been part of the MapR platform for years. That's the end of my presentation. Juha?

Juha Kiili: 34:16 Yes. I'm sharing my screen. Can someone confirm that they're seeing it?

David: 34:27 Not yet, Juha. There we go.

Juha Kiili: 34:41 Okay. Hello, everyone. Yeah. Hello, everyone. My name is Juha, and I'm from Valohai. I'm a senior software developer. I'm going to show a really simple demo of Valohai and MapR working together in a machine learning context. First, we're just going to see what Valohai is, so let's create a project. This is Valohai that I'm hovering over here, so we'll create a project. We'll call it Hello World. Valohai is a deep learning management platform. In Valohai, we can run training executions, and I'm going to link a GitHub project, so a GitHub repository and the [inaudible 00:35:56]. I have a GitHub repo here,

David: 35:58 Juha? Juha, we lost your screen share. Can you share your screen again?

Juha Kiili: 36:07 Okay, sorry.

David: 36:09 No problem.

Juha Kiili: 36:10 Can you see me again?

David: 36:12 Yeah. We can see it.

Juha Kiili: 36:13 Can you see me? Okay, so in this GitHub repo, I have just two files: a dummy text file, and then there's the Valohai YAML file. This YAML file just fills in the defaults for the platform. We're going to link this GitHub repo to the Valohai project. There we go, so the project is now linked to the GitHub repo, and we can run our first execution. You need a few things for an execution. You need an environment, and we have AWS, we have Azure, and we also have Google, although it's not listed here, so we're going to pick one type of instance here. Let's take something cheap, so there we go.

Juha Kiili: 37:28 Something with no GPU. Another thing you need is a Docker image, and we're going to use BusyBox. It's a really simple Linux image, simple Linux stuff. Then, you need code. This comes from the GitHub repo we've linked to this project, so we have an environment, we have a Docker image, and we have some code. Finally, you need some command to run, so let's print out "Hello World". Let's do another thing too and print out the contents of the text file. So we have our things set up, and we're going to run an execution. This is not really training a machine learning model yet. It's just an example.
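As a minimal sketch of what the valohai.yaml described above might contain, written here as a bash heredoc; the step name, image, and commands are assumptions based on the spoken description, not the exact file used in the demo:

```bash
cat > valohai.yaml <<'EOF'
- step:
    name: hello-world
    image: busybox
    command:
      - echo "Hello World"
      - cat greeting.txt
EOF
```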

Juha Kiili: 38:22 We can see we started a server instance in AWS. We downloaded the Docker image. We started the container. We downloaded the code from the repo, and we can see our printouts from the commands. This was our first execution. Let's look at our project. We can see the first execution listed here. Let's run another one just for the sake of the example. We run another execution, and when it's finished ... Here's the list of executions now. We have two. This is our version control power: you run executions, and they store all the data needed to reproduce each execution. Six months later, I can come back. I can see that I ran this on this type of environment, I used this Docker image, I executed these commands, and what the cost was, and if I want to reproduce this, I just copy it and create another execution. I could tweak some values and figure out, like, maybe something went wrong, and I want to figure out what happened six months ago. That's version control for you.

Juha Kiili: 40:05 That was a simple example of Valohai, so let's look at this from the MapR perspective. Ian had set up a MapR cluster, and I'm going to log in. I'm going to run a script to set up my data, so now, this MapR cluster has the MNIST dataset. We're calling it the light MNIST dataset because it doesn't have all the images, only 10% of them. I've also set up a MapR project here. If you look at the settings, these are the environment variables: this is the IP of the cluster, this is the volume in the cluster that we're using, that's where the data is, and we also have our username and password here. We're going to actually take this, and again, we can go and see.

Juha Kiili: 41:09 Yes, this is the same tool shown earlier, so this is the volume we're using. Here are the snapshots, so these snapshots are already there. Now, we are going to run a new execution, and hopefully, we'll see a new snapshot appearing here side by side. Okay, for the execution, what we're going to do is train a simple model. Here is the command we're actually running on the server. The script that is in charge of creating the snapshot uses the REST API and tells the cluster that it needs to create a snapshot, so the data scientist running these executions doesn't have to remember to do it. It's always done automatically.
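A hedged sketch of what such a snapshot-creation script inside the execution might do: call the MapR REST API before training so each run gets its own data version. The host, credentials, volume variable names, and naming scheme below are placeholders mirroring the environment variables mentioned above, not the actual script:

```bash
SNAPSHOT_NAME="valohai-$(date +%Y%m%d-%H%M%S)"
curl -k -u "$MAPR_USERNAME:$MAPR_PASSWORD" \
  "https://$MAPR_HOST:8443/rest/volume/snapshot/create?volume=$MAPR_VOLUME&snapshotname=$SNAPSHOT_NAME"
echo "Created MapR snapshot: $SNAPSHOT_NAME"
```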

Juha Kiili: 42:09 The other command is the actual training. We also have some hyperparameters here, like in any other machine learning experiment: the learning rate, how many steps we're going to do, dropout, et cetera. Okay, so let's run this. Again, we're pulling the Docker image, pulling our code, the actual model. You can see the details. Everything is getting version controlled, so later, we know what the environment was, which step we did, which image we used, what the commands were, and what the hyperparameters were.

Juha Kiili: 43:07 It's taking a while because we're pulling the Docker image. Let's see, any second now. Okay, here we go. Here, we can see that it's creating a MapR snapshot. This is the ID of the snapshot. Now, let's look at the MapR UI here. It should be ... Again, it's refreshed. Okay. Here we go. I see that this ID is the same as the one we created, so the snapshot was created. The data scientist didn't have to remember to create a snapshot; it was created automatically. You can see the training has already finished, with an accuracy of 93%, and there's also a nice little graph here. Now, what we're going to do next is add some more data. We only used 10% of the MNIST dataset, so now, we're going to run another script.

Juha Kiili: 44:36 We're going to call this the full MNIST dataset. The data scientist is now going to run another training execution. It's going to be the same command and the same parameters, but the data has changed in the MapR cluster. Okay. Again, we can see a new snapshot was created. We can validate this from the MapR UI. Here we go. We can see another ID has appeared, and our training is running, and now, we should get a little bit more accuracy because we have more data; we want to hit 95. Okay, it's done. You can see that the model is getting saved, so that's also version controlled, of course. The outputs are here. These are the weights and biases of the network.

Juha Kiili: 45:50 Most importantly, these are our executions. The first one was 93% accuracy, and the second one was 95. Again, if six months have gone by and we come back, let's say this model with 93% accuracy is in production and it has done something wrong, maybe it was a self-driving car that was in an accident, and we have to come back and look at the version control to see why it failed, what data was being used, what the parameters were, what the environment was, what the Docker image was. Everything is version controlled here. Full reproducibility: let's say I want to reproduce this execution six months later. Actually, I'm going to go and first copy the ID of the data, the ID of the snapshot, so I can reproduce it, and then copy the execution.

Juha Kiili: 47:01 What I'm going to do is add a new environment variable, which is the ID of the snapshot, so this will make sure that the dataset is the same as it was six months ago, and everything else is the same: the environment is the same, the code is the same, the Docker image is the same, the command is the same, and the data is the same. Let's run the execution. We should see 93% accuracy because we're using the old data. The latest data in the cluster right now is the full dataset, which gives us 95%, but because we used the old snapshot, we should get 93.
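An illustrative sketch of how the training script could resolve its data path from that environment variable: if a snapshot ID is provided, read from the frozen .snapshot directory; otherwise, read the live data. The variable names, paths, and hyperparameter flags are assumptions, not the actual Valohai step:

```bash
DATA_ROOT="/mapr/$MAPR_CLUSTER/$MAPR_VOLUME"
if [ -n "$MAPR_SNAPSHOT_ID" ]; then
  DATA_ROOT="$DATA_ROOT/.snapshot/$MAPR_SNAPSHOT_ID"   # pinned, unchanging data version
fi
python train.py --data-dir "$DATA_ROOT/mnist" --learning-rate 0.001 --steps 300 --dropout 0.9
```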

Juha Kiili: 47:56 There we go. We reached ... Well, it was 94, but it's close. Here, we can see the first line of the log is "using existing MapR snapshot," so this is the first snapshot we used. We were able to reproduce the whole execution down to every detail. Here, we can see all of our executions. This is version control. Okay, so that was the demo. Thank you.

David: 48:45 Great. Thank you, Juha. Ian, I know you've been answering a few questions via the chat. There are a couple more questions that I'll ask here. Let's start with one for you, Ian. Are MapR snapshots the same thing as Hadoop snapshots?

Ian Downard: 49:10 Hadoop, I assume that person is asking about the Hadoop file system, HDFS as it's called. HDFS does have snapshotting capability, but it's fundamentally different. Hadoop is implemented differently on MapR. For example, it doesn't have the same NameNode architecture. The NameNode is often a bottleneck to how Hadoop scales, and it's the service that's used for saving metadata. MapR uses a different architecture, a more distributed approach for storing metadata across the cluster, so it doesn't have that single point of failure. The Hadoop implementation on MapR depends on a different file system, so it's not HDFS. It's actually known as the MapR file system, and that's the file system where the snapshot capability is implemented, so that's why when we say that everything that runs on MapR benefits from snapshots, it's because snapshots are built into the file system.

Ian Downard: 50:24 MapR snapshots differ from HDFS snapshots in many ways. HDFS, because of its reliance on the NameNode, can actually have inconsistencies. For example, if a file is being written to and somebody takes a snapshot before that write to the file ends, so the file handle is open when the snapshot is taken, that snapshot won't really be complete until the file handle is closed and the save is complete. That means it's not a true point-in-time snapshot, whereas with MapR, it really is a point-in-time snapshot. It doesn't wait for file handles to close on MapR, so consistency is the biggest difference between HDFS snapshots and MapR snapshots. When you take a snapshot on MapR and that snapshot is complete, every node on the cluster, and every other cluster that's replicating to this cluster, has the same view of the snapshot, whereas with HDFS, it's possible for one person's view of a file in the snapshot to be different from another person's. So applications that depend on snapshots and assume them to be consistent can't be implemented on top of HDFS snapshots, but they can on MapR, and that's really the biggest difference. I hope that answered the question.

David: 52:15 Yeah. Thank you. That was good, Ian. Hey, Ian, there's a ... Ian and Juha and Eero, there's a few more questions in the questions tab. Why don't you take a look through those? Ian, since you just answered-

Eero Laaksonen: 52:31 I can answer-

David: 52:31 Go ahead, Juha.

Eero Laaksonen: 52:33 Yeah. I can answer one of them, which I think was a cool question, about whether it makes sense to version the Docker containers too. The approach that we have taken is that you use some sort of external tool like Docker Hub to version your Docker containers, and the thing that we see on the training side is that the containers don't actually change that often. That's why we decouple the code and the container, whereas some people we've seen actually rebuild their containers every single time they train models, which is a huge overhead when the Docker containers actually tend to change just every three to six months, whereas the code changes several times a day. The way we've done it is, yes, we do version control the Docker images, but they are decoupled from the actual code that changes. That keeps the overhead during execution time to a minimum.
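One hedged way to realize the decoupling Eero describes: version the training image rarely, with an explicit tag pushed to a registry, and pin executions to that tag while the code is versioned separately in Git. Registry, tag, and script names here are placeholders:

```bash
# Build and publish the training environment image only when dependencies change
docker build -t myorg/training-env:2019.03 .
docker push myorg/training-env:2019.03

# Training runs reference the pinned tag; only the code checkout changes per run
docker run --rm -v "$PWD:/work" -w /work myorg/training-env:2019.03 \
  python train.py --learning-rate 0.001
```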

David: 53:42 Great. Thanks, Eero. Ian or Juha, is there any ... Do you see anything you want to jump to?

Ian Downard: 53:52 Yeah. I see another one related to that last one. Leopoldo asks, "MapR snapshot involves MapR Database. Does it also take a snapshot for HBase?" MapR Database is a database that can be used via the HBase API, so if you have applications that rely on HBase, you can implement them on MapR using MapR Database. Again, since MapR snapshots are a capability that's built into the file system, they apply to all data in the cluster, including MapR Database, HBase applications backed by MapR Database, MapR Streams (now called MapR Event Store), and files. Contrast that with HBase snapshots: HBase snapshots also have this consistency problem. They can't rely on HDFS snapshots because HDFS snapshots don't guarantee consistency, so at the end of the day, with HBase snapshots, one reader might see different data in the snapshot than another.

Ian Downard: 55:12 Are there any licensing implications? That's another question. You do have to license the MapR cluster to get the snapshotting capability. If you want to experiment with it, you can get free 30-day licenses from MapR, but it is a licensed feature that's part of the MapR enterprise license. Also a question, is this fully supported-

Ian Downard: 55:36 Go ahead.

Juha Kiili: 55:39 On the Valohai side, I guess for licensing too, yeah, you would need a pro or enterprise license on the Valohai side as well, but you can go and see the website for details.

Ian Downard: 55:54 There's a question, is it fully supported? Yes. I'm just reading through the rest here. There are several questions asking, what if a file changes, what happens to the data in the snapshot? Again, the snapshots are immutable. That means they're read-only. If you change a file, the fully read-writable file, it doesn't affect the snapshots. The only way to back up that data is to create another snapshot. So if you have a MapR Database table in a snapshot, you can read from it. If you have a file in a snapshot, you can read from it, but you can't write changes to the snapshot.

Ian Downard: 56:44 What else? There's a question here: if data is overwritten, in other words, if a file is changed and a snapshot only points to blocks, how can it save the information from a certain time if the blocks have been overwritten? The answer there is that when you take a snapshot, it just creates a pointer to a block at a certain time. If the read-writable file is changed, it duplicates that block, copies it into another place in storage, and then writes to that, but the original block is unchanged.

David: 57:39 Maybe one more, Ian, if you want to hit one more. Did you hit them all?

Ian Downard: 57:45 Last one here: if data is being replicated or moving when taking a snapshot, how can we ensure that the data is consistent? MapR snapshots are guaranteed to be consistent, and I think that is not something you need to test against. It's a guarantee as part of MapR. I think that we've answered all of them.

Juha Kiili: 58:18 I can extend on one of the questions. There was a question about saving the models. If you're using the combination of MapR and Valohai, we automatically save the models and their versions and tie them to the actual executions and data, so they don't have to be saved in snapshots. They're saved to whatever data storage you configure. It can be MapR, for instance. It will be saved automatically when using Valohai, so you don't have to manually manage that in any way.

David: 58:56 Great. Thank you, guys, so that is all the time we do have, so thank you Eero, Juha, and Ian, and thank you everyone for joining us. That is all the time we have for today. For more information on this topic and others, please visit mapr.com/resources. We will be sending out a follow-up email with a link to the recording and slides as well as some additional resources shortly after the event. Thank you again, and have a great rest of your day.