Data Management for AI and Machine Learning: Putting Dataware to Work


Ellen Friedman PhD

Principal Technologist, MapR Technologies

When it comes to building a successful AI or machine learning system, data is as important as the algorithm or model code. And it isn’t just the volume of data or data quality that matters - although both are very important. You also need efficient ways to manage data at scale, particularly for the special needs of machine learning, such as data versioning for training models, a reliable event-by-event history, and a way to archive exactly the data seen by a model in production.

Please join me to talk about these key aspects of the logistics of AI and machine learning systems and how you can make data management much easier and more reliable.


Stephen: 00:09 Hello and thank you for joining us today for our webinar, Data Management for AI and Machine Learning: Putting Dataware to Work, featuring Ellen Friedman, Principal Technologist at MapR Technologies. Our event today will run approximately 30 minutes, with the last 5 to 10 minutes dedicated to addressing any questions you may have. You can submit a question at any time throughout the presentation via the chat box in the lower left-hand corner of your browser.

Stephen: 00:35 With that, I'd like to pass the ball over to Ellen to get us started. Ellen, it's all yours.

Ellen Friedman: 00:40 Hello, good morning, afternoon, evening, depending on where you are in the world. Thank you so much for joining us today. First, I'm just going to provide a little bit of contact information. As Stephen said, I work as Principal Technologist here at MapR. I'm also a committer on two Apache open source projects and an author, writing mainly for O'Reilly Media. I'll repeat this contact information at the end.

Ellen Friedman: 01:09 Now, I want to start right off thinking about what really matters for data and code in AI and machine learning. I want you to think about this not the way people usually do, but to realize that machine learning is, in many ways, just a new way to program. So, let's look specifically at what happens for data and code. What matters? And by what matters, I mean what makes a difference, what's going to make these projects successful.

Ellen Friedman: 01:38 Well, the first thing to keep in mind is that the machine learning code part, the part that people start off thinking about, is actually an incredibly small part of the overall picture of what needs to be done. This diagram was copied from a paper by Sculley et al. from Google, Hidden Technical Debt in Machine Learning Systems. In it you can see the tiny red square in the middle; if you're not color blind, you can see it's a red square that says machine learning. That's really the part that people think of as the whole system. In fact, it is a really small part of what needs to be done. You see things here like data collection, feature extraction, data verification, analysis tools, all sorts of things. Monitoring. All of this needs to work correctly in order for a machine learning system, or an AI system, to be more than basically a science project. And the good news is, the place where there's the most variability is in that little red box of machine learning, the code, the algorithms. A lot of the work that needs to be done, and done well, in all of these other boxes has some commonality across different machine learning projects, and that makes it much easier to do.

Ellen Friedman: 02:57 Now another way to look at this: I ran across this blog from Josh Cogan, also an engineer at Google. What he looked at is the difference between expectations, what people think is involved in these projects, and the reality. If you take a quick look at this, again this is all color-coded here. What people expect, he found in surveys, is the big green area, the long stripe there: they think that you spend most of your effort optimizing the machine learning algorithms. In actual reality, the two largest areas combined, the blue and the orange areas, are collecting data and building infrastructure. That's where most of the effort goes. And some aspects of that are my topic today.

Ellen Friedman: 03:48 Now, machine learning and AI are only as good as the data they use. And that may seem in no way surprising, even for people who haven't previously worked with machine learning, as long as they are used to the idea, in any program, of garbage in, garbage out. And that's no different here. What is different? In a traditional software program, garbage in, garbage out doesn't change the program itself. It doesn't change the code. But this is entirely different with machine learning programs. You're using data. Data is part of the process of what actually produces the final model, the final script, the code. It's actually interacting with the code. Data is part of the process, and so the quality of the data that's going in, that's interacting with your initial script, is going to change what the final result is. That's quite different from what happens in traditional software programs, but a lot of the skills that are involved are the same.

Ellen Friedman: 04:52 Now, let's take a look, as a kind of reminder for those of you who are not familiar with machine learning, at a few of the basics. We'll just take a moment here. I just want to point out that you really have two completely separate parts of the process. One is training the model itself, and in this case you usually use more data; we're saying here customer data. The data sets that are used for training are generally larger than the data sets that are actually going to be used when you run the model, although they'll be of a similar type.

Ellen Friedman: 05:25 The other thing I want you to keep in mind is that machine learning is a highly iterative process, regardless of the tool you're using, and this is true almost without exception across machine learning and AI systems. It's highly iterative in development, when you are training a model, but even after you deploy models, you continue to monitor them, to see if a model is holding up in terms of the performance that you need, and, periodically, go back and either train a new model or retrain the existing one. This happens because, obviously, anything can break, but more importantly, these systems are fluid. They're interactive. They're interacting with the world, and the world doesn't stay static. The world will change. There will be a need for new models or updates or adjustments. This is different, again, from the traditional sense of writing a program. Don't think there's one model; there are many models. This is not write one, it's done, and then run. Nothing like that. There's always a lot of this in motion at any given time. We're going to look at the ways you can handle that and make it really effective.

Ellen Friedman: 06:37 Now, for all of you who are used to doing traditional programming, there's nothing new about the idea that you're doing versioning with regard to the code. You may have multiple people working on a project. You're going to have different branches. You're going to merge those branches. You need to track those different versions, and this can be done in a number of ways. Git is a very powerful tool. It's a way to keep track of versions of code whenever you are writing software. But the fact is that, in the case of machine learning, the data needs version control as well, for the reasons that we said. One way to think about this is that, not only are there these iterative aspects of the project, but suppose you need to go back and reconstruct what you've done. In the case of traditional programs, you could reconstruct that program, if you kept track of your versions, without access to the data, without regard to the data that the program is reading. But that is not true in machine learning. You can't reconstruct how you've built a model, how [inaudible 00:07:47] it is, without also knowing exactly what data was used and how that data was produced. So there's a whole different side of things that also needs to be tracked in a very effective way.
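To make the idea of version control for data a little more concrete, here is a minimal sketch of one common building block: fingerprinting a data set so that any later change to it is detectable. This is just an illustration, not part of the webinar's material; the directory layout is hypothetical.

```python
import hashlib
from pathlib import Path

def fingerprint_dataset(data_dir):
    """Compute a stable SHA-256 fingerprint over every file in a
    dataset directory, so a later run can verify the data is unchanged."""
    digest = hashlib.sha256()
    # Sort paths so the fingerprint does not depend on filesystem order.
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            digest.update(path.name.encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()
```

You could store this fingerprint alongside a Git commit of the training script; if the data is later modified, the recorded fingerprint no longer matches, which is exactly the kind of drift data versioning is meant to catch.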

Ellen Friedman: 08:01 You know, I used to work as a research scientist. I was a biochemist and molecular biologist. I did laboratory science. And we had this way of keeping track of things: we kept a lab notebook. You didn't just produce an end result. You didn't just walk up with a test tube of whatever it was that you were building or synthesizing. You kept detailed notes of how you got there. And here's an incident involving a woman I worked with. She had gone on to another lab, and she called me one day, distraught. She was doing a very simple procedure. She'd done it over and over again. She was very good at what she did. She was a technician, and she just couldn't get a decent result. She was thinking of quitting her job. I said, go back to your lab notes, look through every step; something has to have changed, because I know you're careful. And it turned out to be kind of a goofy thing. She went back and found that one of the chemicals she was using had come from a different lot, from a different supplier. It actually had a contaminant. It was nothing she was doing wrong, but there's really no way she could have tracked that down if she hadn't kept very clean notes about exactly which lot number, which batch she used. That's the equivalent of what you are doing with code versioning or data versioning.

Ellen Friedman: 09:20 And so it can really pay off in the end when you try to go back and reconstruct what you've done and somebody else wants to collaborate with you and work on a similar procedure or you yourself want to come back six months or a year later and reconstruct what you've done.

Ellen Friedman: 09:36 So that's why you need to do this. Let's take a little more of a look at what it is that you actually want to track and how to make that easy to do. Now, let's think just for a moment about basically a toy example, a toy project where you're doing a predictive analysis. This is for traffic to a Wikipedia site, so we're looking at how many hits there are from people looking up something about Christmas, just so we have data to look at.

Ellen Friedman: 10:10 This is actual data, so this model was built to predict the traffic over this period of time. The darker line here, this black line, is the actual traffic. The gray line shows the prediction, and you can see the model predicting the load of traffic from December 17 to the end of the month. From December 17, 18, 19, through the 21st, even through the 23rd, it was doing a remarkably good job at prediction. The predictions actually matched beautifully, but as we move forward to the 24th and the 25th, it begins to be off a little bit. Now, in terms of prediction, this is actually pretty impressive. This might be good enough, for whatever purpose somebody would be building a predictive model, if they were getting results like this. But they also might be curious why they weren't able to hit it exactly right as they came into the holiday, and the only way they could make those changes, if they want to rebuild their system, is to go back to not just the scripts that were used to produce the model, but the data that was used for training the model.

Ellen Friedman: 11:27 And one question about the training data is: how was it prepared? So we just have a tiny snippet here of training data, again using our toy model. What we notice, and we're just showing a few hours here, is that this looks like it was trained on data from the month of November. Thinking about it in an incredibly simple way, you might think, well, November is not good enough data to train for December. But what are your choices? Maybe that's the best you can do. Maybe November was a good choice because it's leading up to December; people are already thinking about the holiday. Maybe you need to train on data from December of a previous year. The point isn't this toy model. The point is you can't begin to pick apart what you're going to do, or how you want to change it, unless you can actually go back and understand exactly what data was used. And, in addition to understanding the training data as that body of data exists, you also need to know the process that was used to extract features and to prepare those features as training data from the raw data.

Ellen Friedman: 12:41 In this case, it looks like we can see target variables and different predictor variables. The kinds of things people do during feature extraction include deciding what features may be useful, and they may include all of the features in a training data set and then, with different models or different model runs, have the model actually select only a subset of those features. In this case, the data was very simple, but in going from the original data to the training data, only certain days were selected. The predictor variables are actually displayed differently; they are all displayed right together. There are a number of different changes. You might be bringing in data from multiple data sets. You might have real-time data coming from an IoT sensor, and you might be combining that with some kind of history of parts, maintenance, customer history, whatever it is. There are a number of ways to get from the wide range of raw data to what you're actually going to use as training data for your model. Whatever it is that you've done to produce that particular batch of training data, you need to track that, record that in some way. You need to record the process itself.

Ellen Friedman: 13:51 You also want to be able to freeze a copy of the training data that was used, so you can go directly back to it. Someone else might come in and modify that data before you come back to it later. That would be a problem. So those are two of the things that you definitely want to be able to track just in producing the training data. You also want to be able to freeze a point-in-time view of the raw data. You want that for two different reasons. You might want to change your data preparation process, so you need to get back to the same raw data. Or you may need to go back and find out, like my friend with the bad lot number, whether there was something wrong with the original data, something suspect. And more often than not, what happens is that people go back to raw data because they recognize that some of the features that were discarded in producing the training data were actually useful for a different project. So raw data is also a valuable thing to preserve, to freeze, to be able to get back to.

Ellen Friedman: 14:56 Now, the process that we've talked about, and this is a grossly simplified diagram of a really complicated process, but the pieces that I want to call your attention to: as I mentioned, raw data, which I've underlined here, is one of the things for which you want to not only describe what you've used, but actually be able to get back to an exact copy of the raw data you used. The data preparation process I've just put as a big gray block here. That can be multiple steps, and different for each project. You're doing that data preparation process to produce clean data, and here I've suggested it for multiple runs; you might have a totally different training data set for a different set of runs, where maybe you're using a different type of model.

Ellen Friedman: 15:45 One thing that is standardly done with training data, and this is another thing that you have to manage well when you're working with data, is that you'll pull out some part of that training data set as held-out test data. By held out, I mean it's not used to actually modify or train the model. This 95%/5% division that I put up is just arbitrary, but it is commonly used. It can be whatever number fits, but you do hold that out, and the data that's held out as test data needs to be randomly selected, so that's usually done with a seed number used with a process for random selection of data. That's another thing you need to track: how was that selection of the small percentage held out for test data actually produced? Now, there's an additional step in here, and we'll talk about it in future webinars, but another step that's often done is what's called cross-validation, where you actually take that 95% that is your training data, subdivide it into multiple parts, and pull out part of that so that you keep doing internal loops of validation.
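That held-out split can be sketched in a few lines. This is only an illustration of the idea; the 95%/5% ratio and the seed value here are just the arbitrary examples from the slide, and real projects would typically use a library utility for this.

```python
import random

def split_train_test(records, test_fraction=0.05, seed=42):
    """Randomly hold out a fraction of records as test data.
    Recording the seed lets you reproduce exactly the same split later."""
    rng = random.Random(seed)        # seeded so the split is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    # Return (training data, held-out test data).
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_train_test(range(1000))
```

The key data management point is the seed: if you record it along with the code version, anyone can regenerate exactly the same held-out set later.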

Ellen Friedman: 16:59 We'll talk about that another time, but you see that there are multiple steps here where you need a combination of some accounting of how you produced the data that you have, some track record of where it came from, and also copies of the data itself. I've underlined training data and I've underlined raw data, just as a reminder that you not only need to know how they were produced, you need to actually be able to get back copies of them as they existed at a certain point in time.

Ellen Friedman: 17:30 Now, the gray square on the right, the learning process, is standing in for the process where you actually have a program, a script for an untrained model, and you have it interacting in multiple loops over a certain period of time with training data. The model is actually changing in response to the data, and you may go back and start over again, change your various knobs, various settings or parameters in that code, in the program for the model, and try it over again. You keep going through this, and if you find anomalies in the data, you may need to go back, do an additional step of cleaning data, and start over.

Ellen Friedman: 18:14 At some point, with evaluation at each step, you begin to see what looks like a model that, with this type of data, is performing in a range that you find acceptable. That's your trained model, that's what you're going to deploy, maybe deploy into production, maybe deploy off to another team for further experimentation. It's the model that has emerged from this process, and what you need to be able to track, for any trained model, and there will be hundreds of them, is: what was the process that got you there?

Ellen Friedman: 18:49 Now that's a lot to deal with, but it is important to keep in mind. As we said, the data is part of the mix that produces the trained model, so you want to be able to track and version the data in the same way that you version code. We said people are very familiar with versioning code using Git as a tool. There are a number of different ways to version data for machine learning, and the one that I want to point out to you, because it's very convenient, is a way that you can do this without actually having to copy data. This is on the MapR data platform. MapR is a large distributed system that provides dataware with a number of capabilities built in that allow these things to be done, basically, natively. And in this case, I think a really useful capability is MapR snapshots. Snapshots are actually based on a data management feature called a volume, a MapR volume.

Ellen Friedman: 19:58 So we'll just take a moment to see what a volume is. MapR volumes basically act like a directory with superpowers. They span the cluster, and you can have multiple volumes in a cluster. They are a great way to have true multi-tenancy on a cluster. They are the basis for mirroring data. They are also the basis for snapshots, and, as I said, a real advantage for multi-tenancy. Also, just a quick reminder: with MapR dataware, and this is actually looking at a home directory in a volume, you have multiple data structures within that same volume, all parts of the same system. These are not things going through connectors, and so here you see files, tables, streams, and directories, all within this one volume. That means you can have a snapshot of all of that together, or you can mirror it to another system, in other words, actually make a copy of it very, very conveniently.

Ellen Friedman: 21:01 So back to the snapshots. The basic idea is that for a particular volume of the system, you might produce a snapshot, and the advantage of the snapshot, especially for data versioning for machine learning, is that the snapshot gives you a point-in-time version of the data, but it does that very inexpensively. Inexpensive in terms of storage costs. Inexpensive in terms of effort. It is not a copy; it actually points back to original data, but it's frozen. It's not a leaky snapshot. It's a real snapshot. You can easily go back to this. The snapshots have a path name, so they are very easy to find and work with. They can be created manually or on a schedule. They also can go away manually or on a schedule. It is a very convenient way to work with data and to be able to control who has access and who does not.
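As a hedged sketch of what creating a snapshot looks like in practice: MapR's administrative CLI, maprcli, has a volume snapshot command. The volume and snapshot names below are made up, and you should check the maprcli reference for the exact options on your version. This helper just assembles the command rather than running it:

```python
def snapshot_command(volume, snapshot_name):
    """Build the maprcli invocation that creates a named, point-in-time
    snapshot of a MapR volume (run it with subprocess on a cluster node)."""
    return ["maprcli", "volume", "snapshot", "create",
            "-volume", volume,
            "-snapshotname", snapshot_name]

# Hypothetical volume holding training data, snapshotted before a run.
cmd = snapshot_command("ml.training.data", "training-2018-12-01")
```

Because the snapshot gets a path name, the string you record (for example, the date-stamped snapshot name above) is all a later run needs in order to find exactly the frozen data.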

Ellen Friedman: 21:57 This is just a tiny fragment of the kinds of things that you'll want to track, but I'll just remind you: for training data, you'll want to keep track of the path name of the raw data snapshot, so you can actually go back to specific raw data. You want to keep track of the Git reference for the data preparation process, and another for the feature extraction, so you know how you came up with that data. Of course you would then also have a snapshot of the training data.

Ellen Friedman: 22:32 And, over on the right, you see the code side for the delivered model. You would be tracking, in something like Git or GitHub, the model itself, the script, but you also need the path name of the training data snapshot, because remember, the training data and the learning script together are what produced the model, and so you can have a pointer back to the snapshot that contains the training data for that run. You need a Git reference for the learning script itself. I remind you that might include the random number seed for generating the held-out data and whatever parameter settings you used for that particular run of the learning code. And there could be other things that you want to include, but this is just a sample, typical of the kinds of things that you need to include for data management that goes in conjunction with code and software versioning, so that you can actually understand, track, and get back to what you've built in machine learning systems.
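Pulled together, that per-model record can be as simple as a small JSON document. This is purely an illustrative sketch: every path, commit reference, and parameter name below is invented, and your own projects would track whatever fields apply.

```python
import json

def model_lineage_record(model_id, training_snapshot, learning_script_ref,
                         seed, parameters):
    """Bundle everything needed to reconstruct a trained model: the frozen
    training data location plus the exact code and settings that were used."""
    return {
        "model_id": model_id,
        "training_data_snapshot": training_snapshot,   # snapshot path name
        "learning_script_git_ref": learning_script_ref,
        "test_holdout_seed": seed,                     # reproduces the split
        "parameters": parameters,
    }

record = model_lineage_record(
    model_id="traffic-predictor-017",
    training_snapshot="/mapr/cluster/.snapshot/train-2018-12-01/wiki",
    learning_script_ref="git:3f2a91c",
    seed=42,
    parameters={"learning_rate": 0.01},
)
serialized = json.dumps(record, indent=2)
```

Storing one such record per trained model, of which there may be hundreds, is what makes "what was the process that got you there?" answerable months later.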

Ellen Friedman: 23:36 Now, data management for machine learning, like any other data management, requires fine-grained control over who does and does not have access. This is another aspect of MapR volumes that's very convenient. MapR dataware lets you easily manage who has access to the data in those volumes. There are access control expressions that, together with the universal path names, make access easy to control. There is a whiteboard walkthrough video on this by Ted Dunning, and there's also a written blog attached to that. We'll repeat all of these additional references and resources at the end, and we'll be emailing them out to you, so you don't need to worry about taking them down now. If you want to see a little more about how volumes work and how the access controls for volumes work, this is a short little video and blog; I think it makes it very clear.

Ellen Friedman: 24:33 Now, we've talked about things that people have to do. Stephen do we have a hard stop? I've gone a little slow today. Can we go over a little bit?

Stephen: 24:41 We can go over.

Ellen Friedman: 24:42 All right. I'm gonna plow on.

Stephen: 24:44 Okay.

Ellen Friedman: 24:46 We've talked about things that you need to do, and that's a lot. Now the fun part: these are things you don't need to do. One thing you shouldn't have to do, when you're working with a machine learning system and dealing with data for the system, is copy all of that data out of a data storage system, a file system, just so that you can process it. So I just remind you, for example, that when you are doing data prep, here in my diagram I've shown that being done maybe with Spark or Hive or some other tool like that; those big data tools that are very common often write out in the form of HDFS, the Hadoop Distributed File System.

Ellen Friedman: 25:27 But if you store this on an actual Hadoop system, then you have to copy the data back out in order for machine learning tools to read it, and there's a huge range of machine learning tools, and even traditional things like Java and Python, that are often used as part of machine learning. They generally read via POSIX. The nice thing about MapR dataware is that it's a real file system, fully read/write. You can store data that's being written out as HDFS, but it can then be read directly by AI and machine learning tools that are basically reading via POSIX. You don't have to have a separate cluster, and you don't have to have a separate step for reading data out. You don't need the sort of capabilities that are built into some of these big data science offerings or tools, where, if you look closely, a lot of the capabilities are things that just aren't needed, because they exist only to handle unnecessary steps based on copying data in and out when you're storing it as HDFS. So MapR makes this really simple and does it all in one system.
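Because the file system is mounted as an ordinary POSIX path, a tool like Python can read Spark- or Hive-style output (directories of part files) in place, with no copy-out step. A minimal sketch; the /mapr mount path in the comment is hypothetical, and on a real cluster the part files would live there rather than on local disk:

```python
from pathlib import Path

def read_part_files(output_dir):
    """Read a Spark/Hive-style output directory (part-00000, part-00001, ...)
    directly through the file system, concatenating lines in file order."""
    lines = []
    for part in sorted(Path(output_dir).glob("part-*")):
        lines.extend(part.read_text().splitlines())
    return lines

# On a MapR cluster this might be a path like:
# records = read_part_files("/mapr/my.cluster.com/projects/traffic/clean-data")
```

The point of the sketch is what is absent: there is no distcp step, no export job, just a plain directory read by a plain library.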

Ellen Friedman: 26:37 One reason this is very important: look at that icon up at the top that I've included. People use a large range of different machine learning tools. We've talked about that before. People who are used to doing this, and are very successful with it, usually keep four, five, six favorite tools that are working well, because no single tool fits every situation, so I've put in a few favorites here. H2O is phenomenally good software for machine learning, TensorFlow and so forth. And I've added this Stanford NLP, which was just released the week before last. The point is, whatever tools you're using today, you're going to want to use something else coming up, because new things are constantly being released. So, when you have a good foundation system like this dataware that gives you the ultimate flexibility in what tools you use, you can adapt and experiment with new tools without having to rebuild your whole system. You don't want to be locked into a data management and data storage system or tool that maybe does feature stores but is locked into certain machine learning tools, because then you don't have flexibility.

Ellen Friedman: 27:51 We said that all of this depends on efficient data access, both for convenience, so you don't drive yourself crazy as you're developing these systems, and for fast time to value. Another aspect of that kind of flexibility and speeding up time to value is being able to explore data, back to this idea of raw data. Which features are you going to want to extract? What should be the data preparation process? That becomes much shorter if you use a tool like Apache Drill. Apache Drill is a highly distributed, highly scalable SQL query engine. It's standard SQL, but it's also unusual in that it does schema discovery, and that means you can shorten the prep time. You don't have these days and weeks and months of ETL before you actually run a query. You can do that basically in minutes to hours. And that lets you explore data, which in a system like a machine learning system is really important as you get started. You're using many different data sources, and a starting point is just to be able to explore the data and really figure out what that data is.

Ellen Friedman: 29:05 And sometimes just the steps of exploring data show that you can actually build the solution you want; sometimes you don't even need machine learning. Sometimes a very simple program turns out to be sufficient once you actually understand what's in the data. So I highly recommend keeping an eye on, and taking a look at, this open source Apache Drill, which is unusual in its flexibility and capabilities for data exploration.
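Drill also exposes a REST endpoint for queries, so this kind of exploration can be scripted. As a hedged sketch: the file path being queried below is hypothetical, and this helper only builds the JSON body that Drill's query endpoint expects (you would POST it to the Drill web port, for example with the requests library). Note there is no schema declaration anywhere; Drill discovers the schema from the data itself.

```python
import json

def drill_query_payload(sql):
    """Build the JSON request body for a SQL query against Apache Drill's
    REST API. No ETL or schema definition is needed before querying."""
    return json.dumps({"queryType": "SQL", "query": sql})

payload = drill_query_payload(
    "SELECT `day`, hits FROM dfs.`/data/raw/wiki-traffic.json` LIMIT 10"
)
```

Querying raw JSON files directly like this is exactly the "minutes to hours instead of weeks of ETL" exploration step described above.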

Ellen Friedman: 29:31 Now, another thing you shouldn't have to do is worry about where your data lives. Suppose you work in one place and the data lives in another. Sometimes you want to run your programs right next to the data. Sometimes you want to copy the data back to where you are. You need a system that can do either of those. You should, however, be able to run your systems from data that is remote without having to copy anything back and forth, and that's what a system like MapR dataware lets you do. Here I'm using an example of data that's coming in via a stream, so stream transport: you have a data source, you have an application, a consumer. You probably have multiple consumers. That stream transport, in the case of MapR, uses the open source Apache Kafka API. It's a MapR stream, and again, it's part of the same system, the same code. It's all one technology in MapR, so that's included in a volume, but it has a special capability that really goes beyond what Apache Kafka can do. It has very efficient stream replication: you have multi-master, bidirectional stream replication, just as MapR has multi-master, bidirectional table replication.

Ellen Friedman: 30:52 And this means that, as a person who's building applications, or as a data scientist who's building machine learning models, you can access a data source that's where you are or that's halfway around the world. You can access data in the cloud. You can access data on premises. You can access data that's coming in from some edge source, and you can also deploy your model out to do some processing or some analysis or some learning right at the data source. All of those should be easy and open to you, and they are things that you don't have to program in at the application level. These are things that can be handled natively and seamlessly, essentially almost invisibly, so that it separates your concerns and you can focus on the models and the systems you're building. You shouldn't have to worry about these other things. They should be handled at the platform level to be really efficient.

Ellen Friedman: 31:53 So that's a larger lesson, not just for stream replication: look for dataware that can handle a lot of the aspects of logistics natively instead of you having to program all of those in at the application level. As mentioned, the MapR Converged Data Platform is dataware that can do this very easily, and we'd love to talk to you more about that. If you have questions about MapR itself, please ask questions or feel free to contact us.

Ellen Friedman: 32:24 These systems are built to do this from edge computing, on premises, in the cloud, and multi-cloud, basically to be dataware that lets you build essentially a data fabric, so that you just work across whatever system you want. Increasingly, people are doing their work with large-scale applications, in a predictable environment and very flexibly, by using containerized applications.

Ellen Friedman: 32:55 Just to remind you, in order to do that, you do need some system, some framework, to orchestrate those containerized applications and their resources, to be able to identify them, to look at where they're deployed. Kubernetes is emerging as a leader in being able to orchestrate the applications. In parallel, you need dataware that can actually store and persist state from those applications. In the case of MapR you can do that not only as files, but as streams or tables, but you need to be able to persist state so that you are not limited to stateless applications being run in containers. And this is a broad issue. It is not specific to machine learning, but I mention it because being able to deploy the huge number of models that you work with in machine learning via containers is an increasingly popular way to do it. And, again, people raise the question of whether they're doing this on premises or in the cloud or in a hybrid sort of architecture. They may be using cloud exclusively, or they may be using it to burst to the cloud for heavy computational loads, for example during training. People increasingly need to work across multiple situations.
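In Kubernetes terms, persisting state from a containerized application means requesting and mounting a volume into the pod. Here is a minimal sketch of a persistent volume claim; the name and size are invented, and the storage class and provisioner you would actually use depend on the volume driver deployed for your platform (MapR or otherwise).

```yaml
# Hypothetical claim requesting persistent storage for a containerized
# ML application, so its state survives container restarts.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-state
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```

A deployment then mounts the claim as a volume, which is what lifts the application out of the stateless-only limitation mentioned above.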

Ellen Friedman: 34:15 Here's a real company, a manufacturing company; we have a number of different customers who fit this same pattern. They have huge amounts of IoT data being produced out at the edge, and they do some edge computing. They have a private cloud on premises, and they make use of multiple public clouds, not only because they don't want vendor lock-in to a single cloud provider, but also because they are using different specific services from different clouds. Doing all of this on the MapR data platform, this dataware, with its capability for platform-level data replication and its global namespace with universal path names, lets them work basically seamlessly across all of it with a uniform computing environment. It makes the data portable. Literally, they see all of this as one system, and they have the ultimate flexibility to use these services, or not, as they like, without having to rebuild or duplicate their own system.

Ellen Friedman: 35:20 This becomes increasingly important for machine learning, where you are dealing with so many different models. A final idea here in terms of data management: often in machine learning, you need to have a long-term, event-by-event history. That might be part of the raw data you are using in data preparation to produce your training data. If you have a system such as Apache Kafka or MapR Streams (now called MapR Event Store), which uses the Kafka API, for transport, these systems can be really useful for this because you can set the time to live, that is, how long those messages persist, and you get both persistence and performance. In the case of MapR Streams, the stream is actually built into the platform, so each topic, and each partition within the topic, is distributed across an entire cluster. So, unlike with Kafka, you can set the time to live and have it be practical to do this not just for a few hours or a few days, but for months, years, or essentially infinity. That gives you a long-term, event-by-event history, and that is often a really useful part of the raw data, depending on the machine learning project.
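As a rough, hedged illustration of the time-to-live idea (this is plain Apache Kafka, not MapR-specific): in Kafka, per-topic retention is controlled by the `retention.ms` topic configuration. The config key below is a real Kafka setting, but the helper function and values are just an example sketch.

```python
# Toy sketch: expressing a long "time to live" for a Kafka topic.
# In Kafka, the retention.ms topic config controls how long messages are
# kept before they are eligible for deletion (-1 means keep forever).

def retention_ms(days: int) -> str:
    """Return a Kafka retention.ms value for the given number of days."""
    return str(days * 24 * 60 * 60 * 1000)

# Keep an event-by-event history for roughly two years, not just a few days.
topic_config = {"retention.ms": retention_ms(730)}

# With a client library such as kafka-python, a config like this could be
# supplied when creating the topic (e.g. via NewTopic's topic_configs).
print(topic_config)  # {'retention.ms': '63072000000'}
```

The point of the sketch is simply that retention is a per-topic setting you choose, which is what makes a stream usable as a long-term event history.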

Ellen Friedman: 36:36 This is a quick reminder that people are developing a number of different frameworks for managing data and for managing the many, many machine learning models you have as they're deployed into production. This particular production-level system is based on a streaming microservices approach, and it's called the Rendezvous Architecture. I wrote about this with MapR's Chief Application Architect, Ted Dunning, in a short publication with O'Reilly Media called Machine Learning Logistics. MapR makes that available for free, and we'll give you a link to a copy at the end.

Ellen Friedman: 37:13 I just want to point out, within this system where you are deploying models, you are using streams, MapR Streams (now called MapR Event Store), as the lightweight connectors between services. The key part to draw your attention to, when we think about data management, is up there at the top: you see something called the Decoy. These are models, so you have Models 1, 2, 3, up to 100, however many you have. And you have basically a program, a script, called the Decoy Model. It looks and works just like one of the models you are actually working with, but it doesn't actually do anything with the data; it just archives it. Why is that important? It looks like a service. The slide says "server" here; it should say "service". It looks like a service, but it just archives the inputs.
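A minimal sketch of the decoy idea (the class names and `score` interface here are hypothetical, just for illustration): the decoy exposes the same call signature as a real model service, but instead of producing a prediction, it archives every input it receives.

```python
import json

class RealModel:
    """Stand-in for a real scoring service (hypothetical interface)."""
    def score(self, event: dict) -> float:
        return 0.5  # a real model would compute a prediction here

class DecoyModel:
    """Looks and plugs in like a model service, but only archives inputs."""
    def __init__(self):
        self.archive = []

    def score(self, event: dict) -> None:
        # Record exactly the data seen in production; return no prediction.
        self.archive.append(json.dumps(event, sort_keys=True))
        return None

# The decoy is deployed alongside Models 1..N and fed the same input stream.
decoy = DecoyModel()
for event in [{"user": 1, "amount": 9.99}, {"user": 2, "amount": 3.50}]:
    decoy.score(event)

print(len(decoy.archive))  # 2
```

Because the decoy satisfies the same interface as a real model, the surrounding streaming system doesn't need to treat it specially; it just subscribes to the same inputs.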

Ellen Friedman: 38:07 And that's useful because, in addition to having snapshots of raw data and snapshots of training data, you can actually say: at this point in time, this is exactly the data that this model saw in production. Especially if a problem arises, for forensics, that may be a really useful thing to have. Now, this is a pretty safe thing to do in a streaming microservices environment because you have good isolation. It might not be a great idea if you're building your system and deploying it on a different file system where it could interfere with what's happening in production.

Ellen Friedman: 38:47 We have a number of resources. I'm gonna whip through these really quickly, just to let you know what's available; all of these links will be emailed to you. First, five short books written with Ted Dunning, all published by O'Reilly. One is Streaming Architecture, which talks about how to build a streaming microservices system. Machine Learning Logistics is an example of that kind of architecture applied to machine learning. Our most recent book, AI and Analytics in Production, talks about a lot of what goes into making a successful large-scale production system, whether it's the data, the organization of people, how they use DataOps, and so forth. It goes into every aspect of where we see people with good habits that are taking them into production successfully. We have two older books that talk about very simple but incredibly powerful ways to build really foundational aspects of machine learning: the first is how to build a simple recommendation system; the other is how to build an anomaly detection system. These are powerful techniques that are being used by a number of companies in production.

Ellen Friedman: 40:04 These additional resources are new. Carol McDonald, with Ian Downard, has recently written a book, which MapR makes available, about working with Apache Spark. Another new one is a buyer's guide to AI and machine learning; I made some contributions to that, along with data scientists like Joe Blue and a number of other people. Sam Charrington just published this book on Kubernetes for machine learning, deep learning, and AI. All of these books are available via MapR for free. Here's a link to a number of excellent blogs by Ian Downard that talk about data preparation and data management in various machine learning systems. And I've already made reference to a whiteboard walk-through video and blog about how to get access to files, tables, and streams, and how to do fine-grained control of those systems using MapR volumes.

Ellen Friedman: 40:58 And this last reference is to my previous webinar, How to Get Value from AI and Machine Learning. Finally, MapR offers on-demand training for free, and I have a new course there called Introduction to Artificial Intelligence and Machine Learning that you may enjoy. Please help to continue to support women in technology; this is not just good for women, it's also good for society. And from one woman in high tech, I say thank you very much. This is my contact information. We've run over, but if we wanna take a couple of questions, should we go ahead with that?

Stephen: 41:35 Yes, thank you, Ellen. We did get a couple of good questions. Yeah, we are over time, but we will go through some of these questions now. Feel free to stick around if you do have the time. So the first question is-

Ellen Friedman: 41:48 Oh, and just a quick reminder, I'm sorry, Stephen. If you have submitted questions, or you have additional questions that we don't get to right now, we will still answer those for you via email.

Stephen: 41:59 That is correct. Okay, so the first question is, is anybody actually doing this?

Ellen Friedman: 42:05 That's a great question. And, yes, absolutely. These systems ... and by "this," I assume you mean this whole process of data versioning. But, yes, absolutely, and there is more than one way to do it. People are also using snapshots specifically; some of our customers are using snapshots, as part of the MapR dataware system, for doing this. It's not the only reason to use snapshots, but it is a very convenient way to handle the very large burden of tracking data in big machine learning systems.

Stephen: 42:43 Awesome. Okay, we have a second question here. I have been deploying microservices for non-machine learning systems. Is data versioning needed?

Ellen Friedman: 42:53 That's an interesting question, and it might be a useful thing to do in any system; being able to go back and reconstruct what you've done is valuable anywhere. But in the most basic sense, no: I think this type of data versioning is not essential for other systems. It is essential for effective machine learning. People may be doing machine learning without data versioning, but they'll be in a world of hurt very quickly when they try to go back and revisit or reconstruct what they've done.

Ellen Friedman: 43:27 And, again, just a reminder: it's not a matter of just keeping track of what you've done; you need a way to actually preserve the data as it was used at a particular point in the process: the data that was used for training, and the raw data that was used to produce that particular training data. Remember that raw data is going to change over time. New data is coming in, somebody goes back and makes an adjustment to the data, and you can't control all of that because these are big systems shared with other people. So this type of data versioning is particularly needed in machine learning. That's why snapshots are such a good way to do it: at very low cost and with very little effort, they give you a point-in-time version of that data, without having to copy it, so that you can rebuild the system.
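As a toy illustration of that point-in-time idea (this is plain Python with a hypothetical `VersionedStore` class, not the MapR snapshot mechanism): a snapshot records a read-only view of the data as it stood, without copying it, so later changes to the live data don't alter the version a model was trained on.

```python
class VersionedStore:
    """Copy-on-write store: snapshots are cheap references, not copies."""
    def __init__(self):
        self._data = {}          # current live view
        self._snapshots = {}     # name -> frozen point-in-time view

    def put(self, key, value):
        # Copy-on-write: build a new dict rather than mutating in place,
        # so existing snapshots keep pointing at the old, frozen state.
        new = dict(self._data)
        new[key] = value
        self._data = new

    def snapshot(self, name):
        # A snapshot is just a reference to the current state, not a copy.
        self._snapshots[name] = self._data

    def read(self, name=None):
        return self._snapshots[name] if name else self._data

store = VersionedStore()
store.put("record1", "raw value")
store.snapshot("training-2019-01")     # version used to build training data
store.put("record1", "revised value")  # the raw data keeps changing

print(store.read("training-2019-01")["record1"])  # raw value
print(store.read()["record1"])                    # revised value
```

Real snapshot systems do the copy-on-write at the storage-block level, which is what keeps the cost so low even for very large data sets.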

Ellen Friedman: 44:23 And remember again, here's a bit of a different idea if you haven't done machine learning. While machine learning is, in a way, a new form of coding, traditional programs can be reconstructed without the data: the program isn't changed by the data it reads. With a trained model, by contrast, the model, the script, the program, is actually modified by continual exposure to the training data. So the data and the script together are what produce that trained model. That's why data versioning is absolutely essential for a really effective machine learning system.
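A tiny illustration of that point, using a deliberately trivial "model" (fitting a mean): the exact same training code, run on two different versions of the data, yields two different models, so reproducing a model requires the exact data, not just the code.

```python
def train(data):
    """Trivial 'model': learn the mean of the training data."""
    return sum(data) / len(data)

data_v1 = [1.0, 2.0, 3.0]          # snapshot of training data, version 1
data_v2 = [1.0, 2.0, 3.0, 10.0]    # raw data changed; new records arrived

model_v1 = train(data_v1)
model_v2 = train(data_v2)

# Identical code, different data, different model:
print(model_v1)  # 2.0
print(model_v2)  # 4.0
```

With a real model the "parameter" would be millions of weights rather than one number, but the dependence on the exact training data is the same.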

Stephen: 45:03 Thank you, Ellen. Okay, so the third question. Does this approach work with machine learning tools such as TensorFlow?

Ellen Friedman: 45:11 Absolutely, and TensorFlow is a wonderful tool, and very popular. But keep in mind that the systems we're talking about are foundational. That's part of why dataware is so essential: it provides capabilities that are required for all of these systems, and it cuts across them. Whatever different system, whatever machine learning tool, whatever algorithm you're using, you can still use these same methods.

Ellen Friedman: 45:41 By the way, that is also really nice for people who are coming into machine learning work who themselves may not be data scientists. They may not be the person who is actually building the algorithm or tweaking the model, the script, the parameters, those knobs, adjusting the machinery piece. Remember how small that little red dot was in the larger scheme. But the people who are coming in with data engineering skills are the ones helping to build those pipelines for data, to build the training data, extract features, and so forth. Those are skills that are highly applicable across all these different systems, even for very different kinds of machine learning tools or different algorithms.

Stephen: 46:30 Thank you. And the last question we have time for today is could I just make copies of my data for versioning?

Ellen Friedman: 46:38 Absolutely, you don't have to use snapshots; there are a number of different ways to do this. But that may become difficult. It may not be very practical because, in some cases, these data sets are very large, and also, as we said, you may have many different versions, many different models, many different runs. You certainly could do it by making copies, but that becomes somewhat burdensome, and snapshots just make all of this really feasible and much less expensive in terms of both effort and storage usage.

Stephen: 47:16 Okay, awesome. Thank you, Ellen, and thank you, everyone, for joining us today. That is all we have time for. I just want to repeat that we will be sending out a recording of this webinar and the slides, including the links at the end of the slide deck, for your reference. You can also find more information at. Thank you again and have a great rest of your day.