Why Evaluation of Machine Learning Is Harder Than You Think and What to Do About That


Ellen Friedman, PhD

Principal Technologist, MapR Technologies

Evaluation of machine learning models and the impact of these learning projects on serious business goals is difficult, but without valid evaluation, how can data scientists build and improve effective learning systems? And how can business leaders, product managers, architects, data engineers and executives who need to make decisions about machine learning do that well if they have no understanding of how to evaluate the success of a machine learning system?

There are fundamental approaches that can improve the quality of machine learning evaluation, and they are important for data scientists and non-data scientists to understand. They also aren’t what you learned in a machine learning class or Kaggle contest. Join Ellen Friedman, Principal Technologist at MapR, for an on-demand webinar on how to do better evaluation of machine learning models and projects.

Without diving into scary math and all the technical details, Ellen will explain on a conceptual but still sound level why evaluation is challenging and will describe some key approaches to address these problems. These approaches will be grounded in real world examples across different industries.

This webinar also serves as a useful prelude to a more technical, in-depth presentation on online evaluation of machine learning models that data scientist Ted Dunning, CTO at MapR, will deliver in a few weeks.


Ellen: 00:00:08 Hello, and good morning or good afternoon or evening depending where the audience is. Thank you so much for joining us today. I thought twice about naming our Webinar talk about something that's hard. I was hoping that wasn't discouraging, but the evaluation of machine learning projects, AI projects actually is difficult. And I think the first step to make it easier is to recognize that it's not a simple thing and put it in the right context. So I'm just going to give you a few conceptual tips today about that.

Ellen: 00:00:48 And then, I think probably in May or shortly thereafter, there's going to be another Webinar by Ted Dunning and he's going to do a deep dive into the technical side of machine learning evaluation. So with that in mind, let's dive in. This is my contact information. I'm Principal Technologist here with MapR and also a committer on the Apache Drill and Mahout projects. I've written a number of short books for O'Reilly and I'll repeat this contact information at the end of the talk.

Ellen: 00:01:19 So the question today is how do you evaluate your machine learning systems? Or to put it another way, how can you tell if your system is a winner? Well, it turns out it's not a simple question. It's not a simple report card. This isn't a matter of just running a model, doing a task like running a race where you see if you won the race or measure your time; it's much more complicated than that. Let's take two very, very simple examples. Compare these two different machine learning situations. Suppose you want to predict the weather or you're basically trying to predict people's preferences. Now in terms of the types of modeling and so forth that you do, these are very different situations, but there are other differences as well in terms of how you think about evaluation.

Ellen: 00:02:08 In the case of predicting the weather, it's not that that's simple to model. But it does have one aspect of simplicity, which is that there will be a specific outcome. If you predict some aspect of the weather for tomorrow or next week or the week after, you'll know how accurate your prediction was. If you wait a few days, you'll find out how well your model was performing. And you can do the same thing by testing against historical data and so forth. But in the case of predicting people's preferences, it's much harder to know if the model actually got it right. You can test to see if a model is running, but is it accurately predicting people's preferences? And part of the question is the complication of the question itself.

Ellen: 00:02:55 Say you're doing this for an online shopping site. Which people are you trying to predict preferences for? You have a mixed audience. Of course you've tried to personalize it, but can you tell by the results if people's behavior was directly a result of your recommender model? Can you actually tell what the impact is? So, there certainly are ways to evaluate this, but it's not as simple. There isn't just a single defined answer. And these situations become much more complicated in other types of machine learning and recommendations.

Ellen: 00:03:34 There's also the question of whether the model performance, the model behavior is happening in a timely manner. That really matters for things like, say, an autonomous car; it's making a huge number of decisions in real time. And even there, is there really a correct answer? Well, say a car is about to turn right. If you are laying out a path for the car through some intelligent system, it can know whether a right turn at that point is on route. But is a right turn in that moment the right decision? Well, that partly depends on whether there's a car or child standing in the street right where you're going to turn. And so a lot of that again becomes contextual, and even contextual in real time.

Ellen: 00:04:24 It is a complicated thing to say how you evaluate machine learning as the whole project, how you evaluate the specific behaviors of an intelligent system, how you evaluate the performance of the models themselves.

Ellen: 00:04:40 Now in the case of predicting the weather versus, say, predicting people's preferences to build a recommendation system, there's another thing that's different, and that is when you make a prediction about the weather, it doesn't actually change the weather. It's going to be whatever it's going to be. But that's not true in the case of building a recommender. This is one example, and there are many others, in which the learning system itself is actually interacting with the world that it's making decisions about.

Ellen: 00:05:11 And so, again, if we use the example of a recommendation system, say for online shopping, a recommendation system for people watching videos or listening to music, what you offer them changes their experience, and their behaviors in turn feed back and change the model. And so it isn't just that new data makes the model learn, but the behaviors of the model and the actions that are taken based on that actually do change the audience, and that in turn changes the data. So, that's a very complicated interaction that's not true in all machine learning situations.

Ellen: 00:05:55 So really we're trying to get down to the question of, "If you're looking at your system, if you're looking at your model, is it good?" And most importantly, the better question is, "Is it good enough?" By that, I don't mean that it's not important to have high quality or that you don't strive for making really efficient or accurate systems. But all this should be done, again, within the constraints of what your resources and your timing are, and what's important, what actually matters for the goal that you're using this machine learning system for.

Ellen: 00:06:32 As it turns out, there is a tradeoff, not surprisingly, between the effort that you put in over time and how model performance changes. Or you can even translate that into the value that you're getting from the behavior of that model, the value you're getting from that overall project. And for most projects, I see it's a very typical set of curves. The start is the lower curve, the solid line there representing effort. Over time the effort can actually increase [inaudible 00:07:08] without the same kind of response in terms of performance. So at the left side of this graph, you see that model performance goes up very, very quickly in response to the effort being put into the project. But then you see that model performance begins to level off. Another way to say that is the value you're getting out of that additional effort becomes less and less per unit of effort.

Ellen: 00:07:36 At some point there's going to be a crossover where the effort isn't really worth it in terms of the additional value of the additional performance that you get from that effort. Where is that crossover point? Well, that really depends on you, your resources and what it is that this machine learning system is trying to accomplish. If you have an intelligent system that is providing feedback, say for a medical procedure, for medical diagnostics, or to try to put somebody on Mars, it may be that that additional very small percentage of added performance or added accuracy is actually worth the much larger effort that it takes to move something from already very good to even better. But there are other situations, and many more of them.
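The crossover idea described here can be made concrete with a small sketch. Everything in it is an assumption for illustration: the saturating performance curve, the function names, and the dollar figures are hypothetical and do not come from the talk. The point it shows is only that the same performance curve gives a different stopping point depending on how much each increment of performance is worth.

```python
import math

def model_performance(effort: float) -> float:
    """Hypothetical saturating curve: fast early gains that level off."""
    return 1.0 - math.exp(-0.5 * effort)

def find_crossover(value_per_point: float, cost_per_unit: float,
                   max_units: int = 50) -> int:
    """Return the first unit of effort whose added value is below its cost."""
    for unit in range(1, max_units + 1):
        # Marginal performance gained by spending this one extra unit of effort.
        gain = model_performance(unit) - model_performance(unit - 1)
        if gain * value_per_point < cost_per_unit:
            return unit
    return max_units

# Same curve and same cost of effort, but very different value of performance
# (illustrative numbers only).
retail_stop = find_crossover(value_per_point=100.0, cost_per_unit=5.0)
high_stakes_stop = find_crossover(value_per_point=10000.0, cost_per_unit=5.0)
print(retail_stop, high_stakes_stop)
```

With these made-up numbers the high-stakes project justifies many more units of effort than the retail one, which is the "good enough depends on the goal" argument in numeric form.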

Ellen: 00:08:28 Again, I'll use an example of, say, online retail shopping, online recommendations, where that additional effort is to make, say, the behavior of a recommendation system better, a better predictor of people's taste. That huge amount of effort to make it better once it's already very good is often not really worth it; you often don't actually see a reasonable return in terms of the value that it brings back to the business. A classic example of this is many years ago with Netflix. They had launched their recommendation system, and it was working extremely well. And then they did a very clever thing: they had a very highly publicized contest with a huge cash prize for anyone who could improve their recommendation system by at least 10% in terms of its performance. And within the first just very few weeks, I don't remember the exact number, but it was literally just a few weeks, a couple of the teams actually turned in models that were performing very much better.

Ellen: 00:09:36 But the target that Netflix had set up was to increase performance by 10%. And what they were actually seeing were people who were getting up around 9%, which is a significant step. But they didn't get high enough to hit the prize for another two years. And in the end it was a team that was actually a combined team. I think it was actually hundreds of people who did finally push it over that mark. But if you think about it, was that tiny bit of additional increase past the first few weeks, the next two years of effort, was that really worth it? Well, I think it was worth it for Netflix; it's great publicity. But in terms of the value that they got back from that additional level of performance, probably not. And so that's a classic example. You would want to look at this in terms of how you evaluate your systems because part of evaluation is to say, "Can you really recognize what success looks like?"

Ellen: 00:10:35 Now let's move on to a specific example, and I do want to remind you, there's also the issue of, say, even how fast or how accurately the model performs, if it's ... you'll have defined goals. And when it gets beyond what actually has a real impact on what you're doing, you could think of that as just bragging, right? Okay. So the other example I wanted to show you is, let's take ... this is a modification of a real world story from some years ago, but there was a company offering online retail discount offers. In other words, that company had a recommendation system that was personalizing offers and had built a coupon collector. The coupon collector would reach out to various vendors who were offering discount coupons on products, making those coupons available but targeting them to the right audience, so the coupons would have a good effect and encourage purchases.

Ellen: 00:11:38 That was the basic system here. This recommendation system was basically scoring the coupons that were available. In other words, it was scoring the items for which there were discount coupons in terms of a coolness factor, for how likely they were to induce a purchase in the targeted audience, and that's how the system works. And so they built this recommendation system, they did various types of evaluation of the model itself and it looked pretty good. It didn't look absolutely fabulous, but looked pretty good. There was a good correlation between the probability of purchase and the score, shown by this graph. And you see it starts to [inaudible 00:12:18] off a little bit at very high scores. But it looked fairly reasonable. So this was launched into production.
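One way to check the kind of relationship described here, between a model's score and the observed probability of purchase, is to bin the scores and compute the purchase rate per bin. The following is a minimal sketch under stated assumptions: the record format, the function name, and the tiny hand-made data set are hypothetical and are not the company's actual evaluation.

```python
from collections import defaultdict

def purchase_rate_by_score_bin(records, n_bins=5):
    """Group (score, purchased) pairs into equal-width score bins.

    records: iterable of (score in [0, 1], purchased as bool).
    Returns {bin_index: (purchase_rate, number_of_items_in_bin)}.
    """
    shown = defaultdict(int)
    bought = defaultdict(int)
    for score, purchased in records:
        # Clamp so that a score of exactly 1.0 falls into the top bin.
        b = min(int(score * n_bins), n_bins - 1)
        shown[b] += 1
        bought[b] += int(purchased)
    return {b: (bought[b] / shown[b], shown[b]) for b in sorted(shown)}

# Hypothetical logged outcomes: (model score, did the user purchase?)
records = [
    (0.10, False), (0.15, False),
    (0.30, False), (0.35, True),
    (0.50, True), (0.55, False),
    (0.70, True), (0.75, True),
    (0.90, True),
]
rates = purchase_rate_by_score_bin(records)
for b, (rate, count) in rates.items():
    print(f"bin {b}: purchase rate {rate:.2f} over {count} items")
```

Reporting the count next to the rate matters: a bin can look well behaved while holding very few items, which is a separate question from whether scores and purchases correlate.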

Ellen: 00:12:25 And the question then again becomes, was that a winning system? And it turns out in this situation, no, and surprisingly, it really wasn't. The behavior of the model, the recommendation system, did not have the impact that they had hoped it would have on the business in terms of lift in purchases and so forth. And so a different way to say how you would evaluate that system is, rather than looking at the behavior of the model itself, to look at the overall impact that the system has on the goals it was aimed towards. Here it was supposed to increase purchases, and it really didn't increase them. Now, do you just throw out that system and say, "Well, machine learning wasn't worth it"? They didn't throw out the baby with the bath water.

Ellen: 00:13:13 Then they went back and began to examine more carefully the behavior of the model to try to understand what was going wrong, to look at the data, to look at the model. And one of the things that you can look at is this density distribution for the scores, the scores rating these items for which there are discounts and how likely they are to encourage a purchase. But what you really want to see is out here in the very long tail, items at the very high scores. Because that's really where you're going to get the best lift above things that are obvious, and here it tapers off very, very quickly. And so it turns out that's really where the problem lies. It's not that the recommendation system was doing poorly in terms of recommending items that really would be attractive to the people who were being targeted.

Ellen: 00:14:12 It's that there just wasn't enough volume; there weren't enough items being recommended for it to actually have a significant impact on business. So, now we can take a step back and say, well then, if the model seemed to be performing okay, why was that problem there? Well, sometimes the trick is to look at something that's very familiar and look at it in an entirely different way. And in this case it turns out it was actually a matter of time, or timing, in the way the whole project was set up. So, let's go back to this diagram. What was happening is, say a vendor would publish their discount coupons at one o'clock in the morning, and so the coupon collector would be set up to actually collect information about that coupon shortly after that, to give enough time for the system to update.

Ellen: 00:15:04 So, if a vendor was offering their coupons at 1:00am, the coupon collector was set to collect coupons at about 1:15 in the morning. And that's how the system had been set up. When they went back and started really examining what was going on, it turned out that many of these vendors had changed the time at which they were making coupons available. They were making them available hours earlier. And what was basically happening, this is a human behavioral thing; it's not a function of a problem with the model itself. It's like going in, say, on the big shopping day after Thanksgiving, at least in North America: people come in very late in the day and there are certain items that they want to get at a great discount, but they find that they were all sold out at nine o'clock in the morning. Basically the same thing was happening here. Some of the best items had disappeared before the coupon collector ever picked up the discount coupons.

Ellen: 00:16:00 So the recommender was doing a pretty good job on the inventory that it was able to look at, but that inventory wasn't good enough; there wasn't enough of the good inventory to have a real impact for the business. And so all they had to do was change the timing of when they were basically tapping and collecting this data. So, it's an odd latency question. It wasn't a problem with latency in terms of the behavior of the model and the recommender itself and how fast it returned a response. Although that can be a problem in another situation, in this situation it was the timing or the latency of getting to the coupons to start with. So this is an unusual situation specialized for this particular project, but there are some lessons here in terms of how they dealt with it that you could, I think, carry into the projects that you do.

Ellen: 00:16:55 One is to do monitoring, to go back and examine different aspects of the system, ask yourself what behaviors you would have expected, and then keep retesting and re-monitoring as you change various aspects, various parameters in the system. In this case, they changed the timing when they were collecting the coupons and what they saw was a much improved result. You can see this long tail; there's more volume, there's more density now. So, the system is working pretty well, making good recommendations, and there's enough volume of the items being recommended to actually have a positive impact on the business. This model delta diagram is just a little more sensitive way to look at the comparison of those two bar graphs that we just looked at. This lower orange or red colored line represents the less well performing system from before. And when this change was made in terms of timing, you can see that they get quite a bit of lift.
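The before-and-after comparison of the score distribution's tail can be monitored with a very small check: how many items score above some cutoff, and what share of the inventory that is. This is a sketch under assumptions; the 0.8 threshold, the function name, and both score lists are hypothetical stand-ins, not data from the project in the talk.

```python
def high_score_volume(scores, threshold=0.8):
    """Return (count, share) of items scoring at or above the threshold."""
    if not scores:
        return 0, 0.0
    high = sum(1 for s in scores if s >= threshold)
    return high, high / len(scores)

# Hypothetical scores for the inventory the collector saw with the old timing
# and with the changed timing.
scores_before = [0.2, 0.3, 0.4, 0.5, 0.55, 0.6, 0.7, 0.85]
scores_after = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.82, 0.85, 0.9, 0.95]

before_count, before_share = high_score_volume(scores_before)
after_count, after_share = high_score_volume(scores_after)
print(f"before: {before_count} high-score items ({before_share:.1%})")
print(f"after:  {after_count} high-score items ({after_share:.1%})")
```

Tracking a number like this over time would flag a thinning tail even while per-item model quality still looks acceptable.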

Ellen: 00:18:00 So one of the lessons is this, and this really has an impact in any system, is that domain knowledge matters. It isn't just the expert who's running the model, who actually just looks at the performance or parameters of the model itself, but it's the larger question of whether you have set this in a context that makes real business sense, whether you have a way to take action on the results. Is all this happening with timing that makes sense in terms of whoever needs to interact with that model? Do you understand the data that you're dealing with well? Do you understand the business or the context or the scientific research context in which you're building that model? Are you asking the right questions and do you know how to recognize success? Do you know how to recognize a result when you do evaluation?

Ellen: 00:18:59 So that domain knowledge matters as much as anything else. In this case, some of the steps that they took can be good reminders for what you're doing as well. They won't apply to every project but they would apply to a lot. A lot of this is generalizable to other situations. Evaluate as quickly as possible: the faster you find out what's going on, the faster you can get good results, and the closer you are to the condition of the world at the time that you are evaluating, so you can understand what's happening better.

Ellen: 00:19:40 In their case they needed to see that the score and probability of purchase correlate as tightly as possible. Think about what measurement you would have in your own system to say, at the most basic level [inaudible 00:19:52], is the model doing what it is designed to do? And then expand it a little bit, again, in terms of context. The situation in this context was, did they have enough things to recommend to actually have an impact on the business? And in the case of recommendation, obviously this needs to be spread among many users. You need to have a large audience, you need to be able to recommend to a large audience.

Ellen: 00:20:20 What are some of the other things that could improve evaluation and improve what evaluation does in terms of the end goal that you're after?

Ellen: 00:20:29 One of the things to keep in mind is that machine learning is iterative, and as a result evaluation is also iterative. This diagram is just a reminder that you have steps where you're training a model, you're training many models. This is a simplified diagram; usually you're using tens or dozens, hundreds of models at times, not just building one model and hoping it will be successful. You train a model, you evaluate it, maybe you go back and retrain it. At some point, through evaluations of the behavior of the model itself, by whatever parameters you've decided to test, you decide it's good enough and you go ahead and deploy into production.

Ellen: 00:21:11 But evaluation should not stop at that point. You should continue to monitor what's going on, monitor the behavior of the model itself, but also monitor the impact that it's having on the system in which you're using the machine learning. And you're going to continue to train new models, or retrain those models and redeploy them. So, this idea of evaluation should be happening for slightly different purposes but happening throughout the entire lifecycle of the system. It really is never over. So, again, this idea that it's not an audit, it's not a matter of building a successful model, doing a test, getting a report card back and saying that it did well.

Ellen: 00:21:53 Now people who are used to implementing machine learning systems are probably pretty well aware of that need. And so one of the things that I want to remind you is that you may need to communicate that to other people who are making decisions about your resources, your team, setting goals for what the machine learning system that you're building is supposed to do. They need to understand that evaluation isn't just a simple, successful score and you're done. They need to understand that these systems are always in flux and need to continue to be monitored and updated.

Ellen: 00:22:34 Now, another issue about what could make evaluation difficult, and then we'll look at what might make that probably easier to do, is that you're not only looking at hundreds of models, but you're dealing with a lot of data. One question is, when you go back and retest a system, do you actually know what you did? Have you kept good records? You probably have versions of your model, versions of the code, part of your model, in something like GitHub; that's pretty obvious. But keep in mind that a trained model isn't just code, it's also the data. The data interacting with the model actually changes the model; this is a difference between coding for machine learning and coding in traditional development.

Ellen: 00:23:23 And so you also need to know what data that model looked at, and I mean exactly what data. And so you need to do data versioning in parallel with versioning code. I talked about that in more detail in the last webinar. That's one of the things that you need to track. Not only to know what data you used, but to be able to go back and access it again and to do that in a way that's not cumbersome. So that kind of preservation of data, that kind of management of who can get access, whether it's protected from being changed or updated by other people, is a key part of evaluation.
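Recording exactly which data a model saw, next to the code version, can be illustrated with a content fingerprint. This is only a minimal sketch of the bookkeeping idea and an assumption on my part, not the mechanism described in the webinar: the function names, the record fields, and the placeholder code version string are hypothetical, and a hash tells you whether data changed but does not preserve the data itself.

```python
import hashlib
import json

def fingerprint_dataset(rows):
    """Return a SHA-256 hex digest over the rows, in order."""
    h = hashlib.sha256()
    for row in rows:
        # sort_keys gives a stable serialization for dict rows.
        h.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()

def training_record(model_name, code_version, rows):
    """Pair a code version with a fingerprint of the exact training data."""
    return {
        "model": model_name,
        "code_version": code_version,  # e.g. a commit id (hypothetical value below)
        "data_sha256": fingerprint_dataset(rows),
        "n_rows": len(rows),
    }

rows_v1 = [{"user": 1, "item": "a", "bought": True},
           {"user": 2, "item": "b", "bought": False}]
rows_v2 = rows_v1 + [{"user": 3, "item": "a", "bought": True}]

rec_v1 = training_record("recommender", "commit-placeholder", rows_v1)
rec_v2 = training_record("recommender", "commit-placeholder", rows_v2)
print(rec_v1["data_sha256"] != rec_v2["data_sha256"])  # same code, different data
```

Two records with the same code version but different data fingerprints describe two different trained models, which is the distinction the paragraph above draws between code versioning and data versioning.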

Ellen: 00:24:02 If you can't evaluate against the data that you need to look at, you've taken away a lot of the meaning of evaluation. And you want to monitor everything, just monitor as much as you can, because you never know which aspect of the system it's going to be [inaudible 00:24:17] that's going to make the big difference in performance. And so connected to all of that is the challenge: with so many models, how do you compare them with each other? How do you compare them to their own behavior at an earlier time? Can you repeat what you've done? And something that can make that much easier to do is really effective use of containerization of models. Having a containerized environment certainly makes it more convenient to deploy the model wherever you need it and to do that very quickly.

Ellen: 00:24:54 And I think people usually think of containers first and foremost as a convenience. And they certainly are, especially when you're dealing with so many models. But in this context of evaluation, containers and containerization of machine learning applications have some other strengths. And probably I think the most important one is that it gives you a repeatable, predictable, and definable environment. So, let's break it down to these different stages where you're doing evaluation. When you're looking at the early stages of training and evaluation, you're going to be comparing ... You're first and foremost looking at each model, how it performs, but you're also doing model to model comparison. Which models are behaving the best at that moment? Which ones do you want at that moment to deploy into production?

Ellen: 00:25:51 And so having an environment that you can define, where you know exactly what the training data was and you were running it with all of the dependencies in exactly the same way as the other models to which you're comparing it, is a very valuable thing. This diagram is just suggesting that you can run multiple containers. The colored blocks here represent all of the program and dependencies of a container image. The upper part with the lines of keys are the keys and codes that are needed at the time of deployment. You can run multiple models on the same server, on the same system. The really good point of context there is that load balancer.

Ellen: 00:26:36 So the other aspect, besides a defined environment so that you can do better model to model comparison, a repeatable environment, a defined environment so that you can test a particular model under different conditions to see how it's going to behave, is also this idea that you can be running multiple different models under different conditions at the same time against the same data. And you can do that because containers provide valuable isolation. They isolate the behavior of the models from each other, and you can even be doing some testing at the same time as you're running things in a production system and do that fairly safely. And so that gets back to the idea of setting up your basic design for machine learning and machine learning evaluation, deployment of models, as a microservices style system. We see a number of people taking advantage of doing this as streaming microservices where they have a stream transport layer as the lightweight connector between services and microservices.

Ellen: 00:27:43 But however you set it up, having the flexibility and isolation that a microservices approach gives you, and making use of containers to make that run well, can be very valuable in general for machine learning but especially for the case of evaluation. Now one of the things I call your attention to in this slide is, notice that there is a dataware layer there between the containers and the metal, the machines on which they're running. That dataware layer is very important if you want to be able to run containerized applications that are stateful. The containerized applications are going to need access to the data as input data, but they also will have state data that needs to be persisted. Remember that the lifetime of data generally should long outlast the lifetime of a container; containers are spun up and down.

Ellen: 00:28:43 You would lose data if you're storing it in the container. You lose that lightweight nature of the container. So having a dataware layer that is a data platform to which you can persist data, and data in different forms, as files, as streams, tables, whatever you need to persist, is a very important aspect of running containers in general, and particularly for this issue of using containers when you're doing evaluation of machine learning models. So you need a dataware layer to orchestrate the data, in the same way that you use some framework such as Kubernetes to orchestrate the containerized applications themselves.

Ellen: 00:29:24 I'm going to use the example of the MapR data platform, which serves as that dataware layer. The MapR data platform can actually persist data from containerized applications in the form of files or streaming, a stream transport, event by event, logs, or as other data, even as databases. It's doing all of that within a single system, all in the same platform, within the same code. And it's actually orchestrating that data, managing the data, in the same way that a framework such as Kubernetes is orchestrating the running of the containers themselves. So this is a wonderful duality, a really strong parallel, not only for machine learning, but for other systems as well where you want to make use of containers.

Ellen: 00:30:23 It's also an example that you can make these systems run much more efficiently if you don't have to program all of the logistics of handling models, of handling data, into the programs themselves. A lot of those capabilities should be handled natively at the level of the data platform, and that can make a big difference and basically separate the concerns of data scientists, of data engineers, from people who are doing system administration. That makes the systems more efficient as well. Now one of the aspects that we talked about that can make machine learning evaluation more effective, more accurate, is to be able to keep track of what you've done, to actually know what data you have used, and to be able to do data versioning in the same way that you version code on GitHub. And having something like snapshots that are available through MapR makes data versioning incredibly easy to do.

Ellen: 00:31:23 That's also an aspect of being able to run these systems as a multi-tenant system, and that fits the kinds of patterns that we've been talking about in terms of the architecture for a big machine learning model, both in terms of development and production. So just keep in mind that you should look for capabilities at the data platform, that dataware layer, that handle a lot of these logistics natively instead of having to program them at the application level. This isn't just a matter of convenience. This makes your model more performant and it makes the results more accurate. You're less likely to introduce mistakes because a lot of that is being handled in a uniform way at a different level.

Ellen: 00:32:13 Now, I mentioned that having an approach using a microservices style, a flexible style approach where you have isolation between services, and making those lightweight connectors be an event stream is a very powerful approach. A capability that's good to look for is for that stream transport to have very capable geo-distributed replication. And when you have that, you can build these systems basically from the IoT edge, where you're collecting data from sensors, to on-premises data centers, to distributions. Maybe you're doing analysis in cloud, maybe your GPUs are on premises for deep learning models, maybe you're bursting out to cloud.

Ellen: 00:33:03 It gives you all the different options, and using stream transport as part of this system is a very efficient way to handle model deployment as well. In an earlier book called Machine Learning Logistics that I did with Ted Dunning and in some earlier Webinars, we've talked about a design based on streaming microservices called Rendezvous Architecture. But I just want to use it as an example here to remind you of some of the things that we've talked about with evaluation. So we'll give you a link at the end of the Webinar to get a free copy of that book so that you can read about that approach in detail.

Ellen: 00:33:42 But here I just want to call your attention to what we're representing here. The horizontal cylinders are streams of data. We're deploying models in the earlier stages. At the center of the diagram, you have a number of different models that are active at the same time; you're doing direct model to model comparison. The scores from that are being considered and various models are being, in a continuous fashion, rolled into production or rolled back out in a very fluid fashion. But the aspect of this that has to do with evaluation is that you're able to do model to model evaluation very accurately, again, with the assumption that these are containerized models. And notice up here at the top, one of the models we're calling Decoy.

Ellen: 00:34:26 Just to remind you, that's one way to address this challenge that I mentioned earlier, which is: do you know what data, training data for example, a model saw? What was the input data? What was the exact data? It's coming from generally the same data pool, but what was it at the time that you did the evaluation? It's like capturing a moment in time, and that's what this decoy model is set up to do. It looks like a model and acts like a model, but it actually doesn't take any action on the data.

Ellen: 00:35:02 The output here is just to archive the exact input data that happened at that moment in time. And that's something you can go back to for comparisons, for retraining, for referential training. That is a very powerful approach. Whatever system you set up, have something that actually archives data, exactly the data that was being seen by models while they were in training or while they're in production, while they're being evaluated.
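As a rough illustration of the decoy idea described above, here is a minimal Python sketch. It is not the Rendezvous Architecture implementation: the names `DecoyModel` and `fan_out`, the in-memory `archive` list, and the two toy models are all hypothetical stand-ins. A real system would presumably read from a stream and write the archive to durable storage.

```python
import json
from typing import Callable, Dict, List, Optional

Request = Dict[str, float]
Model = Callable[[Request], Optional[float]]


class DecoyModel:
    """Hypothetical decoy: gets the same input as the real models, archives it, scores nothing."""

    def __init__(self) -> None:
        # Assumption: an in-memory list stands in for durable storage.
        self.archive: List[str] = []

    def predict(self, request: Request) -> None:
        # Serialize immediately so later changes to the request cannot alter the record.
        self.archive.append(json.dumps(request, sort_keys=True))
        return None


def fan_out(request: Request, models: Dict[str, Model]) -> Dict[str, Optional[float]]:
    """Send one request to every model, the decoy included, and collect the outputs."""
    return {name: model(request) for name, model in models.items()}


decoy = DecoyModel()
models: Dict[str, Model] = {
    "model_a": lambda r: r["x"] * 2.0,  # toy model, for illustration only
    "model_b": lambda r: r["x"] + 1.0,  # toy model, for illustration only
    "decoy": decoy.predict,
}
scores = fan_out({"x": 3.0}, models)
```

Because the decoy sits beside the real models and receives the same request, `decoy.archive` ends up holding exactly what the models saw, which is what you would go back to for comparisons or retraining.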

Ellen: 00:35:36 So, that's pretty much the content that I wanted to cover for you today. But I'm going to leave you with a teaser. Not only is evaluation important for looking at how a model is running (Is it running? Is it meeting time constraints? Is it producing the results that you expect? Can you recognize success when you see it?), but also, what are some of the very powerful things that you can do to tweak the performance of models and then evaluate it and see that you get a better result? Ted Dunning is going to talk about this in much more detail when he does the deep dive into model evaluation in about a month. But I want to leave you with this idea. There are actually things you can do that make performance worse in order to make it better. And what we mean by that is, if you only went by offline evaluation of a model's performance, and you took this next step, you would say, “Well, it's no good, the model got worse,” and throw out the system.

Ellen: 00:36:35 But in fact this is something you can do that is enormously powerful in terms of improving the overall performance of a recommendation system. And this is a really simple technique. Many of you will have heard of it; it's called dithering. And the basic idea here is this: imagine that you're building a recommendation system for recommending music to people, and the recommendation system is personalizing these recommendations. This is all happening basically online, in real time.

Ellen: 00:37:11 So somebody comes to a site and as they're listening to music, other music is being recommended to them. And the click-through rate obviously is much higher above the fold. If they have to click down to a second page, they're less likely to see the recommendations. And so you want your recommender system to recommend the best choices in that top, I don't know, say 20 things that are being ranked. The ones that are ranked below that, it's not that they're terrible, but they're just not as strong.

Ellen: 00:37:41 But the key thing here, why dithering matters, is that the further you go down in the rank, the less likely anybody is to ever see it. And remember that unlike that weather prediction model, when you're doing recommendation, the interaction with the user actually changes the result. The act of running the recommendation system, of doing the machine learning, also changes the behavior of the machine learning system.

Ellen: 00:38:09 And so the system learns from what people do. If you never show them things that are below those top recommendations, even if those top recommendations are very good, your system will never learn from that. And I think the simplest example of why this is very powerful is to think about why somebody might be using a recommendation system. Often, in some cases, it's for selling things, to get an upsell, and in other cases it's just to keep a website sticky, so that you have people coming back, good return traffic, staying on the site.

Ellen: 00:38:41 And so keeping their interest up is very important, as it would be in this situation of recommending music. Suppose there's a new performer, a new song that's out, and it's going to be enormously attractive to people, but it's new so it doesn't have much of a history. In that case, your recommender in the traditional way is not likely to rank it high enough that the audience is going to see it. So they aren't going to be interacting with it much. So your recommendation system isn't going to learn that they like this new music.

Ellen: 00:39:16 So what you do with dithering is you take a few items that are further down in the ranks, that wouldn't ever get above the fold, wouldn't ever get into that top 20, and you select those and actually move them up into the upper layers of what's being recommended.

Ellen: 00:39:35 And that means in the moment, offline, it looks like your recommendation system is less accurate because you have some newcomers in there that don't have a known track record. But the result of stirring this up, basically stirring the pot, is that you usually end up getting a very good response from people. You quickly find out which of those new things are worth including in the mix. And so the overall performance of your system goes up a lot, even if the offline performance of the model itself looks worse. And that's one reason that online evaluation, which is even more challenging, is very important to be able to do. And that's a big part of the topic that Ted will be talking about next month.
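The re-ranking step described here can be sketched in a few lines of Python. This is only one simple way to do it, following the spoken description (swap a few lower-ranked items into the top group), and not necessarily what any production recommender does. The function name `dither` and the parameters `top_k` and `n_promote` are hypothetical, and items are assumed to be unique.

```python
import random
from typing import List, Optional, Sequence


def dither(ranked: Sequence[str], top_k: int = 20, n_promote: int = 2,
           rng: Optional[random.Random] = None) -> List[str]:
    """Swap a few randomly chosen lower-ranked items into the top_k positions.

    Assumes every item in `ranked` is unique.
    """
    rng = rng or random.Random()
    head = list(ranked[:top_k])   # what users are likely to see
    tail = list(ranked[top_k:])   # what users almost never see
    n = min(n_promote, len(head), len(tail))
    promoted = rng.sample(tail, n)            # items to pull up
    slots = rng.sample(range(len(head)), n)   # positions to give them
    for slot, item in zip(slots, promoted):
        demoted = head[slot]
        head[slot] = item
        tail[tail.index(item)] = demoted      # the displaced item takes the old place
    return head + tail


ranked = [f"song_{i}" for i in range(50)]
shown = dither(ranked, top_k=20, n_promote=3, rng=random.Random(0))[:20]
```

With these settings, 3 of the 20 displayed items come from below the original top 20. That is the part an offline metric would score as a loss, and it is also the part that gives the system something new to learn from.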

Ellen: 00:40:18 So it turns out, just from real-life experience over the years, we've had a number of people say that trying dithering, which is a very simple procedure, made more difference in the results of their recommendation system than anything else that they did to the system.

Ellen: 00:40:35 That's one where the tradeoff of effort to return value is very, very good. So leaving you with those thoughts, our take-home lessons from today are that machine learning evaluation is more than just a simple report card. It's important for you to recognize that and to communicate it to the people who are setting your goals and controlling your resources. It's an iterative process and should go on even in production. You need to monitor in many different ways, and do this as quickly as you can, so that you have really good, fine-tuned information about your system ongoing.

Ellen: 00:41:16 But also keep in mind the domain knowledge, the overall goals, the sources of data. I started to say how the world may be changing, but that's wrong. It's that the world will change. The people, the public you're interacting with, will change. Data will change, and you need to know how to evaluate the effect of that data on models ongoing. Even a model that's performing very well may not perform as well as the world changes. You need ways to evaluate that and to be able to communicate that back.

Ellen: 00:41:51 A very typical place to start when you're trying to improve the performance of a model or a system is to look at the data. That's generally the first place to look. Do you have good quality data? Is this the right data for the question that you're trying to ask? And are you asking the right question? A lot of that again comes back to good domain knowledge. Have you set realistic goals about what success should look like, so that as you do evaluation, you know how to interpret the results of evaluation? And keep in mind that this combination of using Kubernetes to orchestrate containers, containerizing your machine learning applications, and having a data layer where you can persist data, with all of those operating together very efficiently, can make multi-model evaluation and ongoing evaluation of machine learning models much easier to do and much more accurate.

Ellen: 00:42:51 Now there are a number of free resources that MapR makes available through this website. I've written several books with Ted Dunning, and I'd call your attention to the most recent one, which is called AI and Analytics in Production. You'll have these links on the slide, but I think, David, you'll be sending these links out to people who signed up?

David: 00:43:13 Yeah. So after the event, we will send the link to the recording of the webinar, the slides from the webinar, as well as the resources that Ellen's about to walk through.

Ellen: 00:43:23 Okay. So I'll just walk through these real fast to call your attention to what you might want to look at, and you'll get the links later. Streaming Architecture is a very basic approach. It's a really strong and powerful thing, not just for machine learning. And we see more and more people adopting this approach because of the flexibility and the power that it offers. And this approach of Streaming Architecture is the underlying foundation under the Rendezvous Architecture for machine learning model deployment that I referred to earlier.

Ellen: 00:44:04 This bottom book here, Machine Learning Logistics, talks about lots of different aspects of handling the stuff outside of the algorithm, which really is what makes a difference in making the machine learning successful. And it goes into detail about Rendezvous Architecture. The first of the books we wrote, which is now four and a half, almost five years old, is this little book on the upper left about recommendation. But it is still a very powerful approach, very simple and used in highly sophisticated recommendation systems, and the approach itself is simple and accessible to anybody. The one on the right talks about some of those foundational issues that you look at when you're trying to do anomaly detection.

Ellen: 00:44:48 There's a new book, which is a guide to AI and machine learning. There's also a book about Kubernetes in the context of machine learning that was done by Han Carrington, and those are available through MapR as well. This is a list of just a sample of a few other resources: some other webinars, a blog that I wrote, this excellent webinar recording from Carol McDonald about streaming data pipelines and data science in the healthcare sector. MapR offers extensive on-demand training through the MapR Academy. I've drawn a circle around a new offering, which is an introduction to artificial intelligence and machine learning. This is a very basic introduction.

Ellen: 00:45:29 And if you are already doing machine learning yourself, it's something you may want to point other people to, people on your team that you interact with who are less familiar with machine learning. This is a good place for them to start. Please continue to support women in technology. This is not just good for women, it's good for society, and from one woman in tech, I say thank you very much for coming.

David: 00:45:54 Alright. You said you want to go? So at this time, we have a little bit of time for some questions. Just a reminder, you can submit a question through the chat window in the bottom left hand corner of your browser.

David: 00:46:27 I'm going to give you guys a couple of minutes here. Just a reminder, Ellen pulled up her slide with her contact info again, so feel free to shoot her a question directly as well. Give it. All right. I don't see any questions coming in, so if any questions do come in, yes, then we will be making the slides available in a follow-up email that we'll send out later today. All right.

Ellen: 00:47:31 So we have a question about the idea that data actually changes the model, and at the simplest level, I'm going to compare this to assumptions that people have about basic software development. I think in basic software development, obviously people have the idea that code is interacting with data. You have input data, the code takes some action on the data, you have output data, and you're going to persist that someplace, but the code itself is the same after you run it as before you run it. In the case of machine learning models, the code itself can actually change in response to the data. It's learning. It's adjusting parameters based on what it sees, and I really encourage you to come back.

Ellen: 00:48:27 I think we haven't gotten an exact date, but I think it's going to be mid to late May when Ted Dunning does a deep dive into machine learning evaluation, and he would also be able to explain this in more detail. In fact, I'll also have him follow up with you on this question because he can give you specific examples of code changing. But I think the idea is to remember that data is part of the model. When you're saving different versions of code in GitHub, you feel like you've got what you want, but if you don't know what data the model was exposed to, you've really lost half of the story. And so, oh, there's our surprise. So Ted Dunning just walked into the room. I'm actually going to have him jump into this question about how I mentioned that, in the case of machine learning, exposure to the data actually changes the code, the program itself.

Ted: 00:49:33 The machine learning process itself is the process of learning a small computer program from the data. And so what's happening there is certain parts of the program are being learned instead of coded. Is that what you're after?

Ellen: 00:49:54 Yeah.

Ted: 00:49:56 That aspect? And it changes the exact nature of the programming process. In traditional programming you only add data when you run a program, whereas this injects data early in the development process, and that changes the whole dynamic of how you manage them.
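To make the point about parts of the program being learned instead of coded concrete, here is a minimal Python sketch. It assumes a toy one-variable linear model trained by plain stochastic gradient descent. The function name `train`, the learning rate, and both data sets are invented for illustration. The training code is identical in both runs, and only the data differs.

```python
from typing import List, Tuple

Example = Tuple[float, float]  # (input x, target y)


def train(examples: List[Example], lr: float = 0.1, epochs: int = 200) -> Tuple[float, float]:
    """Fit y ~ w * x + b by stochastic gradient descent on squared error."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in examples:
            err = (w * x + b) - y
            w -= lr * err * x   # the "learned" part of the program
            b -= lr * err
    return w, b


# Same code, two different histories of data (both data sets are made up).
params_a = train([(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)])    # data follows y = 2x
params_b = train([(0.0, 1.0), (1.0, 0.0), (2.0, -1.0)])   # data follows y = -x + 1
```

The two parameter pairs come out close to (2, 0) and (-1, 1) respectively. Versioning only the `train` function would not tell you which of those two models you had, which is why the data the model was exposed to has to be tracked along with the code.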

Ellen: 00:50:21 Okay. And then we had a request to go back to dithering again and explain that in a little more detail.

Ted: 00:50:33 Happy topic.

Ellen: 00:50:36 Run back to those slides. And so I'm going to give, again, my very, very simple explanation of dithering, but if you want to go into a little more detail, we don't have a ton of time, but Ted is here and he's certainly the expert on it. The basic idea of dithering is the need for it. I'll say it that way. If your recommendation system is working well, the top-level choices are the things being ranked as the best match for people's preferences, and you're displaying this on a website, so people will only see things that they like, but that they already like. And so initially that works very well, but people can get tired of seeing the same things, and more importantly, your system isn't learning about any new things that are coming in.

Ellen: 00:51:33 And so one way to improve that, and the goal of dithering, is to basically mix it up, to pull in a few items. So if you're looking at the top 20 or the top 50 items that are being recommended, you pull in a few items that, at least according to an offline evaluation, don't appear to be as likely to predict preference. They're not ranked as high, but you arbitrarily mix a few of them into that top group of rankings, and this provides fresh material for people to interact with. It also provides fresh material, in the behaviors of people toward those items, for the machine learning recommender itself to learn from. And so it's a way of bringing in new ideas, new content, and mixing up the system.

Ted: 00:52:25 So I think what you're saying is that a model can't learn from what it doesn't know, because it only learns from what it shows people. And so if it doesn't show it to them, it never learns about it. So if it's got a really great result on the second page that nobody ever clicks to, it never finds out just how great that result is. Does that sound about fair?

Ellen: 00:52:49 It is. And I've heard Ted say this before, but a way to think about this is that there's actually an echo. The output, the results from a recommender on one day, is really part of the training for the model ongoing. And so it produces a result, and it learns from the results of what it's produced. And with that circularity, if you don't mix anything new into that, then the machine learning model has no way to learn anything new.
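The echo described here can be shown with a toy simulation. Everything in this sketch is an assumption made for illustration: the recommender ranks purely by accumulated click counts, `true_appeal` is a made-up probability that a user clicks an item once it is shown, and dithering is modeled by giving the last displayed slots to random items from below the cutoff.

```python
import random
from typing import Dict


def simulate(true_appeal: Dict[str, float], clicks: Dict[str, int], rounds: int,
             top_k: int, dither_slots: int, seed: int = 0) -> Dict[str, int]:
    """Return how often each item was shown when ranking is learned from past clicks."""
    rng = random.Random(seed)
    clicks = dict(clicks)  # copy so the caller's history is untouched
    impressions = {item: 0 for item in true_appeal}
    for _ in range(rounds):
        # Today's ranking is built from yesterday's clicks: the echo.
        ranked = sorted(clicks, key=lambda item: clicks[item], reverse=True)
        shown, tail = ranked[:top_k], ranked[top_k:]
        if dither_slots and tail:
            n = min(dither_slots, len(tail), len(shown))
            shown = shown[:top_k - n] + rng.sample(tail, n)
        for item in shown:
            impressions[item] += 1
            if rng.random() < true_appeal[item]:
                clicks[item] += 1  # only displayed items can ever gain clicks
    return impressions


appeal = {"a": 0.3, "b": 0.3, "c": 0.3, "d": 0.3, "new_song": 0.9}
history = {"a": 5, "b": 4, "c": 3, "d": 2, "new_song": 0}
without = simulate(appeal, history, rounds=500, top_k=3, dither_slots=0)
with_dither = simulate(appeal, history, rounds=500, top_k=3, dither_slots=1)
```

Without dithering, `new_song` is never displayed in this setup, so its click count can never rise, however appealing it is. With one dithered slot it does get shown, which gives the system a chance to find out.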

Ted: 00:53:25 I probably ought to listen to that too, because I probably only know what I know. Maybe I ought to try something new sometimes.

Ellen: 00:53:34 I think that's true probably. We had a more in depth question here.

David: 00:53:37 If you click on the question down there, there's a few really good ones.

Ellen: 00:53:43 So, for a recommender model, usually I only interact with one recommender model at a time, or only one model's output may be displayed. How does this affect the diagram I showed, where I have a dummy model that archives data and the data is fed to multiple models at the same time?

Ted: 00:54:06 How come you get such good questions?

Ellen: 00:54:07 Yeah. I'll go, I'll move the slides back. But that was actually looking at this ... Within the Rendezvous Architecture, we talked a little bit about a decoy model being something that just doesn't act on the data, it just archives it.

Ted: 00:54:27 So it is true that any given person only sees results from a single model. In the Rendezvous or in any other situation, you can only show people one sort of thing. Now with recommendations, which give you a list of things, you can actually mix that list from a couple of different recommenders, but the fact is what you show people is what you show them, and you don't get to have a do-over. On the other hand, the learning is based on the mixture of all people. It isn't just what one person does. And so in the Rendezvous Architecture, the Rendezvous server itself selects which result to send. Then I'm wondering if that answers two of these questions.

Ellen: 00:55:30 Okay. I'm going to jump in and answer another question here. We have, “How do we incorporate a domain expert's knowledge into the machine learning model, as the model learns only from the data?” That's a really intriguing question. I like that question. My take on it is this. The domain expert's knowledge is what should come into play as you're designing and building the system to start with. And a key aspect of that is that sometimes the person who's actually building the model and tuning the model, the person doing the data engineering to deploy the model, to do feature extraction, to make data available to the model, to build a pipeline, and the person who actually understands that whole system and knows how to frame the question and what data to go after, can all be one person. But these days much more often that's a team of people. And people tend to think of the data scientists themselves as the data science team building the machine learning system.

Ellen: 00:56:43 I think increasingly people are saying that data engineers are part of that team, a very essential part. These are people who have to have very good data skills. They don't need to build and tweak the algorithm itself, but they're doing a great deal of everything else around that, the larger part of the effort. But I think where it tends to fall off is that the people who have the domain knowledge are often beyond that team who's actually building the machine learning system. And this is even more true where you have a company that is offering machine learning as a service. They're going to be hired by a number of different companies in different situations, and they certainly wouldn't be expected to have the domain knowledge. And what's important is to recognize the importance of that knowledge and to learn better communication skills.

Ellen: 00:57:40 And it's a soft skill. It sounds like hand waving, but it is so important. People who do machine learning need to recognize the value of that other knowledge and, I'll just be blunt about it, not look down their noses because they're talking to people who don't know how to build a model. But the people who have domain knowledge, who are asking a data science team to build a machine learning system for them, need to learn a little bit about machine learning. They need to understand. They don't have to be able to build the algorithm either, but they do need to understand what matters, to have realistic expectations in terms of evaluations, in terms of timing, in terms of the resources, in terms of, say, this idea of ongoing evaluation. So they have to learn to not just say, “Well, those people are experts. I don't know how to talk to them.” That's not good enough. And that communication between those groups, especially at the outset of a project, is absolutely essential.

Ted: 00:58:45 Did you talk about the coupon example?

Ellen: 00:58:45 I did. I used that example and that was a graphic where-

Ted: 00:58:48 Perfect example.

Ellen: 00:58:48 In the example I did earlier with the online discount coupon, at the end of it, the problem wasn't with the model itself, how it was running. It was that the original concept of the system had made an error in terms of how it was set up. But if you hadn't been doing really good monitoring, really good evaluation of the behavior of the model at different times, you couldn't know that the model wasn't the problem. Obviously that's where you'd start. So both aspects of the evaluation were important.

Ted: 00:59:28 That was an example where domain knowledge got us the right data and then the problem was easy.

Ellen: 00:59:34 But I'm guessing the heart of your question is whether you have a direct link into your model that is somehow, in an automated way, interacting with the domain, I think.

Ted: 00:59:47 Feature engineering, though, is a big deal, and expert judgments for training.

Ellen: 00:59:53 Yeah.

Ted: 00:59:54 Those are all domain expertise.

Ellen: 00:59:55 Right. But I'm saying, I think the ability to involve, to take advantage of domain knowledge, at least at this point of the world is a human based thing. It's not something that's automated into a model itself.

Ted: 01:00:09 Absolutely. The models have very poor communication skills.

Ellen: 01:00:16 Okay. I think we have about five other questions, but we will answer those offline. We're back to the offline because we're-

Ted: 01:00:26 We're out of time.

Ellen: 01:00:27 We're out of time.

Ted: 01:00:27 You're out of time, I'm not.

Ellen: 01:00:30 Thank you so much everyone who joined us and also especially thank you to those who have posted a question.