ML Workshop 3: Machine Learning in Production


Ted Dunning PhD

Chief Application Architect, MapR Technologies

How Rendezvous Architecture Makes This Easier in Real World Settings

This third deep dive workshop in our machine learning logistics webinar series focuses on how to better manage models in production in real-world business settings. The workshop will show how key design characteristics of the Rendezvous Architecture make it easier to manage many models simultaneously and to roll them out to production safely, how to achieve scaling, and what the limitations of Rendezvous are.

Join us Tuesday February 6th as Ted will cover the role of containerization for models, serverless model deployment, how speculative execution works and when to use it, how to fold in the requirements of SLAs for practical business value from machine learning, and how the rendezvous architecture can deliver reliability in production. Ted will explain how these methods for machine learning management play out in a range of case histories such as retail analytics, fraud scoring, ad scoring and recommendation.


Ted: Hey there. This is the third webinar in a series about machine learning logistics and about the rendezvous architecture in particular. This recording is actually a little bit of a grudge match, you might say, a redo. We forgot to hit record on the original webinar, and so this one is going to differ slightly from what people heard when they were online. We're going to go right ahead and stay as close to the original as we can. A little bit later in the webinar, Ellen Friedman will be here, and she'll bring us some of the questions that were asked in the actual webinar. Let's get started.

Ted: Let's start with a review of what the rendezvous architecture is here. You can see the basic parts of the rendezvous architecture here, probably a little bit simplified from a real production instance, but the basic outlines are there. In particular, there's a proxy that accepts requests from the outside world and takes responsibility for interfacing with the actual rendezvous architecture. That interfacing involves taking the requests received by the proxy, adding, of course, a return address, we'll talk about that later, and putting every request into an input stream. That input stream allows all of the models for this particular task to get exactly the same input. That means they can all simultaneously evaluate every request, and they can all produce their outputs. The other output of the input stream is to the rendezvous server itself. This is not the same as the rendezvous architecture. It's just a component, but it's the key component of the rendezvous architecture.

Ted: All of the models evaluate that input, and all of them report their results to the scores stream. What that means is that as soon as any of the models has a result to report, they do that, and that goes into the stream. As other models report results, those go into the stream as well. The rendezvous server then ... What happens there is the original request that was seen in the input stream is used by the rendezvous server to start a post office box, kind of just a place to coordinate results for a particular request.

Ted: As results are reported by the models into the scores stream, the rendezvous server sees those. What it does is it either says, "Oh okay. I'm done. I've got a good result here", or it holds onto results from models until it's decided that that is or is not the result that it wants to return. The way that it does that, the reason that it holds onto things, is there's a thing called the rendezvous schedule. That's probably kept in a configuration file or possibly a stream even of configurations that the rendezvous server uses. The idea there is that the schedule tells the rendezvous server which is the preferred model and how long to wait for a preferred result.

Ted: After the time period where the preferred result is the exclusive result, the rendezvous server will broaden its parameters. It will look at less preferred outputs. If one of the models that's highly preferred is either slow or out to lunch for whatever reason, because all of the models start evaluating every request at the same time, it's very likely that some faster, but perhaps not as accurate, model has already reported results. As soon as we start broadening our preferences, according to the schedule, it's likely that we will accept a faster result.

Ted: It could be that several of the models, even one of the fast ones, do not report in time, and so ultimately, as things get close to the service level agreement on latency, as soon as the rendezvous server notices that we've waited for about as long as we should have waited, the rendezvous server will default to some standard constant answer which is as safe as we can make it. That gives us a result that's as good as we can get that still meets the SLA. What the rendezvous server is doing there, it's giving us reliability. What the models are responsible for doing is giving us accuracy, at least as much as possible within the time constraints and so on.
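As a rough illustration of that schedule-driven selection, here is a minimal Python sketch. The model names, the waiting times, and the safe default answer are all made up for illustration; a real rendezvous server would read the schedule from configuration and run this logic per request.

```python
# Hypothetical sketch of the rendezvous server's selection logic.
# The schedule maps each model to the earliest time (ms after the
# request arrived) at which we would accept its result; the SLA
# caps total latency.

SLA_MS = 100
SAFE_DEFAULT = "safe-constant-answer"

# model name -> earliest time (ms) we'd accept its result
schedule = {"champion": 0, "fast-backup": 60}

def pick_result(scores, elapsed_ms):
    """scores: dict of model name -> result received so far."""
    # Consider models in preference order: the smaller the
    # accept-after time, the more preferred the model.
    for model, accept_after in sorted(schedule.items(), key=lambda kv: kv[1]):
        if model in scores and elapsed_ms >= accept_after:
            return scores[model]
    # Past the SLA with nothing acceptable: fall back to the safe default.
    if elapsed_ms >= SLA_MS:
        return SAFE_DEFAULT
    return None  # keep waiting
```

Early on, only the champion's answer is acceptable; as the deadline approaches, the fast backup becomes acceptable too, and right at the SLA the safe constant wins over returning nothing.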

Ted: Then the result that the rendezvous server picks is put into the result stream, and it uses a topic that was given to it in the original input stream by the incoming proxy. The proxy puts a topic into that original request, and now the rendezvous server puts the result into the topic whose name matches the return address specified by the proxy. The proxy can emit a request and then can sit there and wait on the correct topic, and we don't have to worry about other proxies waiting for the wrong results or having to funnel through all of the results for all of the proxies that are waiting. Instead, the proxy will wait for the correct result. It'll be able to return that as a response to the original requester. Of course, the fact that we can use lots of topics in the results stream is part of the cool thing there about having streams involved here.
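Here's a hypothetical in-process sketch of that return-address pattern, using Python queues to stand in for topics in the streams. All the names are invented for illustration; in a real deployment these would be Kafka or MapR Event Store topics rather than in-memory queues.

```python
import queue
import uuid

# Stand-in for the results stream: one "topic" (queue) per waiting
# proxy, keyed by the return address the proxy generated.
topics = {}

def proxy_send(request, input_stream):
    """Stamp the request with a unique return topic and publish it."""
    return_topic = f"results-{uuid.uuid4().hex}"
    topics[return_topic] = queue.Queue()
    input_stream.put({"body": request, "return_topic": return_topic})
    return return_topic

def rendezvous_reply(result, return_topic):
    # The rendezvous server publishes to the topic named in the request.
    topics[return_topic].put(result)

def proxy_wait(return_topic, timeout=1.0):
    # The proxy waits only on its own topic; no other proxy's
    # results ever pass through it.
    return topics[return_topic].get(timeout=timeout)
```

The point of the sketch is the addressing discipline: the return topic travels with the request, so the server never needs to know which proxy asked.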

Ted: That's what the rendezvous server does. It handles and orchestrates all these models. It provides reliability in the face of possibly unreliable models.

Ted: Here today, what we're going to talk about is some examples of how that scheduling works and how we can scale up a rendezvous architecture or a deployment using the rendezvous architecture. I'm going to give some examples of scaling and so on. Then I'm going to talk about, frankly, the limitations of where rendezvous may work well for you and where it may not.

Ted: Let's talk about that schedule first. That's a very, very production-oriented aspect of the rendezvous architecture, and I think it's practically a unique characteristic. It's not well-encapsulated in most deployment environments, but it really is an important thing. The idea is that the rendezvous schedules define the trade-off of latency versus model priority. The cool thing here is we're actually able to separate the concerns of reliability from accuracy. Different people typically work toward different goals: data scientists tend to work more for accuracy and ops-type people work more for reliability, so this separation of concerns is really good because that way, we don't have everybody worrying about everything.

Ted: The schedule is going to say, "This is what we really want, i.e. we want the model that we think is the most accurate." The schedule also makes concessions to reality, to pragmatism, and says, "Yeah, but if we are beginning to come up against our SLAs, we will take another model." Then as we get really, really quite late in the deadline, we might say, "Well, we'll take the super fast but not very accurate model", or even right at the deadline, we might say, "Well, forget models. Let's just give a default answer that's as safe as we can." That really lets us define those trade-offs in a very, very precise way and absolutely guarantee that we will meet our SLA.

Ted: Normally, the same rendezvous schedule applies to every request that we get, but it's quite plausible that you would have an override captured in the request itself. You would have a new "now do this" type of rendezvous schedule within the context of a single request so that as it comes in, the rendezvous server would ignore the normal production default rendezvous schedule and it would use a per-request, special-purpose schedule. That would be really, really cool because if you are doing QA on different models and such, you might inject a QA transaction that says, "No, no. Just go to this test one." We've seen it evaluating transactions. We've seen it by looking into the streams and so on. Let's just do end to end. Give me the next champion's results, as if I were doing a normal rendezvous. That would be a nice thing to do, a belt-and-suspenders sort of thing to do, as we're rolling over to a new champion.

Ted: It's also something we might do not even rarely. Normally, we wouldn't do that very often, but in the case where we're A/B testing two different models, it might well be that we would, as a matter of course, have a lot of transactions that have overridden schedules. We might say option A is getting like 50% or 90% of the traffic, option B is getting less of the traffic, but still quite a lot. Option A could be the default and option B then could be the override. We would be able to say different users would get different schedules. That would let us make aggregate statements like, "Users in group A prefer this, or users in group B prefer that." We would even be able to look at the interaction of different models, i.e. different preferences of rendezvous schedules, for different cohorts of testing and the interactions with user experience, or even the effect on user experience. The schedules are a key aspect of how you would use the rendezvous architecture in the process, even if overrides are normally pretty rare.
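A per-request override plus an A/B split could be sketched like this. Everything here is hypothetical: the schedule names, the field names, and the traffic fraction are just for illustration, and a real system would also log the cohort for later analysis.

```python
import random

# Default (option A) and challenger (option B) schedules:
# model name -> earliest time (ms) we'd accept its result.
DEFAULT_SCHEDULE = {"champion": 0, "fast-backup": 60}
CHALLENGER_SCHEDULE = {"challenger": 0, "fast-backup": 60}

def effective_schedule(request):
    """A per-request override beats the production default."""
    return request.get("schedule_override", DEFAULT_SCHEDULE)

def tag_for_ab_test(request, b_fraction=0.1, rng=random):
    """Route a fraction of traffic to option B by stamping an override."""
    if rng.random() < b_fraction:
        request["schedule_override"] = CHALLENGER_SCHEDULE
        request["cohort"] = "B"
    else:
        request["cohort"] = "A"
    return request
```

Because the override rides along inside the request, the rendezvous server itself stays completely generic: it just honors whatever schedule it is handed.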

Ted: Now for scaling up, there's a couple of different ways that you would want to scale up. First of all, and probably the first kind of scale up you would have, is you would say, "Oh we've optimized this decision here about who should handle a particular manual task." That's a targeting kind of thing. Or, "We've highlighted things that might be fraud." That's kind of a fraud control sort of thing. We've done our first machine learning decision framework using rendezvous, and now we want to scale up by saying, "Here's a second kind of decision that we want to do."

Ted: It really isn't such a great idea to have the same rendezvous instance making different kinds of decisions. You could do that, but it's kind of crazy to do that. Much better is to have a different directory with all of those input streams, the score streams, and the output streams and different outputs for the proxies according to what kind of task we're doing. We're essentially scaling up the complexity of the decisions and models that we want to run using the rendezvous framework.

Ted: Doing that's really a piece of cake. At least it's a piece of cake if you have the ability to have as many streams as you like, as many configuration files as you like. Configuration files are easy. You just have a different directory. On MapR, of course, it's easy because you can put streams in a directory and you can put all the containers you're using in the same directory, and so it's easy to encapsulate the state, at least the configuration state, of a rendezvous server by itself. With other systems, it might be a lot more complicated.

Ted: Another kind of scaling that you may need is for more throughput. One of the fastest and easiest ways to get more throughput is to have a really fast default model, backstop model, in the rendezvous architecture. What that does is that you can back off and use that model if you need to. You can even change the rendezvous schedule to use that fast model a lot of the time, but then you can also partition the input stream so that there's a lot of threads of control. That gives you the potential to run multiple copies of some of the models.

Ted: If you have a fast default, you may not have to run very many copies of that. Then you have a fancy, slow, preferred model. If you have the partitioning on that input stream, possibly partitioning on the output stream, the score stream, as well, but mostly on the input, then your slow model can have multiple copies, multiple instances of itself, running. You can assign different partitions to different instances automatically using the Kafka API for MapR Streams (now called MapR Event Store) or for Kafka itself. That lets you scale up the actual throughput of your fancy, but perhaps slow, model. Of course, the fast defaults can scale up as much as they need, probably much less than the slow models. That'll give you more throughput.
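In a real deployment, the Kafka consumer-group mechanism does this assignment automatically when the model instances subscribe with the same group id; this tiny pure-Python sketch just illustrates the effect, i.e. how the partitions of the input stream end up spread across instances of the slow model.

```python
def assign_partitions(partitions, instances):
    """Round-robin partition assignment, roughly the way a Kafka
    consumer group spreads a partitioned input stream over the
    running instances of a slow model."""
    assignment = {inst: [] for inst in instances}
    for i, p in enumerate(partitions):
        assignment[instances[i % len(instances)]].append(p)
    return assignment
```

Adding a third instance of the slow model just re-spreads the same partitions three ways, which is exactly why partition count sets the ceiling on how far one model can scale out.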

Ted: At extreme volumes, as things really begin to push the performance of even replicated instances of models, as ensemble models make things more complicated, or as something like the ad targeting example we'll show you later drives up the amount of resources needed to evaluate all the models all the time, you may have problems. You may want to cannibalize some of the resources used by some of the instances of the fancy models to run fast models and defer to the fast ones more often, or you may want to go beyond strict rendezvous architecture and give up some of the beauty of the architecture in order to get speed.

Ted: Let's talk about a little bit of that. We're going to talk about two aspects of how to do these speed-ups. Suppose that we have one model that's pretty fast. It'll do 10,000 evaluations per second. It's not the most accurate thing, but it's pretty darn fast, and it's way faster than we're ever going to need. Suppose our champion can handle 1,000 transactions per second. That's good enough for the load we expect, but for some reason, we have just too many customers right now and we get a burst, lasting tens of minutes, at 2,000 transactions per second. The champion just can't handle that. We didn't have it partitioned enough or something. We can't just add new instances of it, for whatever reason. Maybe we don't have enough GPUs to handle that speed.

Ted: What we can do is we can have the champion, and this should always be done, the champion can be arranged so that whenever it gets a request, it works on it, and in a few milliseconds, it gives a result, but by the time it comes back around to look for more requests, several requests have backed up. That's because requests are coming in at 2,000 per second and the aggregate throughput of all the instances of our champion is only 1,000. There's going to be several requests stacked up. What we can do, a really, really simple thing, is notice that most of those stacked-up requests, the early ones, probably already have an answer from the fast model, and so we can just go skip, skip, skip through all of those requests except the most recent one. Then we'll evaluate that most recent one. We're still going to be able to produce a result every so often.
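That skip-ahead behavior might be sketched like this. It's a simplified, hypothetical backlog drain: real code would be advancing a consumer's position in the input stream rather than clearing a Python list.

```python
def drain_and_evaluate(backlog, evaluate):
    """When the slow model falls behind, skip the stale requests
    (the fast model has already covered those) and evaluate only
    the newest one. Returns (result, number skipped)."""
    if not backlog:
        return None
    newest = backlog[-1]
    skipped = len(backlog) - 1  # these ride on the fast model's answers
    backlog.clear()             # jump the consumer forward past them
    return evaluate(newest), skipped
```

The key property is that the slow model always works on something recent, so the answers it does produce are worth having, and the backlog never grows without bound.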

Ted: The champion can skip requests whenever it finds out that it's fallen behind, and the fast model, which of course is evaluating for everything that comes in, it will cover for the champion when the champion can't show up for work. That's really, really a cool capability. It means that our architecture is very, very resilient to these overloads.

Ted: Here's a picture. Suppose we have the input ... This is just part of the rendezvous architecture, the part that has to do with models, and we've got three models running there. Model one, two, and three. We prefer three pretty strenuously. If three gives us an answer within our SLA, we will accept it. Model two, it's not bad, but it's screaming fast. Model one, we might be keeping it around for evaluation, or it might be a prospective new champion, but it really doesn't matter what it's doing. Model two and three are the two that we have in production that are keeping everything going.

Ted: Normally, requests that come in through the input go through both model two and model three. That's the way the rendezvous works. Of course, it goes through model one too, but we're not worried about that. Model two and model three are going to evaluate it, and we're going to prefer model three results. Model three's results may come in a little after model two's, because model two is faster, but we're still going to wait at least a decent time in order for model three to be able to report. That's the normal situation. Requests come in, both models evaluate it, or at least one instance of each evaluates it, they both go to scores, and we pick model three's result. Cool.

Ted: But when things get hot, model three occasionally will not be able to produce a result in time. What'll happen as far as the rendezvous server is concerned is it'll see a result from model two, it'll wait, it'll wait, it'll wait, and say, "Sorry. I got to go", and it'll use the model two result. On the next request perhaps, or somewhere down the line, model three will see stacked-up requests, skip past those, and it'll get us a good result, but then it'll go back into overload and model two will be there for us and it'll give us results.

Ted: If even model two can't keep up, then we'll have some default answer, so we will always win and we will always be able to produce as good a result as we can. Model three, as it's doing its catch-up, could even report, "Don't wait for me. I'm behind", as soon as it can in order to allow the rendezvous server to commit to the model two results as soon as possible. Sometimes the server is going to wait. Sometimes it'll see "I can't keep up" messages, and it'll just go ahead and commit. Then other times, model three will report with good results.

Ted: Always have a default or fallback fast model. Models that fall behind should always discard requests in order to catch up, because they should have the confidence that the rendezvous architecture will cause somebody to give an answer within the SLA. It's better to catch up and evaluate recent requests than to fall behind, try to evaluate something that's in the past, fall further behind, try to evaluate the oldest thing, and wind up even further behind. Better to just go ahead, catch up, do as many as we can in tempo, and then throw some stuff away and catch up again. That's how we can use a fast fallback to scale the performance of a rendezvous architecture, even beyond what the strict limits of our hardware can do. That's because of this speculative execution.

Ted: Here's another idea. This idea says that it can be faster to handle multiple requests at once than it takes to handle those requests one at a time. The simple reason is that a lot of these models ... Now, we're going to get into math here. This is kind of a temporary excursion for the folks who like that sort of thing. It's a cool hack to make things run faster, but the core operation behind how a lot of models work is a matrix-vector multiplication. If you put a couple of vectors side by side, that forms a matrix, and a matrix-times-matrix product, which is really just multiple matrix-times-vector operations, is faster than doing those matrix-vector operations separately. This mathy stuff isn't going to be the rest of the webinar, but it's pretty good hardcore stuff for the folks who want to see it, so stick with us. We'll be right back with normal English in just a moment.

Ted: Basically, the idea is we're going to do some of these operations at wholesale rather than retail, and wholesale is cheaper. Here's a picture of how that works out. This is kind of a simplified diagram of a neural network. It's only one hidden layer. Current deep learning networks that use fully connected layers have many, many more layers. Other forms of architectures, like recurrent models, even have layers that are arranged differently, but this will illustrate the key point of how this is working.

Ted: The inputs interestingly for this kind of diagram are at the bottom. Not on the left, at the bottom. We can see we have three input neurons. That's so biologically exciting to be able to say that. Each of those arrows that go from the inputs represents multiplying that input by a constant. Where multiple arrows go to another neuron, that represents the addition of those multiplication results. The little wavy line in there means that we're going to limit the output with a soft limiter, that's how neural networks work, and then we're going to send the output of that to another neuron, again, multiplying by some constant represented by those arrows and adding them up and applying a limiting.

Ted: Mathematically speaking, we have a vector as our input. That's X0 at the bottom. We multiply it by matrix W1: W1 times X0. We put it into a soft limiting function. We're using the inverse tangent here. The important thing, the really expensive thing, is the matrix multiplication. Then in the next layer up, we're going to do a vector, W2, times another vector, X1. That's our intermediate outputs. Then we're going to do another inverse tangent sort of function again. Okay. We have two matrix operations, one matrix-vector product and one vector-vector product. Here's the shape of those operations. Vectors are tall and skinny, matrices are fat and square. We have a vector at the bottom. We have a matrix-vector product to get us another vector. Then we have a vector-vector product. The vector's stood on its side in order to illustrate that it's a certain kind of product.

Ted: If we look at those shapes and we take multiple inputs, we can put them side by side so the input becomes fat and a matrix. That means the middle operation is going to be fat times fat, and its result is going to be fat. The top operation is going to be wide and short, it's skinny still, times fat, like this. The idea here is that we can use the throughput of our numerical hardware more efficiently by doing these fat kinds of operations rather than lots of skinny operations. The reason is just because we get to reuse values. If we put it back in mathy talk, the input is now a matrix, not a vector. The middle product there times W1, that's the same [inaudible 00:27:17] we had before, now multiplies by a matrix, and its result is now a matrix. The top level multiplies a vector times a matrix to get a vector of results.

Ted: Now the math's done, but the cool thing here is if we just wait for a couple of requests to pile up, if we're allowed to wait according to our SLAs, then we can evaluate all of them all at once really fast. That's an alternative to just throwing away requests. Just go ahead and evaluate a bunch at a time because that may be just about as fast as evaluating one. If we're only a few behind, let's go for it. Evaluate them all at once.
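Here's a small NumPy sketch of that wholesale-versus-retail idea, using made-up random weights for the one-hidden-layer network in the diagram. Evaluating a batch of requests as one matrix-matrix product gives exactly the same answers as evaluating them one at a time with matrix-vector products, but the batched form uses the hardware far more efficiently.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 8))   # hidden-layer weights (made up)
w2 = rng.normal(size=16)        # output-layer weights (made up)

def evaluate_one(x0):
    """One request: matrix-vector, soft limit, vector-vector, soft limit."""
    x1 = np.arctan(W1 @ x0)
    return np.arctan(w2 @ x1)

def evaluate_batch(X0):
    """Columns of X0 are stacked requests: one matrix-matrix product
    replaces many matrix-vector products."""
    X1 = np.arctan(W1 @ X0)     # fat times fat
    return np.arctan(w2 @ X1)   # skinny times fat -> one result per column
```

Stacking the inputs as columns is exactly the "put the vectors side by side so the input becomes a matrix" step from the slide.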

Ted: Those are two strategies for making rendezvous faster, or three I guess. Partitioning is one. Skipping inputs and relying on a fallback is another one. That's a great one for emergency overloads. The third one is to kind of change the algorithm so that we can use parallelism internally in the math hardware more effectively.

Ted: It may be, and this is true just because this is a somewhat idealized architecture, that we can now actually build more idealistic architectures than we used to be able to. A lot of the time we can use full-on rendezvous. Sometimes we can't. One example of where we really can't is where the model evaluations are happening so often, or are so expensive, that we can't afford to do speculative execution on every single request. We might only be able to evaluate two models on every request, or even just one. Okay. That means that we can't really do the full-on rendezvous. We should do as much as we can, but ultimately, there's a cost of execution.

Ted: We also have to recognize that when we're starting out, we should walk instead of run. When we're building a minimum viable product for something, especially in the startup world, zero downtime, absolutely guaranteeing SLAs, and so on, may just not be all that important. We all hope that it will be important eventually, but when you are just starting out, simplicity may be more valuable than reliability. What you can do is you can use a proxy that directly talks to a model, just gets the result, and sends it back. That's kind of the way most open source systems like PredictionIO work. Apache PredictionIO has a proxy, and it directly gives a request to one model and returns that result. It's kind of like what a conventional load balancer would do in a web serving framework. That's fine to start out with, but it is probably not good enough once you have a big audience, once you have big expectations of never having any downtime, never stepping outside of SLAs. Just at the beginning, you may be allowed to be a bit sloppy.

Ted: There's another place where rendezvous just doesn't make a lot of sense, and that's where the context of a request is just too big to be passing in every time. This could be because you have a long-running interaction with a particular model. You've got a little robot, a chatbot, there, and you're talking to it back and forth. It might have megabytes or hundreds of megabytes of context, and returning what the chatbot says along with the current context that the client then passes back in is just too crazy. If you can't pass in all the context, you can't very well expect every model to get the same input, and rendezvous begins to be not quite the right thing.

Ted: Latency limits may also just be too stringent. We've been assuming that Kafka streaming is the choice here. My own preference, of course, is MapR Streams (now called MapR Event Store), which implements that same kind of API. The key simplification with that kind of stream is that streams are persistent. That assumption is used in the rendezvous architecture to simplify the world. That's really a good thing, really an important thing, but if you're assuming that you're going to give a response to every request within, say, a millisecond, assuming that you will always get through multiple stream hops in just a millisecond may not be such a great idea.

Ted: You may want to be going to an in-memory architecture. You may even want to be using something like the actor model, where you directly send requests to models and they directly send, via network or in-memory copies, results back to you. You can get latencies then down into just a few milliseconds, or microseconds even, but you're going to have a much more complex system. It's going to be much harder to separate concerns of reliability and accuracy. It's going to be much harder to guarantee the correctness of your system. If you got to do it, you got to do it, but look at ways of designing the system so you have 10 milliseconds, tens of milliseconds, maybe 100 milliseconds, before you have to respond. That really covers a huge range of where machine learning really, really applies in businesses.

Ted: I'm going to give an example here. This is an example of how the model evaluation could be just too expensive to run full-on rendezvous. This example is a typical architecture for ad targeting. You see this just all over the place. Not quite sure if it's just convergent evolution or if just enough people left the early online marketing organizations so that they all kind of converge to the same architecture, but it's pretty typical. The idea is a request comes into a proxy, a front-end proxy.

Ted: First, step one, that's the little circle with the "1" in it, is get the user's profile. That's any variables we know about that user, recent activity, what they've seen lately, what they click on, and so on. That's a pretty sizable retrieval, and so that could take a millisecond or more.

Ted: We use that information, and we use the characteristics of known ads, possibly hundreds of thousands of ads, maybe millions, and what we do is we do a retrieval of some kind, a pre-selection. This is a very, very rough kind of model to try to find, say, a thousand ads that might be good for this user. We're not going to do an exact probability-of-click model at that point. We're going to apply some rough kind of model, something like latent semantic indexing, or some indicator tagging on the ads and so on, to get a very fast result. You can use a search engine, for instance, even just ordinary Lucene. If you have indicator codes that have been attached to ads, you can do a pre-targeting, which sometimes is even good enough for honest-to-god targeting, in less than a millisecond.
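Here's a toy sketch of that indicator-overlap pre-selection. In practice you'd use a real search engine like Lucene to do this at scale; the tags, ads, and scoring-by-overlap rule here are invented purely to illustrate the rough-cut idea.

```python
def preselect(user_indicators, ad_index, k=1000):
    """Very rough pre-targeting: score each ad by how many of its
    indicator codes overlap the user's indicators, keep the top k.

    user_indicators: set of indicator codes for this user
    ad_index: dict of ad id -> set of indicator codes for that ad
    """
    scored = ((len(user_indicators & tags), ad)
              for ad, tags in ad_index.items())
    # Keep the best k ads that share at least one indicator.
    return [ad for score, ad in sorted(scored, reverse=True)[:k] if score > 0]
```

The output of a step like this is the candidate list of roughly a thousand ads that the detailed per-ad click models then score.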

Ted: Now you've got a thousand potential ads and you want to score them for detailed probability of click. You will have a particular model for every ad. It may be very similar to other models, but it's going to have special characteristics. In particular, it's going to have interaction variables against the user profile. You really want a real model there. You're going to have roughly a thousand ads that are going to turn into a thousand model evaluations per ad placement. If you've got 10,000 or 20,000 ad placements per second, you're going to be looking at 10 million to 20 million model evaluations per second. That's going to be kind of expensive, especially if you start talking about, "Oh, I'm going to have 10 versions of all these models." Then we're talking 200 million model evaluations per second.

Ted: Your person who's signing the checks is probably going to start going, "Are you sure you want to evaluate all that much? Can't you do it a little bit cheaper?" Yeah, maybe you could do some things cheaper. You might be able to do some of the tricks with multiple evaluations in one step in order to get higher parallelism, matrix-matrix operations, but the fact is you're still doing an awful lot of numerical processing there and full-on rendezvous is probably not the right choice.

Ted: Also, frankly, when you've got a thousand ads, if you get probability of click for 800 of them, and for some reason, 200 don't report, say because the server that just happens to be evaluating those 200 models is broken right now and it's going to be back later, 80% of the results are probably just good enough. You can do a different kind of rough-cut rendezvousing where you just say, "What percentage of the results do I need to get back in order to target this thing?" This is an example where the complexity just kind of exceeds what we're talking about with the simple decision engines that would be totally cool with the rendezvous architecture.
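That fraction-based, rough-cut rendezvous could be sketched like this. The 80% threshold and the names are illustrative; a real implementation would also have an SLA deadline that forces the issue even below the threshold.

```python
def enough_to_target(scores, n_candidates, needed_fraction=0.8):
    """Rough-cut rendezvous for ad scoring: return the ranked ads as
    soon as a large-enough fraction of the per-ad models have reported.

    scores: dict of ad id -> predicted probability of click so far
    n_candidates: how many ads went out for scoring
    """
    if len(scores) < needed_fraction * n_candidates:
        return None  # keep waiting (until the SLA forces the issue)
    # Rank whatever we have by predicted probability of click.
    return sorted(scores, key=scores.get, reverse=True)
```

Instead of waiting for one preferred model, the rendezvous condition here is a quorum over many small models, which is what makes it practical at ad-targeting volumes.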

Ted: Here's the slide that explains all that. It comes down to this: full speculative execution, running all the current models across all the options, is just prohibitive. You might start pulling in some rendezvous concepts like partial results or default results, or you might say 1% of the requests are going to have some speculative execution. That cuts the cost of speculation by 100x. You may only need a few servers to do that speculation, but you're going to have to do something.

Ted: There's ways to do that. You can selectively do speculative execution or just speculate less. You may do all of the other tricks and incorporate some of the rendezvous things. That's kind of like rendezvous-lite.

Ted: You should always remember these conclusions. Computers are just way, way, way faster than they used to be. That means you can do things that by common sense might just seem kind of outrageous. Frankly, computers are outrageous these days. They are so much faster than a lot of people think, especially if you do a decently efficient implementation. This 100% speculative execution against 10 model versions is no longer nearly as crazy as it used to be. A lot of our intuitions about that were really kind of tribal myths that were formed many, many decades ago when we were talking about Cray supercomputers. Big has gotten really, really pretty small, or small has gotten big. I'm never sure which way that goes, but we can do a lot more computation than we used to. We should try it anyway because that makes the advantages of things like rendezvous much more in reach. We do have to remember you can't always have it all, even though we'd like to. Our mothers probably told us that when we were little.

Ted: We have a bunch of books that might help you if you're going to build these sorts of things. One of them is about geo-distributed data. That's "Data Where You Want It". Ellen, by the way, who came in during this recording, is here. Say hi.

Ellen: Hello.

Ted: And she's co-author on all these books. The "Streaming Architecture" one that has that beautiful picture of the stream going through Seoul is another one which talks about the basic ideas that underlie streaming systems in general, including the rendezvous architecture, although it doesn't talk explicitly about rendezvous. We also have some basic machine learning ideas in the "Practical Machine Learning" series about anomaly detection and innovations in recommendation. These talk about what I often call "cheap learning": amazingly simple and pragmatic ways to get amazing results from machine learning. We've got blogs, and we've got the most recent book with Ellen about rendezvous, just what we've been talking about today. It's called "Machine Learning Logistics". It talks about model management in the real world.

Ted: Always remember there's half the population that is underrepresented in tech, and more. There are a lot of underrepresented folks out there: notably women, and other kinds of people as well, people of color, people of different preferences. We ought to be fair about that and include all the smart people in the world, because we need plenty of them. In fact, Ellen is one of these women in tech. It's a different world than when she started, but we can make it a better world still.

Ted: You can of course engage with us. Several of you have during the real time webinar that we had. Ellen has several of those questions. Ellen, what you got?

Ellen: Okay. I think just quickly, before we jump in to the questions: right at the end of the main part of the presentation, you talked about some of the limitations of rendezvous. That's important. It's a really powerful technique that cuts across projects where people are using a wide range of different machine learning tools. I don't know if you mentioned that. This is a much more fundamental aspect. We're talking about the ways in which logistics can be handled well by a platform. We look at the MapR platform because it has those capabilities; it's designed for that. There are other ways that you could do that as a workaround, but by having some of those aspects of data and model management handled at the platform level rather than in each different application, it frees up your data scientists to do more of what they're actually trying to do. More importantly, it's a fundamental piece that doesn't change with every different tool. There are limitations, and it doesn't fit every example, but at the same time, it cuts across a lot of different examples.

Ted: To summarize a couple points in there, you can use lots of tools. In fact, that's one of the advantages, that you can use them at the same time, you can even use containers and such, and it has very, very broad applicability.

Ellen: A couple of other points I just wanted to mention as a quick summary of the key ideas before we jump into the questions. One point you mentioned several times, but keep this in mind, is that the rendezvous architecture as a server in production is really about optimizing for reliability. It's not necessarily trying to optimize accuracy. It just doesn't-

Ted: It's building in reliability to systems that you presume at least have some potential to be reliable or accurate.

Ellen: Exactly, but the point is it's not always going ... The first choice, the best choice, is not always just based on accuracy. It is absolutely giving you reliability, but I just wanted to mention that this isn't to say accuracy isn't still a concern: you can build into your SLA how accurate your model needs to be.

Ted: No. What we're doing is we're separating the concern-

Ellen: Exactly, exactly.

Ted: so the data scientists can do those scientists things to get accuracy, and those operations type folks who are all about reliability can achieve their mission without having to get into the data scientists' hair.

Ellen: This is particularly important because we're talking about doing this in situations that are enterprise production situations. This matters for the way business is being done in sites that are depending on this.

Ted: It's particularly important if you've ever seen the hair on most data scientists. You don't want to get into their hair.

Ellen: You don't want to get into their hair.

Ted: Yeah.

Ellen: Just a couple of other points to keep in mind that you pointed out and that I want people to remember: we're talking about these systems dealing with many, many models. People who are really new to machine learning, or perhaps haven't taken it into production, may be thinking in terms of developing the model, evaluating it, getting a good enough model, and rolling it into production, but that's not realistic in real situations.

Ted: No. In fact, in situations where they start seeing what actually happens, they are often appalled, and their managers are even more appalled.

Ellen: Rendezvous is one example of a design that is dealing with those challenges.

Ted: Let's see what those numbers actually are in customers that I've seen. I've seen some people who were very, very sharp on machine learning, good at it, and they wind up with 10 models per function. They wind up with 100 or hundreds of functions, and so almost immediately, within months of going in whole hog, they wind up with thousands of models live at one time. The management burden, the logistics, is quite intimidating.

Ellen: That's a good point to jump into one of the questions that we got from the audience. They made the comment, the observation, that this sounds good, but it can be a little overwhelming. This is a big system. It's got a lot of parts. Also, just this question of there being so many different tools to use, there's so many different parts of this, so how do you get started? Do you have to build the entire thing to get any of this advantage?

Ted: Yeah. I apologized for the mathiness of it. The questioner here was a machine learning expert, so they weren't put off by the math at all. They were put off by the, "Whoa, streams! Whoa, mailbox!", that sort of stuff. We have to remember that jargon is intimidating to people who haven't been doing it, and we want people to be able to specialize. We want data scientists to stay data scientists and not have to become data engineers or software engineers, and we want the software engineers to not have to become data scientists. It's fine for part of the system to be scary, at least as long as you don't have to go look at it.

Ted: The way to get started is to do it with a minimum viable product, with a very simple proxy, and to join in the conversation. This is still a reference architecture, but you can adopt it in stages. As we develop it further with more and more code, it'll be open source. We can participate and work together on this.

Ellen: That's also an example, as Ted mentioned earlier: this rendezvous architecture is a specialized example of what you can build when you're working with a stream-based architecture, when you have streaming microservices, in the larger sense, not just for machine learning. Streaming microservices-

Ted: Keep going.

Ellen: I'm sorry. You had a question?

Ted: No. Just keep going.

Ellen: Okay. The streaming microservices approach is very flexible. When you're trying to make a shift from your current design to that design, one of the nice things about it is that streaming microservices are so flexible, you don't have to build it out across your whole organization at one time. You can begin with parts of projects, new projects, individual pieces, and it's very easy to transition that way. Similarly, you can begin to build out rendezvous and adopt it in stages.

Ted: Okay. We also had a question about Kubernetes. Kubernetes is probably the most popular buzzword right now, but it's actually a big deal. Did you want to say anything about that or do you want me to address that?

Ellen: No. Go for it.

Ted: Yeah. Kubernetes is a way of managing clusters of containers. Containers are processes that have an entire environment encapsulated in them so that we can isolate dependencies and run things with different version dependencies. The question was, how would this work on MapR? The answer, of course, is brilliantly. The reason that's the answer is that containers are being used now on top of MapR. A lot of containers need state, or at least, a lot of applications need state; they need a little bit of caching and a little bit of data kept in memory and so on in order to function efficiently. They need to have streams as output. Putting that state into the containers is a bad thing, but having a container orchestration system like Kubernetes decide which containers to run and on which machines, and then having those containers have access to files, streams, and tables through standard APIs, is really, really good.

Ted: HDFS is a commonly used large-scale data store. It's not really quite a file system, and it doesn't have those standard APIs. That makes a lot of stuff really hard, because very, very few machine learning systems are designed with HDFS in mind. They're designed with ordinary file systems in mind. The ability to do standard things in containers that have access to state is really, really powerful. MapR is a data platform that lets you do just exactly that.
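The "standard APIs" point is that a containerized model can persist its state with ordinary file calls on a platform-mounted path, rather than through a special HDFS client library. A minimal sketch of that idea; the mount path and function names here are hypothetical, invented for illustration:

```python
import json
import os

# Hypothetical mount point where the data platform (e.g. MapR mounted
# into the container) exposes ordinary files; this path is made up.
STATE_DIR = os.environ.get("STATE_DIR", "/mapr/models/state")

def save_state(name, obj, base=STATE_DIR):
    """Persist model state with plain file APIs; no special client needed."""
    with open(os.path.join(base, name), "w") as f:
        json.dump(obj, f)

def load_state(name, base=STATE_DIR):
    """Read model state back with the same ordinary file APIs."""
    with open(os.path.join(base, name)) as f:
        return json.load(f)
```

Because this is just the POSIX file interface, the same code runs unchanged on a laptop, in a container, or against the platform-mounted volume.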

Ellen: Okay. We have another question from the audience, a very good one. They said, "Are there any examples that show having set up this whole system? In other words, is there a complete example of rendezvous already available?"

Ted: Already running in production?

Ellen: Yeah.

Ted: Pieces of rendezvous, absolutely. I have shamelessly learned from people who are smarter than me in certain respects. There are aspects of rendezvous which are definitely in production. Frankly, the whole idea of champion-challenger, even the idea of speculative execution, goes back decades in fraud control and machine learning, but it hasn't been applied as systematically as rendezvous applies it.

Ellen: Rendezvous, what we've proposed here, what Ted's proposed here, is a very new, innovative design. MapR has customers that we've already talked to. They're very interested in this, and they are beginning to build out parts of it now. It isn't just a theory. It is actually a real-world thing in motion-

Ted: It partly is, yeah.

Ellen: It's not completed yet. I don't think any single system has it completed-

Ted: No.

Ellen: ... and up in production yet, but this is in motion now. It's being developed. We're very excited to see who's going to use this and how they're going to apply it.

Ted: Okay.

Ellen: I think that's it on our questions.

Ted: Or as they say in Hollywood, "That's a wrap."

Ellen: That's a wrap. We really appreciate the audience joining us for this. Please do ... Ted has his email there on the screen, and mine is there as well. We would love to hear from you if you have further questions, but also if there are other topics that you would like to see addressed through these webinars, please contact us and suggest topics you'd like to hear. It's really helpful to get your feedback.

Ted: Okay. Also, of course, my Twitter contact is there. Ellen is fairly prominent on Twitter and you can find her easily.

Ellen: It's @ellen_friedman.

Ted: As you might guess from her name, which is pretty cool.

Ellen: Alright. Thanks a lot.

Ted: Thanks very much. And we're out.