Machine Learning Logistics

by Ted Dunning and Ellen Friedman

The Rendezvous Architecture for Machine Learning

Rendezvous architecture is a design to handle the logistics of machine learning in a flexible, responsive, convenient, and realistic way. Specifically, rendezvous provides a way to do the following:

  • Collect data at scale from a variety of sources and preserve raw data so that potentially valuable features are not lost
  • Make input and output data available to many independent applications (consumers) even across geographically distant locations, on premises, or in the cloud
  • Manage multiple models during development and easily roll into production
  • Improve evaluation methods for comparing models during development and production, including use of a reference model for baseline successful performance
  • Have new models poised for rapid deployment

The rendezvous architecture works in concert with your organization’s global data fabric. It doesn’t solve all of the challenges of logistics and model management, but it does provide a pragmatic and powerful design that greatly improves the likelihood that machine learning will deliver value from big data.

In this chapter, we explain in detail what motivates this design and how it delivers the advantages we’ve mentioned. We start with the shortcomings of previous designs and follow a design path to a more flexible approach.

A Traditional Starting Point

When building a machine learning application, it is very common to want a discrete response system. In such a system, you pass all of the information needed to make some decision, and a machine learning model responds with a decision. The key characteristic is this synchronous response style.

For now, we can assume that there is nothing outside of the request needed by the model to make this decision. Figure 3-1 shows the basic architecture of a system like this.

Figure 3-1. A discrete response system, one in which a model responds to requests with decisions, poses problems that underline the need for the rendezvous architecture.

This is very much like the first version of the henhouse monitoring system described in Why Model Management?. The biggest virtue of such a system is its stunning simplicity, which is obviously desirable.

That is its biggest vice, as well.

Problems crop up when we begin to impose some of the other requirements that are inherent in deploying machine learning models to production. For instance, it is common in such a system to require that we can run multiple models at the same time on the exact same data in order to compare their speed and accuracy. Another common requirement is that we separate the concerns of decision accuracy from system reliability guarantees. We obviously can’t completely separate these, but it would be nice if our data scientists who develop the model could focus on science-y things like accuracy, with only broad-brush requirements around topics like redundancy, running multiple models, speed and absolute stability.

Similarly, it would be nice if the ops part of our DataOps team could focus more on guaranteeing that the system behaves like a solid, well-engineered piece of software that isolates models from one another, always returns results on time, restarts failed processes, transparently rolls to new versions, and so on. We also want a system that meets operational requirements like deadlines and makes it easy to decide, manage, and change which models are in play.

Why a Load Balancer Doesn’t Suffice

The first thought that lots of people have when challenged to get better operational characteristics from a basic discrete decision system (as in Figure 3-1) is to simply replicate the basic decision engine and put a load balancer in front of the replicas. This is the standard approach for microservices, as shown in Figure 3-2. Using a load balancer solves some problems, but definitely not all of them, especially not in the context of learned models. With a load balancer, you can start and stop new models pretty easily, but you don’t easily get latency guarantees, the ability to compare models on identical inputs, or records of all of the requests with responses from all live models.

Figure 3-2. A load balancer, in which each request is sent to just one of the active models, is an improvement but lacks key capabilities of the rendezvous style.

Using a load balancer in front of a microservice works really well in many domains such as profile lookup, web servers, and content engines. So, why doesn’t it work well for machine learning models?

The basic problem comes down to some pretty fundamental discrepancies between the nature and life cycle of conventional software services and services based on machine learning (or lots of other data-intensive services):

  • The differences between revisions of machine learning models are often subtle, and we typically need to give the exact same input to multiple revisions and record all of their results when making a comparison, which is inherently statistical in nature. Revisions of the software in a conventional microservice don’t usually require that kind of extended parallel operation and statistical comparison.
  • We often can’t predict which of several new techniques will yield viable improvements within realistic operational settings. That means that we often want to run as many as a dozen versions of our model at a time. Running multiple versions of a conventional service at the same time is usually considered a mistake rather than a required feature.
  • In a conventional DevOps team, we have a mix of people who have varying strengths in (software) development or operations. Typically, the software engineers in the team aren’t all that bad at operations and the operations specialists understand software engineering pretty well. In a DataOps team, we have a broader mix of data scientists, software engineers, and operations specialists. Not only does a DataOps team cover more ground than a DevOps team, there is typically much less overlap in skills between data scientists and software engineers or operations engineers. That makes rolling new versions much more complex socially than with conventional software.
  • Quite frankly, because of the wider gap in skills between data scientists and software or ops engineers, we need to allow for the fact that models will typically not be implemented with as much software engineering rigor as we might like. We also must allow for the fact that the framework is going to need to provide for a lot more data rigor than most software does in order to satisfy the data science part of the team.

These problems could be addressed by building a new kind of load balancer and depending heavily on the service discovery features of frameworks such as Kubernetes, but there is a much simpler path. That simpler path is to use a stream-first architecture such as the rendezvous architecture.

A Better Alternative: Input Data as a Stream

Now, we take the first step underlying the new design. As What Matters in Model Management explains, message streams in the style of Apache Kafka, including MapR Streams, are an ideal construct here because stream consumers control what and when they listen for data (pull style). That completely sidesteps the problem of service discovery and avoids the problem of making sure all request sources send all transactions to all models. Using a stream to receive the requests, as depicted in Figure 3-3, also gives us a persistent record of all incoming requests, which is really helpful for debugging and postmortem purposes. Remember that the models aren’t necessarily single processes.

Figure 3-3. Receiving requests via a stream makes it easy to distribute a request to all live models, but we need more machinery to get responses back to the source of requests.

But we immediately run into a question: if we put the requests into a stream, how will the results come back? With the original discrete decision architecture in Figure 3-1, there is a response for every request, and that response can naturally include the results from the model. On the other hand, if we send the requests into a stream and evaluate those requests with lots of models, the insertion into the input stream will complete before any model has even looked at the request. Even worse, with multiple models all producing results at different times, there isn’t a natural way to pick which result we should return, nor is there any obvious way to return it. These additional challenges motivate the rendezvous design.

Rendezvous Style

We can solve these problems with two simple actions. First, we can put a return address into each request. Second, we can add a process known as a rendezvous server that selects which result to return for each request. The return address specifies how a selected result can be returned to the source of the request. It could be the URL of an HTTP endpoint on a REST server. Even better, it can be the name of a message stream and a topic. Whatever works best for you is what it needs to be.
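To make this concrete, here is a minimal Python sketch of a request envelope that carries a return address. The field names and the helper function are illustrative, not a fixed schema from this design:

```python
import time
import uuid

def make_request(payload, return_topic):
    """Wrap model inputs in a request envelope that carries a return address.

    The field names here are illustrative, not a fixed schema.
    """
    return {
        "timestamp": int(time.time() * 1000),      # milliseconds
        "messageId": uuid.uuid4().hex,             # 128-bit unique identifier
        "returnAddress": {"stream": "responses", "topic": return_topic},
        "inputs": payload,
    }

# The requesting process names the topic where it wants its answer delivered.
request = make_request({"image": "s3://bucket/hen-42.jpg"}, return_topic="caller-7")
```

Whatever transport delivers this envelope, the rendezvous server only needs to copy the return address from the request into its routing decision.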


Using a rendezvous style works only if the streaming and processing elements you are using are compatible with your latency requirements.

For persistent message queues, such as Kafka and MapR Streams, and for processing frameworks, such as Apache Flink or even just raw Java, a rendezvous architecture will likely work well—down to around single millisecond latencies.

Conversely, as of this writing, microbatch frameworks such as Apache Spark Streaming will just barely be able to handle latencies as low as single digit seconds (not milliseconds). That might be acceptable, but often it will not be. At the other extreme, if you need to go faster than a few milliseconds, you might need to use nonpersistent, in-memory streaming technologies. The rendezvous architecture will still apply.


The key distinguishing feature in a rendezvous architecture is how the rendezvous server reads all of the requests as well as all of the results from all of the models and brings them back together.

Figure 3-4 illustrates how a rendezvous server works. The rendezvous server uses a policy to select which result to anoint as “official” and writes that official result to a stream. In the system shown, we assume that the return address consists of a topic and request identifier and that the rendezvous server should write the results to a well-known stream with the specified topic. The result should also carry the request identifier so that the process that originally sent the request can match responses to requests, since it may have several requests outstanding at once.

Figure 3-4. The core rendezvous design. There are additional nuances, but this is the essential shape of the architecture.

Internally, the rendezvous server works by maintaining a mailbox for each request it sees in the input stream. As each of the models reports results into the scores stream, the rendezvous server reads these results and inserts them into the corresponding mailbox. Based on the amount of time that has passed, the priority of each model, and possibly even a random number, the rendezvous server eventually chooses a result for each pending mailbox and packages that result to be sent as a response to the return address in the original request.
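The mailbox mechanism can be sketched in a few lines of Python. This is a toy version with invented names: a real rendezvous server would read requests and scores from message streams rather than take direct method calls, and would also apply deadlines and random tie-breaking as just described. Here, only model priority drives the choice:

```python
from collections import defaultdict

class RendezvousServer:
    """Toy sketch of the mailbox logic: collect per-request results from the
    scores stream and pick one according to a simple priority policy."""

    def __init__(self, model_priority):
        self.model_priority = model_priority        # most preferred model first
        self.mailboxes = defaultdict(dict)          # request id -> {model: result}

    def on_score(self, request_id, model, result):
        """Called for each message a model writes to the scores stream."""
        self.mailboxes[request_id][model] = result

    def choose(self, request_id):
        """Return the result from the highest-priority model that has reported."""
        mailbox = self.mailboxes.pop(request_id, {})
        for model in self.model_priority:
            if model in mailbox:
                return model, mailbox[model]
        return None, None

rv = RendezvousServer(model_priority=["fancy-v3", "baseline"])
rv.on_score("req-1", "baseline", 0.42)      # the simple model answers first
rv.on_score("req-1", "fancy-v3", 0.87)      # the preferred model answers in time
model, result = rv.choose("req-1")          # -> ("fancy-v3", 0.87)
```

If the preferred model never reports, the same `choose` call falls back to the baseline result, which is exactly the backstop behavior described below.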

One strength of the rendezvous architecture is that a model can be “warmed up” before its outputs are actually used so that the stability of the model under production conditions and load can be verified. Another advantage is that models can be “deployed” or “undeployed” simply by instructing the rendezvous server to stop (or start) ignoring their output.

Related to this, the rendezvous can make guarantees about returning results that the individual models cannot make. You can, for instance, define a policy that specifies how long to wait for the output of a preferred model. If at least one of the models is very simple and reliable, albeit a bit less accurate, this simple model can be used as a backstop answer so that if more sophisticated models take too long or fail entirely, we can still produce some kind of answer before a deadline. Sending the results back to a highly available message stream as shown in Figure 3-4 also helps with reliability by decoupling the sending of the result by the rendezvous server from the retrieval of the result by the original requestor.

Message Contents

The messages between the components in a rendezvous architecture are mostly what you would expect, with conventional elements like timestamp, request id, and request or response contents, but there are some message elements that might surprise you on first examination.

The messages in the system need to satisfy multiple kinds of goals that are focused around operations, good software engineering and data science. If you look at the messages from just one of these points of view, some elements of the messages may strike you as unnecessary.

All of the messages include a timestamp, message identifier, provenance, and diagnostics components. This makes the messages look roughly like the following if they are rendered in JSON form:

    {
        timestamp: 1501020498314,
        messageId: "2a5f2b61fdd848d7954a51b49c2a9e2c",
        provenance: { ... },
        diagnostics: { ... },
        ... application-specific data here ...
    }

The first two common message fields are relatively self-explanatory. The timestamp should be in milliseconds, and the message identifier should be long enough to be confident that it is unique. The one shown here is 128 bits long.
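For instance, one way to generate such an envelope in Python is with a millisecond clock and a 128-bit random identifier. The field names follow the example above; the rest is an assumption about how you might implement it:

```python
import time
import uuid

# A common envelope: a millisecond timestamp plus a 128-bit message identifier.
# uuid4() supplies 128 random bits, rendered as 32 hex characters, which
# matches the length of the identifier shown above.
message = {
    "timestamp": int(time.time() * 1000),
    "messageId": uuid.uuid4().hex,
    "provenance": {},
    "diagnostics": {},
}
```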

The provenance section provides a history of the processing elements, including release version, that have touched this message. It also can contain information about the source characteristics of the request in case we want to drill down on aggregate metrics. This is particularly important when analyzing the performance and impact of different versions of components or different sources of requests. Including the provenance information also allows limited trace diagnostics to be returned to the originator of the request without having to look up any information in log files or tables.

The amount of information kept in the provenance section should be relatively limited by default to the information that you really need to return to the original caller. You can increase the level of detail by setting parameters in the diagnostics section. If tracing is enabled for a request, the provenance section will contain the trace parent identifier to allow latency traces to be knit back together when you want to analyze what happened during a particular query. Depending on your query rates, the fraction of queries that have latency tracing turned on will vary. It might be all queries or it might be a tiny fraction.

The diagnostics section contains flags that can override various environmental settings. These overrides can force more logging or change the fallback schedule that the rendezvous server uses to select different model outputs to be returned to the original requestor. If desired, you can even use the diagnostics element to do failure injection. The faults injected could include delaying a model result or simulating a fault in a system component or model. Fault injection is typically only allowed in QA systems for obvious reasons.
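A sketch of how such overrides might be applied follows. The flag names (`overrides`, `injectDelayMs`, `logLevel`, `env`) are invented for illustration; the point is only that per-request flags win over environmental defaults, and that fault injection is refused outside QA:

```python
def apply_diagnostics(request, environment):
    """Let per-request diagnostics flags override environmental defaults.
    The flag names here are invented for illustration."""
    settings = dict(environment)
    settings.update(request.get("diagnostics", {}).get("overrides", {}))
    # Fault injection should only be honored in QA systems.
    if settings.get("injectDelayMs") and environment.get("env") != "qa":
        raise ValueError("fault injection is only allowed in QA systems")
    return settings

# A single request asks for trace-level logging without changing global config.
settings = apply_diagnostics(
    {"diagnostics": {"overrides": {"logLevel": "trace"}}},
    {"env": "qa", "logLevel": "info"},
)
```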

Request Specific Fields

Beyond the common fields, every request message includes a return address and the model inputs. These inputs are augmented with the external state information and are given identically to every model. Note that some model inputs, such as images or videos, can be too large or complex to carry in the request directly. In such cases, a reference to the input can be passed instead of the actual input data. The reference is often a filename if you have a distributed file system with a global namespace, or an object reference if you are using a system like S3.

The return address can be something as simple as a topic name in a well-known message stream. Using a stream to deliver results has lots of advantages, such as automatically logging the delivery of results, so it is generally preferred over mechanisms such as REST endpoints.

Output Specific Fields

The output from the models consists of score messages that include the original request identifier as well as a new message identifier and the model outputs themselves. The rendezvous server uses the original request identifier to collect together results for a request in anticipation of returning a response. The return address doesn’t need to be in the score messages, because the rendezvous server will get that from the original request.

The result message has whatever result is selected by the rendezvous server and very little else other than diagnostic and provenance data. The model outputs can have many forms depending on the details of how the model actually works.

Data Format

The data format you use for the messages between components in a rendezvous architecture doesn’t actually matter as much as you might think, especially given the heat that is generated whenever you bring up data format conventions in a discussion. The cost of moving messages in inefficient formats, including serializing and deserializing data, is typically massively overshadowed by the computations involved in evaluating a model. We have shown the messages as if they were JSON, but that is just because JSON is easy to read in a book. Other formats such as Arrow, Avro, Protobuf, or OJAI are more common in production. There is a substantial advantage to self-describing formats that don’t require a schema repository, but even that isn’t a showstopper.

That being said, there is huge value in consensus about messaging formats. It is far better to use a single suboptimal format everywhere than to split your data teams into factions based on format. Pick a format that everybody likes, or go with a format that somebody else already picked. Either way, building consensus is the major consideration and dominates anything but massive technical considerations.

Stateful Models

The basic rendezvous architecture allows for major improvements in the management of models that are pure functions, that is, functions that always give the same output if given the same input.

Some models are like that. For instance, the TensorChicken model described in Why Model Management? will recognize the same image exactly the same way no matter what else it has seen lately. Machine translation and speech recognition systems are similar. Only deploying a new model changes the results.

Other models are definitely not stateless. In general, we define a stateful model as any model whose output in response to a request cannot be computed just from that one request. In addition, it is useful to talk about “internal state,” which can be computed from the past history of requests, and “external state,” which uses information from the outside world or some other information system.

Card velocity is a great example of internal state. Many credit card fraud models look at where and when recent transactions happened. This allows them to determine how fast the card must have moved to get from one transaction to the next. This card velocity can help detect cloned cards. Card velocity is a stateful feature because it doesn’t just depend on the current transaction; it also depends on the previous card-present transaction. Stateful data like this can depend on a single entity like a user, website visitor, or card holder, or it can be a collective value such as the number of transactions in the last five minutes or the expected number of requests estimated from recent rates. These are all examples of internal state because every model could compute its own version of internal state from the sequence of requests completely independently of any other model.
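The card velocity computation itself is straightforward to sketch. This uses the haversine formula for great-circle distance; the transaction fields (`lat`, `lon`, `time` in Unix seconds) are illustrative, not a fixed schema:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def card_velocity_kmh(prev_txn, txn):
    """Implied speed between two consecutive card-present transactions.
    Each transaction is a dict with lat, lon, and a Unix timestamp in seconds."""
    distance = haversine_km(prev_txn["lat"], prev_txn["lon"],
                            txn["lat"], txn["lon"])
    hours = (txn["time"] - prev_txn["time"]) / 3600.0
    return distance / hours if hours > 0 else float("inf")

# A card used in New York and then in London 30 minutes later is suspicious.
ny = {"lat": 40.71, "lon": -74.01, "time": 0}
london = {"lat": 51.51, "lon": -0.13, "time": 1800}
velocity = card_velocity_kmh(ny, london)   # implied speed in the thousands of km/h
```

Note that this computation needs only the stream of past requests, which is what makes card velocity internal state.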

External state looks very different. For instance, the current temperature in the location where a user is located is an example of external state. A customer’s current balance may also be an external state variable if a model doesn’t get to see every change of balance. Outputs from other models are also commonly treated as external state.

In a rendezvous model, it is a best practice to add all external state to the requests that are sent to all models. This allows all external state to be recorded. Since external state can’t be re-created from the input data alone, archiving all external state used by the models is critical for reproducibility.

For internal state, on the other hand, you have a choice. You can compute the internal state variables as if they were external state and add them to the requests for all models. This is good if multiple models are likely to use the same variables. Alternatively, you can have each model compute its own internal state. This is good if the computation of internal state is liable to change. Figure 3-5 shows how to arrange the computations.

Figure 3-5. With stateful models, all dependence on external state should be positioned in the main rendezvous flow so that all models get exactly the same state.

The point here is that all external state computation should be external to all models. Forms of internal state that have stable and commonly used definitions can be computed and shared in the same way as external state or not, according to preference. As we have mentioned, the key rationale for dealing with these two kinds of state in this way is reproducibility. Dealing with state as described here means that we can reproduce the behavior of any model by using only the data that the decoy model has recorded and nothing else. The idea of having such a decoy model that does nothing but archive common inputs is described more fully in the next section.
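A preprocessing step that injects external state into every request before fan-out might look like this minimal sketch. The lookup services (`profile_db`, `weather_service`) and field names are stand-ins for whatever your system actually uses:

```python
def enrich_request(request, profile_db, weather_service):
    """Add external state to a request before it is fanned out to the models.
    profile_db and weather_service are stand-ins for real lookup services."""
    user = request["inputs"]["userId"]
    request["externalState"] = {
        "profile": profile_db.get(user, {}),
        "temperatureC": weather_service(request["inputs"].get("location")),
    }
    return request

# Every model (and the decoy) now sees exactly the same external state.
profiles = {"u1": {"segment": "gold"}}
req = enrich_request(
    {"inputs": {"userId": "u1", "location": "Oslo"}},
    profiles,
    lambda location: 21.5,
)
```

Because the enrichment happens once, before the input stream, no model can observe a different profile or temperature than any other, and the decoy archives exactly what was used.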

The Decoy Model

Nothing is ever quite as real as real data. As a result, recording live input data is extraordinarily helpful for developing and evaluating machine learning models. This doesn’t seem at first like it would be much of an issue, and it is common for new data scientists to make the mistake of trusting that a database or log file is a faithful record of what data was or would have been given to a model. The fact is, however, that all kinds of distressingly common events can conspire to make reconstructed input data different from the real thing. The upshot is that you really can’t expect data collected as a byproduct of another function to be both complete and correct. Just as unit tests and integration tests in software engineering are used to isolate different kinds of error to allow easier debugging, recording real data can isolate data errors from modeling errors.

The simplest way to ensure that you are seeing exactly what the models are seeing is to add what is called a decoy model into your system. A decoy model looks like a model and accepts data like any other model, but it doesn’t emit any results (that is, it looks like a duck and floats like a duck, but it doesn’t quack). Instead, it just archives the inputs that it sees. In a rendezvous architecture, this is really easy, and it is also really easy to be certain that the decoy records exactly what other models are seeing because the decoy reads from the same input stream.
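A decoy can be sketched in a few lines. Here the input stream is any iterable of request dicts and the archive is an in-memory list of JSON lines; a production decoy would consume the real input stream and write to durable storage:

```python
import json

class DecoyModel:
    """Reads from the same input stream as the real models but only archives.
    It never emits a result to the scores stream."""

    def __init__(self):
        self.archive = []

    def consume(self, input_stream):
        for request in input_stream:
            # Archive exactly what every other model sees, nothing more.
            self.archive.append(json.dumps(request, sort_keys=True))

decoy = DecoyModel()
decoy.consume([{"messageId": "m1", "inputs": {"x": 1}}])
```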

Figure 3-6 shows how a decoy model is inserted into a rendezvous architecture.

Figure 3-6. A decoy model inserted into the rendezvous architecture doesn’t produce results. It just archives the inputs that all models see, including external state.

A decoy model is absolutely crucial when the model inputs contain external state information from data sources such as a user profile database. When this happens, this external data should be added directly into the model inputs using a preprocessing step common to all models, as described in the previous section. If you force all external state into all requests and use a decoy model, you can know exactly what external state the models saw. Handling external state like this can resolve otherwise intractable problems with race conditions between updates to external state and the evaluation of a model. Having a decoy that gets exactly the same input as every other model avoids that kind of question.

The Canary Model

Another best practice is to always run a canary model even if newer models provide more accuracy or better performance. The point of the canary is not really to provide results (indeed, the rendezvous server will probably be configured to always ignore the canary). Instead, the canary is intended to provide a scoring baseline to detect shifts in the input data and as a comparison benchmark for other models. Figure 3-7 shows how the canary model can be used in the rendezvous architecture.

Figure 3-7. Integrating a canary model into the rendezvous architecture provides a useful comparison for baseline behavior, both for input data and for other models.

For detecting input shifts, the distribution of outputs for the canary can be recorded, and recent distributions can be compared to older distributions. For simple scores, the distribution of scores can be summarized over short periods of time using a sketch like the t-digest. These sketches can be aggregated to cover any desired period of time, and differences between them can be measured (for more information on this, see Meta Analytics). We then can monitor this difference over time, and if it jumps up in a surprising way, we can declare that the canary has detected a difference in the inputs.
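The text suggests t-digest sketches for this; as a simpler stand-in, the same monitoring idea can be illustrated with fixed-bin histograms and total-variation distance. All names here are ours, and scores are assumed to lie in [0, 1):

```python
def score_histogram(scores, bins=10):
    """Coarse histogram of scores assumed to lie in [0, 1)."""
    counts = [0] * bins
    for s in scores:
        counts[min(int(s * bins), bins - 1)] += 1
    total = len(scores)
    return [c / total for c in counts]

def histogram_distance(h1, h2):
    """Total-variation distance between two score distributions (0 to 1)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))

# Canary scores from an older window versus a recent one.
old = score_histogram([0.1, 0.12, 0.15, 0.9, 0.88])
new = score_histogram([0.5, 0.52, 0.55, 0.51, 0.49])
drift = histogram_distance(old, new)   # near 1.0: the canary's inputs changed
```

A t-digest does the same job with much better resolution in the tails and cheap aggregation across time windows, which is why it is preferred in practice.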

We also can compare the canary directly to other models. As a bonus, we not only can compare aggregated distributions, we can use the request identifier to match up all the model results and compare each result against all others.


For more ideas about how to compare models, see Machine Learning Model Evaluation.

It might seem a bit surprising to compare new models against a very old model instead of each other. But the fact of the canary’s age is exactly why it is so useful. Over time, every new model will have been compared to the canary during the preproduction checkout and warm-up period. This means that the DataOps teams will have developed substantial experience in comparing models to the canary and will be able to spot anomalies quickly. When new models are compared to new models, however, you aren’t quite as sure what to expect precisely because neither of the models has much track record.

Adding Metrics

As with any production system, reporting metrics on who is doing what to whom, and how often, is critical to figuring out what is really going on in the system. Metrics should not be (but often are) an afterthought to be added after building an entire system. Good metrics are, however, key to diagnosing all kinds of real-world issues that crop up, be they model stability issues, deployment problems, or problems with the data logistics in a system, and they should be built in from the beginning.

With ordinary microservices, the primary goal of collecting metrics is to verify that a system is operating properly and, if not, to diagnose the problem. Problems generally have mostly to do with whether or not a system meets service-level agreements.

With machine learning models, we don’t just need to worry about operational metrics (did the model answer, was it quick enough?); we also need to worry about the accuracy of the model (when it did answer, did it give the right answer?). Moreover, we usually expect a model to have some error rate, and it is normal for accuracy to degrade over time, especially when the model has real-world adversaries, as in fraud detection or intrusion detection. In addition, we need to worry about whether the input data has changed in some way, possibly by data going missing or by a change in the distribution of incoming data. We go into more detail on how to look for data changes in Meta Analytics.

To properly manage machine learning models, we must collect metrics that help us to understand our input data and how our models are performing on both operational and accuracy goals. Typically, this means that we need to record operational metrics to answer operational questions and need to record scores for multiple models to answer questions about accuracy. Overall, there are three kinds of questions that need to be answered:

  • Across all queries, possibly broken down by tags on the requests, what are the aggregate values of certain quantities like number of requests and what is the distribution of quantities like latency? What are the properties of our inputs and outputs in terms of distribution? We answer these questions with aggregate metrics.
  • On some subset of queries, what are the specific times taken for every step of evaluation? Note that the subset can be all queries or just a small fraction of all queries. We can answer these questions with latency traces.
  • On a large number of queries, what are the exact model outputs broken down by model version for each query? This will help us compare accuracy of one model versus another. We answer these questions by archiving inputs with the decoy server and outputs in the scores stream.

The first kind of metrics helps us with the overall operation of the system. We can find out whether we are meeting our guarantees, how traffic volumes are changing, and how to size the system going forward, and we can diagnose system-level issues like bad hardware or noisy neighbors. We also can watch for unexpected model performance or input data changes. It might be important to be able to inject tags into these metrics so that we can drill into the aggregates to measure performance for special customers, for queries that came from particular sources, or where we have some other hint that there is a class of requests to pay special attention to. We talk more about analyzing aggregated metrics in Meta Analytics.

The second kind of metrics helps us drill into the specific timing details of the system. This can help us debug issues in rendezvous policies and find hot-spots in certain kinds of queries. These trace-based measurements are particularly powerful if we can trigger the monitoring on a request-by-request basis. That allows us to run low volume tests on problematic requests in a production setting without incurring the data volume costs of recording traces on the entire production volume. For systems with low to moderate volumes of requests, the overhead of latency tracing doesn’t matter, and it can be left on for all requests. Meta Analytics provides more information about latency tracing.

The third kind of metrics is good for measuring which models are more accurate than others. In this section, we talk about how the rendezvous helps gather the data for that kind of comparison, but check out Machine Learning Model Evaluation for more information on how models can actually be compared to each other.

In recording metrics, there are two main options. One is to insert data inline into messages as they traverse the system. The virtue here is that we can tell everything that has happened to every request, which is great for detecting problems with any particular request.

The alternative, putting all of the metrics into a side channel, has almost exactly the opposite virtues and vices relative to inline metrics. Tracking down information on a single request requires a join or search, but aggregation is easier and the amount of additional data in the requests themselves is very small. Metric storage and request archives can be managed independently, and security for metrics is separated from security for requests.

For machine learning applications, a blend of both options is often best. A few metrics, such as processing time and which model’s results were used, can be very useful to the process that issued the request in the first place; those should be put inline in the response. Other metrics (probably larger by far) are of more interest to the operations specialists on the DataOps team, so separate storage is probably warranted for them. It is still important to carry a key that joins the different metrics together.
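The blended approach might look like the following sketch: a few small fields ride inline in the response, bulkier detail goes to a side channel, and a shared key joins the two. The field names and the list standing in for a metrics stream are illustrative assumptions, not prescribed by the architecture.

```python
import uuid

side_channel = []  # stand-in for a separately managed metrics stream

def respond(request, decision, latency_ms, model_used):
    """Split metrics between the inline response and a side channel,
    joined by a shared key (field names are hypothetical)."""
    metrics_key = str(uuid.uuid4())

    # Inline: only what the caller is likely to want immediately.
    response = {
        "decision": decision,
        "model": model_used,
        "latency_ms": latency_ms,
        "metrics_key": metrics_key,
    }

    # Side channel: bulkier operational detail, stored independently.
    side_channel.append({
        "metrics_key": metrics_key,
        "request_id": request["id"],
        "request_size": len(str(request)),
        "model": model_used,
        "latency_ms": latency_ms,
    })
    return response

r = respond({"id": "q1"}, decision="approve", latency_ms=12.5, model_used="m2")
# r["metrics_key"] joins the response back to its side-channel record
```

The join key is what lets a DataOps specialist reconstruct the full story of one request later, while keeping the response itself lean.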

Anomaly Detection

On seriously important production systems, you also should be running some form of automated analytics on your logs. It can help dramatically to do some anomaly detection, particularly on latency for each model step. We described how to automate much of the anomaly detection process in our book Practical Machine Learning: A New Look at Anomaly Detection (O’Reilly, 2014). Those methods are very well suited to the components of a rendezvous architecture. The basic idea is that there are patterns in the metrics that you collect that can be automatically detected and give you an alert when something gets seriously out of whack. The model latencies, for instance, should be nearly constant. The number of requests handled per second can be predicted based on request rates over the last few weeks at a similar time of day.
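As a toy illustration of the "nearly constant latency" idea, the sketch below flags a model step whose latency drifts far from its exponentially weighted running mean. This is a deliberately minimal stand-in; the class name and thresholds are our own assumptions, and the methods in Practical Machine Learning: A New Look at Anomaly Detection are considerably richer.

```python
class LatencyWatch:
    """Minimal sketch: flag a latency far outside its usual band,
    using exponentially weighted estimates of mean and variance."""

    def __init__(self, alpha=0.05, threshold=4.0):
        self.alpha = alpha          # smoothing factor for the baseline
        self.threshold = threshold  # alert beyond this many std devs
        self.mean = None
        self.var = 1.0

    def observe(self, latency_ms):
        if self.mean is None:
            self.mean = latency_ms  # first observation seeds the baseline
            return False
        dev = latency_ms - self.mean
        alert = abs(dev) > self.threshold * (self.var ** 0.5)
        # Exponentially weighted updates keep the baseline current.
        self.mean += self.alpha * dev
        self.var = (1 - self.alpha) * (self.var + self.alpha * dev * dev)
        return alert

watch = LatencyWatch()
for ms in [10.1, 9.8, 10.3, 10.0, 9.9]:
    watch.observe(ms)       # settles the baseline around 10 ms
alert = watch.observe(55.0)  # far outside the usual band
```

A real deployment would run this kind of check per model and per step, and would feed the alerts into whatever paging or dashboard system the DataOps team already uses.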

Rule-Based Models

Nothing says that we have to use machine learning to create models, even if the title of this book seems to imply that you do.

In fact, it can be quite useful to build some models by hand using a rule-based system. Rule-based models can be useful whenever there are hard-and-fast requirements (often of a regulatory nature) that require exact compliance. For example, if you have a requirement that a customer can’t be called more often than once every 90 days, rules can be a good option. On the other hand, rules are typically a very bad way to detect subtle relationships in the data; thus, most fraud detection models are built using machine learning. It’s fairly common to combine these types of systems in order to get some of the best out of both types. For instance, you can use rules to generate features for the machine learning system to use. This can make learning much easier. You could also use rules to post-process the output of a machine learning system into specific actions to be taken.
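The 90-day contact rule above can be expressed as a rule-derived feature (or hard gate) handed to a learned model, which is one way to combine the two approaches. The customer record layout and feature names here are hypothetical.

```python
from datetime import date

def rule_features(customer, today):
    """Hypothetical rule-derived features to feed a learned model.
    The 'callable' flag encodes a hard regulatory rule exactly."""
    days_since_call = (today - customer["last_call"]).days
    return {
        "callable": days_since_call >= 90,   # hard-and-fast rule
        "days_since_call": days_since_call,  # soft signal for the model
    }

feats = rule_features({"last_call": date(2024, 1, 1)}, date(2024, 6, 1))
```

A downstream machine learning model can then learn from `days_since_call` as an ordinary feature, while the `callable` flag enforces the regulation exactly, regardless of what the model would prefer.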

That said, all of the operational aspects of the rendezvous model apply just as well if you are using rule-based models or machine learning models, or even if you are using composites. The core ideas of injecting external state early, recording inputs using a decoy, comparing to a canary model, and using streams to connect components to a rendezvous server still apply.

Using Prebuilt Containers

All models in a rendezvous architecture should be containerized to make management easier. It doesn’t much matter which container or orchestration framework you pick, but containerization pays big benefits. The most important of these is that containers provide a consistent runtime environment for models as well as a way to version-control configuration. This means that data scientists can have a prototyping environment that is functionally identical to the production environment. Containers are also useful because the operations specialists in a DataOps team can provide scaffolding in a standardized container to read requests from input streams and to put results into output streams. This scaffolding can also handle all metrics reporting, which, frankly, isn’t likely to be a priority for the data science part of the team—at least not until they need to debug a nasty problem involving interactions between supposedly independent systems.
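The scaffolding idea can be sketched as a wrapper the ops team standardizes once: the data scientist supplies only the model function, and the wrapper handles stream I/O and metrics. The stream and metrics interfaces below are stand-ins (plain lists); real scaffolding would bind to the message transport of the data fabric.

```python
import time

def serve(model, input_stream, output_stream, metrics):
    """Hypothetical standardized scaffolding: read requests from an
    input stream, score them with the supplied model, write results
    to an output stream, and report latency metrics on the side."""
    for request in input_stream:
        t0 = time.monotonic()
        result = model(request)
        metrics.append({
            "id": request["id"],
            "latency_ms": (time.monotonic() - t0) * 1000,
        })
        output_stream.append({"id": request["id"], "score": result})

# Toy usage with lists standing in for message streams.
inp = [{"id": "q1", "x": 2.0}, {"id": "q2", "x": 3.0}]
out, stats = [], []
serve(lambda r: r["x"] * 2, inp, out, stats)
```

Because the scaffolding, not the model, owns stream handling and metrics, every model in the container fleet reports metrics the same way without any effort from the data scientists.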

Having container specifications prebuilt can make it enormously easier to support different machine learning toolsets, as well. More and more, the trend by DataOps teams is to use “one of each” as a modeling technology strategy. With containers already set up for each class of model, having diversity in modeling technologies becomes an advantage rather than a challenge.

Containers also facilitate the use of source code version tracking to record changes in configuration and environment of models. Losing track of configuration and suffering from surprising environmental changes is a very common cause of mysterious problems. Assumptions about configuration and environments can make debugging massively more difficult. Falsifying such assumptions early by taking control of these issues is key to successfully replicating (or, better, avoiding) production problems.