Different Types of Model Deployment Strategies for Machine Learning

A deployment strategy is any technique an MLOps team uses to launch a new version of the machine learning model it provides. These techniques cover how network traffic in a production environment is transitioned from the old version of the model to the new one. In other words, a deployment strategy is a way to change or upgrade an application, with the aim of making the change without downtime and in a way that users barely notice.

There are many options for deploying models, but let’s focus on some of the common strategies for rolling out new or updated models. This matters because you want to deploy new models in a way that minimizes risk and downtime while still measuring the performance of the new model or model version.

For example, if you have a newer version of a model, you typically don’t want to deploy it in a way that disrupts service. You may also want to monitor the performance of the new version for a period of time, in a way that allows you to roll back seamlessly if there is an issue with it.

In this blog post, we’ll walk through some of the most common deployment strategies:

  • blue/green deployments
  • shadow/challenger deployments
  • canary deployments
  • A/B testing
  • and finally, multi-armed bandits
Common strategies to deploy new and updated models

Blue/Green Deployments

Let’s start with blue/green deployments. Some of you may be familiar with the concept of blue/green deployments for applications or software; the same concept applies to models as well. With a blue/green deployment, you deploy your new model version to a stack that runs alongside the one currently serving prediction request and response traffic coming into an endpoint. Then, when you’re ready for the new model version to start processing prediction requests, you swap the traffic over to it.

This makes it easy to roll back: if the new model version has issues or doesn’t perform well, you simply swap traffic back to the previous version.

Let’s take a closer look at how a blue/green deployment works. With a blue/green deployment, you have a current model version running in production; in this case, version 1. It accepts 100 percent of the prediction request traffic and returns the prediction responses.

When you have a new model version to deploy, in this case model version 2, you build a new server or container to host it. This includes not only the new model version but also the code and software needed to accept and respond to prediction requests. As you can see in the picture, the new model version is deployed, but the load balancer has not yet been updated to point to the new server hosting it, so no traffic is hitting that endpoint yet.

After the new model version is deployed successfully, you shift 100 percent of your traffic to the cluster serving model version 2 by updating your load balancer. This strategy helps reduce downtime if you need to roll back, because swapping back to version 1 only requires re-pointing your load balancer.

Shift all traffic to new model
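To make the cutover mechanics concrete, here is a minimal, platform-agnostic sketch in Python. The router class, endpoint URLs, and method names are illustrative assumptions rather than any specific platform’s API; the point is simply that cutting over and rolling back are the same single re-pointing operation.

```python
# Hypothetical sketch of a blue/green swap: a router holds the active target,
# and cutover (or rollback) is a single pointer change.

class BlueGreenRouter:
    def __init__(self, blue_endpoint: str, green_endpoint: str):
        self.endpoints = {"blue": blue_endpoint, "green": green_endpoint}
        self.active = "blue"  # version 1 currently serves 100 percent of traffic

    def route(self, request: dict) -> str:
        # All prediction requests go to the single active stack.
        return f"sent {request} to {self.endpoints[self.active]}"

    def swap(self) -> None:
        # Shift 100 percent of traffic to the other stack; calling it again rolls back.
        self.active = "green" if self.active == "blue" else "blue"


# Illustrative endpoint URLs.
router = BlueGreenRouter("http://model-v1.internal/invocations",
                         "http://model-v2.internal/invocations")
print(router.route({"features": [1, 2, 3]}))  # served by version 1
router.swap()                                  # cut over to version 2
print(router.route({"features": [1, 2, 3]}))  # served by version 2
router.swap()                                  # roll back if version 2 misbehaves
```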

The downside to this strategy is that it is a 100 percent swap of traffic. If the new model version, version 2 in this case, is not performing well, you run the risk of serving bad predictions to 100 percent of your traffic rather than to a smaller percentage.

Shadow or Challenger Deployment

Let’s now cover the second type of deployment strategy, called shadow or challenger deployment. The new version is often referred to as the challenger model because you run it in production and let it accept prediction requests to see how it would respond, but you don’t actually serve the prediction responses from it. This lets you validate the new model version against real traffic without impacting live prediction responses.

Let’s take a look at how it works. With the shadow or challenger deployment strategy, the new model version is deployed and 100 percent of prediction request traffic is sent to both versions. However, for version 2, only the prediction requests are sent to the model; its prediction responses are not actually served. The responses that version 2 would have returned are typically captured and then analyzed to determine whether version 1 or version 2 would have performed better against the full traffic load.

Run multiple versions in parallel with one serving live traffic

This strategy minimizes the risk of deploying a new model version that may not perform as well as version 1, because you can analyze how version 2 would perform without actually serving its prediction responses. Once you are comfortable that model version 2 is performing better, you can start serving prediction responses directly from it instead of from version 1.
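As a rough illustration of this pattern, here is a minimal sketch in Python. The stand-in models and the in-memory log are assumptions for illustration (a real system would write to whatever capture store you use); only the champion’s answer is returned to the caller, while the challenger’s answer is recorded for offline comparison.

```python
# Sketch of a shadow (challenger) deployment: every request goes to both
# versions, but only the champion's response is returned to the caller.

from typing import Any, Callable

shadow_log: list[dict] = []  # captured for later champion-vs-challenger analysis

def serve(request: Any,
          champion: Callable[[Any], Any],
          challenger: Callable[[Any], Any]) -> Any:
    live_response = champion(request)      # version 1 serves live traffic
    shadow_response = challenger(request)  # version 2 sees the same traffic
    shadow_log.append({"request": request,
                       "champion": live_response,
                       "challenger": shadow_response})
    return live_response                   # challenger output is never served


# Stand-in models for illustration.
model_v1 = lambda x: x * 2
model_v2 = lambda x: x * 2 + 1

print(serve(10, model_v1, model_v2))  # the caller only ever sees version 1's answer
print(shadow_log)                     # both answers are available for offline analysis
```

In practice you would typically invoke the challenger asynchronously so the shadow call does not add latency to live responses.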

Canary Deployment

The next deployment strategy is canary deployment. With a canary deployment, you split traffic between model versions and expose only a small group of users to the new model version. Typically, you expose that select set of users to the new model for a short period of time so you can validate its performance before fully rolling the new version out to production.

With canary deployments, you’re essentially splitting traffic between two model versions: a smaller, specific group is exposed to the new version while model version 1 still serves the majority of your traffic. In the image, 95 percent of prediction requests and responses are served by Model Version 1, and a smaller set of users is directed to Model Version 2. Canary deployments are good for validating a new model version with a specific or smaller set of users before rolling it out to everyone, which is something a blue/green deployment can’t do.
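Below is a minimal sketch of how a 95/5 canary split might be routed. The 5 percent fraction and the hash-based bucketing are illustrative choices; hashing the user ID is just one way to keep the same users pinned to the same version for the duration of the canary.

```python
# Sketch of a canary split: a small, configurable fraction of users is routed
# to the new version, and everyone else stays on version 1.

import hashlib

CANARY_FRACTION = 0.05  # 5 percent of traffic goes to model version 2

def choose_version(user_id: str) -> str:
    # Hashing the user id keeps routing sticky: the same user always lands
    # on the same version for the life of the canary.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "model-v2" if bucket < CANARY_FRACTION * 100 else "model-v1"


counts = {"model-v1": 0, "model-v2": 0}
for i in range(10_000):
    counts[choose_version(f"user-{i}")] += 1
print(counts)  # roughly a 95/5 split
```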

A/B Testing

The next deployment strategy is A/B testing. Canary deployments and A/B testing are similar in that both split traffic. However, with A/B testing you typically split traffic between larger groups and for longer periods of time in order to measure the performance of different model versions. The split can be done by targeting specific user groups or simply by randomly distributing a percentage of traffic to each group.

Let’s take a closer look at A/B testing. With A/B testing, you’re also splitting your traffic to compare model versions, but here you split it between larger groups for the purpose of comparing different model versions in a live production environment. You typically do a larger split across users, for example 50 percent to one model version and 50 percent to the other. You can also perform A/B testing against more than two model versions, although that isn’t shown here.

While A/B testing seems similar to canary deployments, it tests those larger groups and typically runs for longer periods of time. A/B tests are focused on gathering live data about different model versions, and they run long enough to gather performance data that is statistically significant, which gives you the confidence to roll version 2 out to a larger percentage of traffic. Because you’re running multiple models for longer periods of time, A/B testing lets you validate your different model versions across many variations of user behavior.
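Here is a minimal sketch of a hash-based 50/50 assignment with simple per-variant metric tracking. The experiment name, the accuracy metric, and the simulated outcomes are all made up for illustration, and a real test would add a proper statistical significance check before declaring a winner.

```python
# Sketch of an A/B split with per-variant metric tracking.

import hashlib
from collections import defaultdict

def assign_group(user_id: str, experiment: str = "star-rating-model-v2") -> str:
    # Salt with the experiment name so different tests get independent splits.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"  # 50/50 split

metrics = defaultdict(lambda: {"requests": 0, "correct": 0})

def record_outcome(user_id: str, was_correct: bool) -> None:
    group = assign_group(user_id)
    metrics[group]["requests"] += 1
    metrics[group]["correct"] += int(was_correct)

# Simulated outcomes accumulated over the life of the test.
for i in range(1_000):
    record_outcome(f"user-{i}", was_correct=(i % 3 != 0))

for group, m in sorted(metrics.items()):
    print(group, m["correct"] / m["requests"])  # per-variant accuracy so far
```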

For instance, a forecasting use case with seasonality requires capturing how your model performs as the environment changes over time. So far, we’ve covered the common static approaches to deploying new or updated models.

Multi-armed Bandits Deployment

All of the approaches covered above are static, meaning that you manually decide things like when to swap traffic and how to distribute it. Next, we’ll cover an approach that is more dynamic in nature: instead of manually deciding when and how to distribute traffic, you can use machine learning to automatically decide how to distribute traffic between multiple versions of a deployed model.

For this, we’ll look at multi-armed bandits. A/B tests are fairly static and need to run for a period of time, which means you risk serving a bad or low-performing model for that entire period. A more dynamic method for testing is the multi-armed bandit. Multi-armed bandits use reinforcement learning to dynamically shift traffic to the winning model versions, rewarding the winner with more traffic while still exploring the non-winning versions in case those early winners turn out not to be the best models overall.

Let’s take a look at what testing with a multi-armed bandit strategy looks like. In this implementation, you first have an experiment manager, which is essentially a model that uses reinforcement learning to determine how to distribute traffic between your model versions. It chooses which model version to send traffic to based on the current reward metrics and the chosen explore/exploit strategy. Exploitation refers to continuing to send traffic to the winning model, whereas exploration routes some traffic to the other models to see if they can eventually catch up or perform as well. The experiment manager continually adjusts prediction traffic to send more of it to the winning model.
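As a rough sketch of how such an experiment manager might behave, here is a simple epsilon-greedy bandit, which is just one of several possible explore/exploit strategies. The 10 percent exploration rate and the simulated accuracy of each model version are assumptions for illustration.

```python
# Epsilon-greedy sketch of an experiment manager: mostly exploit the current
# best model, keep exploring the others, and update reward estimates as
# ground truth arrives.

import random

class EpsilonGreedyManager:
    def __init__(self, model_names: list, epsilon: float = 0.1):
        self.epsilon = epsilon
        self.counts = {name: 0 for name in model_names}
        self.rewards = {name: 0.0 for name in model_names}

    def choose(self) -> str:
        if random.random() < self.epsilon:               # explore
            return random.choice(list(self.counts))
        return max(self.rewards, key=self.rewards.get)   # exploit current winner

    def update(self, model: str, reward: float) -> None:
        # Incremental average of the observed reward for this model version.
        self.counts[model] += 1
        self.rewards[model] += (reward - self.rewards[model]) / self.counts[model]


manager = EpsilonGreedyManager(["model-v1", "model-v2"])
for _ in range(1_000):
    chosen = manager.choose()
    # Simulated ground truth: pretend model-v2 predicts correctly more often.
    correct = random.random() < (0.80 if chosen == "model-v2" else 0.60)
    manager.update(chosen, reward=1.0 if correct else 0.0)

print(manager.counts)   # most traffic drifts toward the winning version
print(manager.rewards)  # estimated reward per model version
```

Over many requests, the traffic counts drift toward the better-performing version while the exploration budget keeps checking the others.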

As a concrete example, suppose a new product review and its star rating come in, and your model versions try to predict the rating. Model Version 1 predicted a five-star rating, while Model Version 2 predicted four stars. The actual rating was four stars, so Model Version 2 wins this round, and the multi-armed bandit rewards it by sending it more traffic.

In this article, I tried to summarize various deployment strategies that can be used to minimize downtime and evaluate the performance of a new model with no or minimal impact on your users. All of these concepts are general and apply to machine learning on any platform.