Model Deployment Overview – Real-Time Inference vs Batch Inference

When deploying your AI model to production, you need to consider how it will make predictions. The two main inference patterns for AI models are:




Batch inference

Batch inference, sometimes called offline inference, is a simpler inference process in which the model runs at timed intervals and business applications store the predictions. It is an asynchronous process that bases its predictions on a batch of observations. The predictions are stored as files or in a database for end users or business applications.

Real-time (or interactive) inference

Real-time, or interactive, inference is an architecture in which model inference can be triggered at any time and an immediate response is expected. This pattern can be used to analyze streaming data, interactive application data, and more.

There are multiple options for deploying your models, and we’ll explore them at different levels in this article, focusing a bit more on the cloud-specific deployment options. Being able to choose the deployment option that best meets your use case is critical when looking at practical data science for cloud deployments. The two general options you’ll encounter are real-time inference and batch inference.



Real-time Inference

Deploying a model for real-time inference means deploying it to a persistent hosted environment that can serve prediction requests and return prediction responses in real time or near real time. This involves exposing an endpoint with a serving stack that can accept and respond to requests. A serving stack needs to include a proxy that can accept incoming requests and direct them to an application that then uses your inference code to interact with your model.
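To make that serving stack more concrete, here is a minimal sketch of such an endpoint in Python using Flask. The model artifact name (sentiment_model.joblib) and its predict() interface are assumptions for illustration; a managed cloud endpoint service would typically provide the proxy and hosting layers for you.

```python
# Minimal sketch of a real-time inference endpoint (illustrative assumptions only).
# Assumes a scikit-learn style model saved as "sentiment_model.joblib" that
# exposes predict() on a list of raw review strings.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("sentiment_model.joblib")  # loaded once at startup, kept in memory

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()  # e.g. {"review": "The strap broke after a day"}
    prediction = model.predict([payload["review"]])[0]
    return jsonify({"sentiment": str(prediction)})

if __name__ == "__main__":
    # In production this application would sit behind a proxy that accepts
    # incoming requests and directs them to this inference code.
    app.run(host="0.0.0.0", port=8080)
```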

This is a good option when you need low latency combined with the ability to serve new prediction requests as they come in. Example use cases include fraud detection, where you may need to identify whether an incoming transaction is potentially fraudulent in near real time, and product recommendations, where you want to predict the appropriate products based on a customer’s current search history or shopping cart.

Real-time Inference – Product Review Sentiment Analysis Example

For example, let’s take a look at how a real-time persistent endpoint would apply to a product review use case. In this case, assume you need to identify whether a product review is negative and immediately notify a customer support engineer about negative reviews, so that they can proactively reach out to the customer right away. Here you have some type of web application that a consumer enters their product review into. That web application, or a secondary process called by it, coordinates a call to your real-time endpoint, which serves your model, passing in the new product review text. The hosted model then returns a prediction, in this case the negative sentiment class, which can be used to initiate a back-end process that opens a high-severity support ticket for a customer support engineer. Given that your objective here is a quick customer support response, you can see why the model needs to be consistently available through a real-time endpoint that can serve incoming prediction requests and return responses.
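Here is a hedged sketch of how the calling web application, or a secondary process it invokes, might use that endpoint and trigger the support workflow. The endpoint URL, payload fields, and the open_support_ticket() helper are hypothetical placeholders for your own integrations.

```python
# Illustrative client-side flow for the product review example (names are hypothetical).
import requests

ENDPOINT_URL = "https://example.com/predict"  # placeholder for your real-time endpoint


def open_support_ticket(review_text: str) -> None:
    # Placeholder for your ticketing integration (e.g. an internal API call).
    print(f"High-severity ticket opened for review: {review_text!r}")


def handle_new_review(review_text: str) -> None:
    # Call the hosted model synchronously and act on the response right away.
    response = requests.post(ENDPOINT_URL, json={"review": review_text}, timeout=2)
    sentiment = response.json()["sentiment"]
    if sentiment == "negative":
        open_support_ticket(review_text)


handle_new_review("The product arrived broken and support never answered.")
```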

Batch Inference

Let’s now look at batch inference and see how it compares to real-time inference. With batch inference, you aren’t hosting a model that persists and can serve prediction requests as they come in. Instead, you batch those prediction requests, run a batch job against them, and output your prediction responses, typically as batch records as well. Once you have your prediction responses, they can be used in a number of different ways: they are often used for reporting or persisted into a secondary data store for use by other applications or for additional reporting. Use cases focused on forecasting are a natural fit for batch inference. For example, sales forecasting typically uses batch sales data over a period of time to produce a new sales forecast.
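As a rough sketch under simple assumptions (a serialized scikit-learn style model and a CSV file of collected requests, both placeholder names), a batch inference job might look like this:

```python
# Minimal sketch of a batch inference job (file names and columns are assumptions).
import joblib
import pandas as pd

model = joblib.load("sentiment_model.joblib")  # same model artifact as before

# Read the batch of prediction requests collected since the last run.
reviews = pd.read_csv("product_reviews_batch.csv")  # expects a "review_text" column

# Score the whole batch in one pass and write the responses out as batch records.
reviews["predicted_sentiment"] = model.predict(reviews["review_text"].tolist())
reviews.to_csv("product_review_predictions.csv", index=False)
```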



Batch Inference – Product Review Sentiment Analysis Example

In this case, you’d use batch jobs to process those prediction requests and potentially store the predictions for additional visibility or analysis. Let’s go back to the product review case and say your ultimate business goal is to identify vendors with potential quality issues by detecting trends in negative product reviews per vendor.

So in this case, you don’t need a real-time endpoint. Instead, you would use a batch inference job, run at a reasonable frequency that you identify, that takes a batch of product review data as input and processes those predictions. Just as the prediction request data is a set of batch records on input, the prediction responses output by the model are collected as a set of batch records as well. That data could then be persisted so that your analysts can aggregate it and run reports to identify potential issues with vendors that have a large number of negative reviews. Unlike a real-time endpoint, these batch jobs aren’t persistent; they run only for the amount of time it takes to process the batch requests on input.
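For that reporting step, the persisted batch predictions could be aggregated per vendor. This sketch assumes the prediction output file also carries a vendor_id column, which is an assumption for illustration:

```python
# Aggregate stored batch predictions to surface vendors with many negative reviews
# (file and column names are assumptions for illustration).
import pandas as pd

predictions = pd.read_csv("product_review_predictions.csv")

negative_per_vendor = (
    predictions[predictions["predicted_sentiment"] == "negative"]
    .groupby("vendor_id")
    .size()
    .sort_values(ascending=False)
)
print(negative_per_vendor.head(10))  # vendors with the most negative reviews
```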



Online versus batch prediction

| Online prediction | Batch prediction |
| --- | --- |
| Optimized to minimize the latency of serving predictions. | Optimized to handle a high volume of instances in a job and to run more complex models. |
| Returns a response as soon as possible. | Asynchronous request. |

Read More…

Check out another article on Edge Deployment