Edge Deployment of ML Models

Edge computing is the deployment of computing and storage resources at the location where data is produced, ideally placing compute and storage at the same point as the data source, at the network edge.

Keep in mind that deployment to the edge is an option that is not cloud specific, but it becomes a key consideration when you are deploying models closer to your users or in areas with poor network connectivity. In the case of edge deployments, you train your models in another environment, such as the cloud, and then optimize them for deployment to edge devices. This process typically involves compiling or packaging your model in a way that is optimized to run at the edge, which usually means things like reducing the model package size so it can run on smaller devices.
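
As a rough illustration of that optimization step, here is a minimal sketch (not tied to any particular service) that uses TensorFlow Lite post-training quantization to shrink a trained Keras model before shipping it to a small device; the file paths are placeholders.

```python
import tensorflow as tf

# Load a previously trained Keras model (path is a placeholder).
model = tf.keras.models.load_model("my_trained_model.keras")

# Convert to TensorFlow Lite with default post-training quantization,
# which typically reduces the model package size for small edge devices.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# Write the compact artifact that gets packaged for the edge device.
with open("model_edge.tflite", "wb") as f:
    f.write(tflite_model)
```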

For example, you could use something like SageMaker Neo to compile your model in a way that is optimized for running at the edge. These use cases bring your model closer to where it will be used for prediction. Typical examples include manufacturing, where you have cameras on an assembly line and need to make real-time inferences, or use cases where you need to detect equipment anomalies at the edge. Inference data in these cases is often sent back to the cloud for additional analysis, or for collection of ground truth data that can then be used to further optimize your model.
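
As a sketch of what that compilation step might look like, the snippet below starts a SageMaker Neo compilation job through boto3; the job name, role ARN, S3 locations, framework, input shape, and target device are all placeholder assumptions for an assembly-line camera model.

```python
import boto3

sm = boto3.client("sagemaker")

# All names, ARNs, and S3 paths are placeholders for illustration.
sm.create_compilation_job(
    CompilationJobName="defect-detector-neo-compile",
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    InputConfig={
        "S3Uri": "s3://my-bucket/models/defect-detector/model.tar.gz",
        "DataInputConfig": '{"input": [1, 3, 224, 224]}',  # example input shape
        "Framework": "PYTORCH",
    },
    OutputConfig={
        "S3OutputLocation": "s3://my-bucket/models/defect-detector/compiled/",
        "TargetDevice": "jetson_nano",  # compile for a specific edge device
    },
    StoppingCondition={"MaxRuntimeInSeconds": 900},
)
```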

We have now seen the primary options for deploying a model: real-time inference, batch inference, and deploying models to the edge. The right option to choose depends on several factors. The choice to deploy to the edge is typically an obvious one: edge devices are involved, and you might be working with use cases where there is limited network connectivity. You might also be working with Internet of Things (IoT) use cases, or use cases where the cost in terms of time spent on data transfer is not acceptable, even when the response comes back in single-digit milliseconds.

When to use real-time inference vs. batch inference vs. edge deployment

Now, the choice between real-time inference and batch inference typically comes down to the way you need to request and consume predictions, in combination with cost. A real-time endpoint can serve real-time predictions, where each prediction request sent as input is unique and requires an immediate, low-latency response. The trade-off is that a persistent endpoint typically costs more, because you pay for the compute and storage resources required to host that model while the endpoint is up and running. A batch job, in contrast, works well when you can batch your data for prediction and receive your responses back in bulk. These responses can then be persisted to a secondary database that serves real-time applications when there is no need for new prediction requests and responses per transaction. In this case, you can run batch jobs in a transient environment, meaning the compute and storage resources are only active for the duration of your batch job. As a general rule, you should use the option that meets your use case and is the most cost effective.
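
To make that trade-off concrete, here is a minimal sketch using the SageMaker Python SDK that contrasts a persistent real-time endpoint with a transient batch transform job; the model artifact, role ARN, framework versions, instance types, and S3 paths are placeholder assumptions.

```python
from sagemaker.pytorch import PyTorchModel

# Model artifact, role, and versions below are placeholder assumptions.
model = PyTorchModel(
    model_data="s3://my-bucket/models/churn/model.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    entry_point="inference.py",
    framework_version="2.1",
    py_version="py310",
)

# Option 1: persistent real-time endpoint.
# You pay for the instance the whole time the endpoint is running.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
)
# result = predictor.predict(payload)

# Option 2: transient batch transform job.
# Compute and storage exist only for the duration of the job.
transformer = model.transformer(
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://my-bucket/predictions/churn/",
)
transformer.transform(
    data="s3://my-bucket/input/churn/",
    content_type="text/csv",
    split_type="Line",
)
transformer.wait()
```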