Model lineage essentially refers to understanding and tracking all of the inputs that were used to create a specific version of a model.
Content Overview
There are typically many inputs that go into training a specific version of a model. These inputs include things like the version of the data that was used to train the model in combination with the versions of the code and the hyperparameters that were used to build the model.
However, inputs also include things like the versions of the algorithms or frameworks that were used. Depending on how you're building your model, this can also include things like the version of the Docker images that were used for training, as well as the versions of the packages or libraries that were used. You can see a lot goes into model lineage, but it's basically all of the data that tells you exactly how a specific version of a model was built. The list below summarizes these inputs, followed by a short sketch of how they might be captured.
For each version of the trained model, capture:
- Version(s) of the data used
- Version(s) of the code and hyperparameters used
- Version(s) of the algorithms and frameworks
- Version(s) of the training Docker image
- Version(s) of the packages/libraries
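As a minimal sketch, these lineage inputs could be captured as a structured record stored alongside each training run. The `LineageRecord` class and its field names below are hypothetical, not from any particular framework:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LineageRecord:
    """Hypothetical record of the inputs used to build one model version."""
    model_name: str
    model_version: int
    data_versions: List[str]            # e.g. dataset snapshot IDs
    code_version: str                   # e.g. a git commit hash
    hyperparameters: Dict[str, str]     # values used for this training run
    framework_versions: Dict[str, str]  # algorithm/framework versions
    training_image: str                 # Docker image tag used for training
    package_versions: Dict[str, str]    # library versions in the environment
```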
Model Lineage Example
In this case, you can see all of the model inputs that were used to create version 26 of this model. Each of these inputs has one or more versions associated with it, and an input may also carry additional metadata. For your Python code, for example, you probably have the commit hash of the source code commit that was used for this training run, but you may also want to capture additional metadata such as the name of the source code repository. Together, these inputs are the data points that provide a complete picture of how this model was actually built.

You also typically want to capture information about the trained model artifact itself, such as the evaluation metrics for that particular version and the location of the model artifacts. As you can see, this is a lot of information to track. Where does all of this model lineage information get stored, and how do you capture it as part of your machine learning workflow? This is where a model registry comes in.
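As a rough illustration of the example above, the lineage for version 26 might be captured as a record like the following. Every value here (commit hash, repository, metric values, paths) is a made-up placeholder:

```python
# Illustrative only: all values below are placeholders.
model_version_26_lineage = {
    "model_version": 26,
    "inputs": {
        "dataset": {"name": "customer-churn", "version": "v14"},
        "code": {
            "commit_hash": "9f2c1ab",                      # source commit used for training
            "repository": "git@example.com:ml/churn.git",  # where that commit lives
        },
        "hyperparameters": {"learning_rate": 0.01, "epochs": 20},
        "framework": {"scikit-learn": "1.4.2"},
        "training_image": "registry.example.com/training:2.3.1",
    },
    "model_artifact": {
        "evaluation_metrics": {"accuracy": 0.91, "auc": 0.95},
        "location": "s3://example-bucket/models/churn/26/model.tar.gz",
    },
}
```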
What is a Model Registry
A model registry is a central store for managing model metadata and model artifacts. When you incorporate a model registry into your automated pipeline, it provides traceability and auditability for your models, allowing you to manage them effectively, especially once you operate at scale with tens, hundreds, or even thousands of models. A model registry also gives you visibility into how each model version was built, and it typically includes other metadata as well, such as information about the environments where a particular model version is deployed.
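As one concrete option (among many), MLflow ships an open-source model registry. The sketch below assumes an MLflow tracking server with a database-backed registry store is configured; the parameter values, tags, and model names are made up for illustration. It logs the lineage metadata for a run and then registers the resulting artifact as a new model version:

```python
import mlflow
from sklearn.linear_model import LogisticRegression

# Toy training data, for illustration only.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

with mlflow.start_run() as run:
    # Record the lineage inputs for this run.
    mlflow.log_params({"learning_rate": 0.01, "epochs": 20})
    mlflow.set_tags({
        "code_commit": "9f2c1ab",                             # hypothetical commit hash
        "training_image": "registry.example.com/training:2.3.1",
        "dataset_version": "customer-churn:v14",
    })

    model = LogisticRegression().fit(X, y)

    # Placeholder evaluation metrics (not computed from the toy data).
    mlflow.log_metrics({"accuracy": 0.91, "auc": 0.95})
    mlflow.sklearn.log_model(model, "model")

# Register the logged artifact as a new version of "churn-classifier"
# in the central model registry.
mlflow.register_model(f"runs:/{run.info.run_id}/model", "churn-classifier")
```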
What is Artifact Tracking
Artifacts are the outputs of a step or task that can be consumed by the next step in your pipeline, or even deployed directly for consumption by other applications or systems.
In this view, you see your machine learning workflow with its corresponding tasks. Let's assume that these tasks have been automated and you're now orchestrating them into a machine learning pipeline. Each task produces a consumable artifact that becomes the input into the next task, and each of these artifacts has one or more versions associated with it.
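Here is a minimal, framework-agnostic sketch of that hand-off: each step returns a versioned artifact record that the next step consumes, so the pipeline itself accumulates the lineage. All step names, versions, and locations are hypothetical:

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Artifact:
    """Hypothetical descriptor for an artifact produced by a pipeline step."""
    name: str
    version: str
    location: str           # where the artifact is stored
    produced_by: str        # the step that created it
    inputs: Dict[str, str]  # name -> version of the artifacts consumed


def process_data(raw_data_version: str) -> Artifact:
    # ...run feature engineering and write the processed dataset somewhere...
    return Artifact(
        name="processed-training-data",
        version="v14",
        location="s3://example-bucket/data/processed/v14/",
        produced_by="data-processing",
        inputs={"raw-data": raw_data_version},
    )


def train_model(dataset: Artifact) -> Artifact:
    # ...train the model using the processed dataset...
    return Artifact(
        name="churn-model",
        version="26",
        location="s3://example-bucket/models/churn/26/model.tar.gz",
        produced_by="model-training",
        inputs={dataset.name: dataset.version},
    )


def deploy_model(model: Artifact) -> Artifact:
    # ...create or update an endpoint serving this model version...
    return Artifact(
        name="churn-endpoint",
        version="26",
        location="https://api.example.com/churn",
        produced_by="model-deployment",
        inputs={model.name: model.version},
    )


# Each step's output artifact is the next step's input, so the chain of
# `inputs` fields records exactly which versions produced the live endpoint.
dataset = process_data(raw_data_version="2024-06-01")
model = train_model(dataset)
endpoint = deploy_model(model)
```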
For your data task, the processed training dataset is the artifact. In your model-building task, the model artifact that is produced becomes the input into your model deployment task. A machine learning pipeline provides a consistent mechanism to capture the metadata and the versions of the inputs consumed by each step, as well as the artifacts produced by each step.

But why is all of this so important? Operational efficiency is one key reason. When you need to debug something, it's important to know what version is deployed at any given time, as well as what versions of the inputs were used to create that deployable or consumable artifact. It's also important for the reliability of your workload. What if, for example, someone inadvertently deletes a live endpoint? Without knowing exactly how that endpoint was built, it's difficult, if not impossible, to recover it without disrupting your service.

Hopefully you now understand model lineage and artifact tracking, as well as how machine learning pipelines help create a scalable, consistent mechanism for both.