Machine Learning Checkpointing

Checkpointing Deep Learning and Machine Learning Models

Machine learning training is typically a long, time-intensive process. It’s not uncommon to see training jobs running for multiple hours or even multiple days. If a long-running training job stops for any reason, such as a power failure, a system fault, or any other unforeseen error, you have to start the training job from the very beginning, which leads to lost productivity. Even if you don’t encounter any unforeseen errors, there are situations where you want to restart a training job from a known state, for example to try out new experiments. In these situations, you use machine learning checkpointing.

Checkpointing is a way to save the current state of a running training job so that, if the job is stopped, it can be resumed from a known state. Checkpoints are essentially snapshots of the model during training. They include the model architecture, which allows you to recreate the model after training stops; the model weights learned so far; and the training configuration, such as the number of epochs that have been executed, the optimizer used, the loss observed so far, and other metadata.
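
To make this concrete, here is a minimal PyTorch-style sketch of saving such a checkpoint. The model, optimizer, loss value, and file name are hypothetical placeholders chosen for illustration, not a required layout from any particular framework.

```python
import torch
import torch.nn as nn

# Hypothetical model and optimizer, used only to make the sketch runnable.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def save_checkpoint(model, optimizer, epoch, loss, path="checkpoint.pt"):
    """Snapshot of the model in training: weights, optimizer state, and metadata."""
    torch.save(
        {
            "epoch": epoch,                                   # training progress so far
            "model_state_dict": model.state_dict(),           # learned model weights
            "optimizer_state_dict": optimizer.state_dict(),   # optimizer state
            "loss": loss,                                      # loss observed so far
        },
        path,
    )

# Example call with illustrative values.
save_checkpoint(model, optimizer, epoch=10, loss=0.42)
```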

Finally, checkpoints also include the optimizer state, which allows you to resume the training job from exactly where it stopped. When configuring a training job with checkpointing, take two things into consideration: the frequency of checkpointing and the number of checkpoint files you keep each time. If you checkpoint frequently and keep many files, you quickly use up storage; in return, you can resume a stopped training job with little or no loss of training state.
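
Continuing the sketch above, resuming means loading the saved dictionary back into the model and optimizer and picking up at the recorded epoch. The helper name and file path are assumptions carried over from the previous sketch.

```python
import torch

def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    """Restore weights, optimizer state, and progress so training can continue."""
    checkpoint = torch.load(path)
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    start_epoch = checkpoint["epoch"] + 1   # resume from the next epoch
    last_loss = checkpoint["loss"]
    return start_epoch, last_loss
```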

On the other hand, if you checkpoint infrequently and keep only a few files each time, you save on storage space, but some training state may be lost when the training job stops. When configuring these parameters, balance your storage costs against your productivity requirements.
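
One way to express this balance in code is to checkpoint only every N epochs and keep only the K most recent files. The sketch below assumes the hypothetical save_checkpoint helper (and the model and optimizer objects) from the earlier sketch; the specific values of N and K are arbitrary.

```python
import glob
import os

CHECKPOINT_EVERY = 5   # frequency: save every 5 epochs (illustrative value)
KEEP_LAST = 3          # retention: keep only the 3 most recent checkpoint files

def maybe_checkpoint(model, optimizer, epoch, loss, ckpt_dir="checkpoints"):
    """Save a checkpoint on schedule and prune older ones to cap storage usage."""
    if epoch % CHECKPOINT_EVERY != 0:
        return
    os.makedirs(ckpt_dir, exist_ok=True)
    save_checkpoint(model, optimizer, epoch, loss,
                    path=os.path.join(ckpt_dir, f"epoch_{epoch:04d}.pt"))
    # Remove all but the KEEP_LAST most recent checkpoints.
    files = sorted(glob.glob(os.path.join(ckpt_dir, "epoch_*.pt")))
    for old in files[:-KEEP_LAST]:
        os.remove(old)
```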

For example, the Amazon SageMaker Managed Spot capability allows you to save on training costs. Managed Spot is based on Spot Instances, which offer spare, unused capacity to users at discounted prices. SageMaker Managed Spot uses these Spot Instances for hyperparameter tuning and training, and leverages machine learning checkpointing to resume training jobs easily.

Here’s how it works. You start a training job in a Docker container on a Spot Instance, using a training script such as train.py. Since Spot Instances can be preempted and terminated with just a two-minute notice, it is important that train.py implement the ability to save checkpoints and to resume from them. SageMaker Managed Spot does the rest: it automatically backs up the checkpoints to an S3 bucket. If a Spot Instance is terminated because of a lack of capacity, SageMaker Managed Spot continues to poll for additional capacity. Once capacity becomes available, a new Spot Instance is created to resume your training, and the service automatically transfers the dataset and the checkpoints saved in the S3 bucket to the new instance so that training can continue. The key to taking advantage of Managed Spot is implementing your training script so that it periodically saves checkpoints and can resume from a saved checkpoint.
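
As a rough illustration, a Managed Spot training job might be configured with the SageMaker Python SDK along the following lines. The bucket, role ARN, instance type, and framework versions are placeholders, and parameter details should be checked against the SDK version you use; the essential idea is that the script writes checkpoints to the local checkpoint path, which SageMaker syncs with the S3 location.

```python
from sagemaker.pytorch import PyTorch

# All concrete names below (role ARN, bucket, versions) are placeholders.
estimator = PyTorch(
    entry_point="train.py",          # script must save checkpoints and resume from them
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.13",
    py_version="py39",
    use_spot_instances=True,         # request Managed Spot capacity
    max_run=3600,                    # maximum training time, in seconds
    max_wait=7200,                   # maximum time to wait for Spot capacity (>= max_run)
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",  # S3 location SageMaker backs up to
    checkpoint_local_path="/opt/ml/checkpoints",      # container path train.py writes to
)

estimator.fit({"training": "s3://my-bucket/training-data/"})
```

Inside train.py, the script would check the local checkpoint directory at startup and, if a checkpoint file is present, load it (for example with a helper like the load_checkpoint sketch above) before continuing training.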

Summary

  • Saves state of model during training
  • Checkpoints: Snapshot of the model
    • Model Architecture
    • Model weights
    • Training Configuration
    • Optimizer State
  • Frequency and number of checkpoints