3.7.2. Checkpointing for machine learning

Checkpointing means storing the current state of a program so that the work can be continued at a different time and/or place. Checkpoints are helpful in case of failures or crashes, and they save time and resources because not everything has to be re-run.

Checkpointing an arbitrary workflow can be almost endlessly complex. Very repetitive workflows, however, can be checkpointed with little effort. Common examples in the context of high energy physics are event loops and machine learning workflows. For event loops, checkpointing is not so interesting, since this work is trivially parallelizable: the workload can be split into independent batches to minimize the risk of losing work.
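The batch idea for event loops can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `split_into_batches`, `run_batches` and the `process_batch` callback are hypothetical names and not part of basf2 or of any tool mentioned here. In practice each batch would typically run as its own job rather than in a sequential loop.

```python
def split_into_batches(n_events, batch_size):
    """Return half-open ranges (first, last) that cover all events exactly once."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [
        (first, min(first + batch_size, n_events))
        for first in range(0, n_events, batch_size)
    ]


def run_batches(batches, process_batch):
    """Process every batch independently and report which ones failed.

    process_batch is a hypothetical user-supplied callable. A failure in one
    batch does not affect the results of the others.
    """
    results, failed = {}, []
    for batch in batches:
        try:
            results[batch] = process_batch(batch)
        except Exception:
            # Only this batch is lost and has to be resubmitted.
            failed.append(batch)
    return results, failed
```

If one batch crashes, only that batch has to be repeated, which is why a dedicated checkpointing mechanism adds little for this kind of workload.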

Training a machine learning model, however, cannot be split up in this way, since each epoch depends on the result of the epoch before. Additionally, the training time can be very long, which makes crashes more costly and makes time limits on host sites a big issue. Checkpointing can therefore be very useful.
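To make the idea concrete, here is a minimal sketch of a training loop that stores its state after every epoch and resumes from the last stored epoch when it is restarted. It uses only the Python standard library and a toy "model" (a single number). All function names, the `stop_after` parameter and the file layout are assumptions made for illustration; this is not the interface of the tool mentioned in this section.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp_file:
        json.dump(state, tmp_file)
    # Renaming within the same directory replaces the old checkpoint in one step.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the stored state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as checkpoint_file:
            return json.load(checkpoint_file)
    return {"epoch": 0, "weight": 0.0}


def train(path, n_epochs, stop_after=None):
    """Toy training loop; stop_after simulates a crash or a wall-time limit."""
    state = load_checkpoint(path)
    for epoch in range(state["epoch"], n_epochs):
        if stop_after is not None and epoch >= stop_after:
            break  # simulated interruption
        # Stand-in for one real training epoch: move the weight halfway to 1.0.
        state["weight"] += 0.5 * (1.0 - state["weight"])
        state["epoch"] = epoch + 1
        save_checkpoint(path, state)
    return state
```

Calling `train("model.json", 5, stop_after=2)` and then `train("model.json", 5)` ends in the same state as a single uninterrupted run of five epochs, because the second call picks up at epoch 2 instead of starting over. A real training would also need to store the optimizer state and random number generator state to resume faithfully.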

A tool to help with checkpointing can be found on GitHub.

To learn how to use it, please refer to the documentation and consult the examples.

Stuck? We can help!

If you get stuck or have any questions about the online book material, the #starterkit-workshop channel in our chat is full of nice people who will provide fast help.

Refer to Collaborative Tools for other places to get help if you have specific or detailed questions about your own analysis.

Improving things!

If you know how to do it, we recommend that you report bugs and other requests on GitLab. Make sure to use the documentation-training label of the basf2 project.

If you just want to give us feedback, please open a GitLab issue and add the label online_book to it.

Please be as precise as possible to make it easier for us to fix things! For example:

typos (where?)
missing bits of information (what?)
bugs (what did you do? what goes wrong?)
exercises that are too hard (which ones?)
etc.

If you are familiar with git and want to create your first merge request for the software, take a look at How to contribute. We’d be happy to have you on the team!
