Long-running generation jobs can fail after hours of processing due to:
- Network interruptions
- API rate limits
- Out of memory errors
- System crashes
- User interruption (Ctrl+C)
Result: all progress is lost and the job must restart from the beginning.
Example scenario:
- Generating 10,000 samples
- Fails after 8,000 samples (3 hours in)
- No checkpoint → restart from 0
- Total waste: 3 hours of wall-clock time plus the associated compute and API costs
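The scenario above can be avoided with a simple append-only checkpoint. The sketch below is a minimal, hypothetical illustration (the names `generate_sample`, `run_with_checkpoints`, and the JSONL output format are assumptions, not part of any specific library): each completed sample is appended to a file, the file is flushed to disk every `every` samples, and on restart the run resumes from however many lines are already on disk.

```python
import json
import os


def generate_sample(i):
    # Placeholder for the real (expensive) generation call,
    # e.g. a model or API request. Hypothetical for illustration.
    return {"id": i, "text": f"sample-{i}"}


def run_with_checkpoints(total, out_path, every=100):
    """Generate `total` samples, appending each to `out_path` as JSONL.

    On restart, already-written samples are counted and skipped, so a
    crash at sample 8,000 of 10,000 resumes at 8,000 instead of 0.
    Returns the number of samples generated in this run.
    """
    # Resume point: one JSONL line per completed sample.
    done = 0
    if os.path.exists(out_path):
        with open(out_path) as f:
            done = sum(1 for _ in f)

    with open(out_path, "a") as f:
        for i in range(done, total):
            f.write(json.dumps(generate_sample(i)) + "\n")
            # Periodically force the buffer to disk so at most
            # `every` samples are lost on a hard crash.
            if (i + 1) % every == 0:
                f.flush()
                os.fsync(f.fileno())
    return total - done
```

A second invocation with the same `out_path` is a no-op for completed samples, which is what makes interruption (network failure, OOM, Ctrl+C) cheap: the cost of a crash drops from "the whole run" to "at most `every` unflushed samples".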
What about Incremental Checkpointing?