
Conversation


@littlebullGit littlebullGit commented Nov 24, 2025

What does this PR do?

  • Use `strategy.reduce_boolean_decision` instead of `broadcast` in `ModelCheckpoint.file_exists` (see the sketch after this list).
  • Ensure only global rank 0 touches the filesystem when checking for existing checkpoints.
  • Avoid `broadcast_object_list` for a simple boolean in DDP to reduce memory pressure in the checkpoint path.
  • Add a small DDP test with `monitor=None` to exercise this path.
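
A minimal sketch of the new check (the rank-0 guard and the `all=False` argument shown here are illustrative assumptions; the merged commit may differ in detail):

```python
# Sketch only, not the verbatim diff from this PR.
def file_exists(self, filepath, trainer) -> bool:
    """Check across ranks whether a checkpoint file already exists."""
    # Only global rank 0 touches the filesystem.
    exists = self._fs.exists(filepath) if trainer.is_global_zero else False
    # Share the result as a tiny boolean reduction instead of strategy.broadcast,
    # which in DDP goes through torch.distributed.broadcast_object_list.
    return trainer.strategy.reduce_boolean_decision(exists, all=False)
```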

Fixes #19674

Motivation and context

In DDP, `strategy.broadcast` is implemented via `torch.distributed.broadcast_object_list`, which serializes the Python object and can allocate unnecessary GPU memory even for a single boolean. For the "file exists" decision we only need a tiny boolean reduction, so `reduce_boolean_decision` is a better fit and addresses the CUDA OOM reported in #19674 while preserving behavior.
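
For illustration, here is a standalone sketch of the two collectives for a single boolean using raw `torch.distributed` calls (not Lightning's internals; the helper names are made up for this example):

```python
import torch
import torch.distributed as dist


def share_bool_via_broadcast_object_list(flag: bool) -> bool:
    # Pickles the Python object and, with the NCCL backend, moves the payload
    # through CUDA tensors, allocating device memory for size metadata and bytes.
    obj = [flag if dist.get_rank() == 0 else None]
    dist.broadcast_object_list(obj, src=0)
    return bool(obj[0])


def share_bool_via_all_reduce(flag: bool, device: torch.device) -> bool:
    # A one-element tensor reduction: no pickling, just a single scalar on device.
    decision = torch.tensor(int(flag), device=device)
    dist.all_reduce(decision, op=dist.ReduceOp.SUM)
    return bool(decision.item())
```

Lightning's `reduce_boolean_decision` follows the second pattern in its distributed strategies, which is why it avoids the extra allocations in the checkpoint path.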

Dependencies

  • No new runtime dependencies introduced by this PR.
  • Tests rely on pytorch_lightning_enterprise being available, as required by `tests/tests_pytorch/conftest.py`.

Tests

All test commands were run inside the project .venv:

  • python -m pytest tests/tests_pytorch/checkpointing/test_checkpoint_callback_frequency.py
  • python -m pytest tests/tests_pytorch/checkpointing -k "not legacy_checkpoints"
  • python -m pytest tests/tests_pytorch/callbacks/test_model_checkpoint_*.py tests/tests_pytorch/trainer/test_trainer.py
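
A sketch of what the new DDP test with `monitor=None` might look like (illustrative only, not the exact test added in this PR; the test name, `BoringModel` usage, and Trainer arguments are assumptions):

```python
from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import ModelCheckpoint
from lightning.pytorch.demos.boring_classes import BoringModel


def test_model_checkpoint_ddp_monitor_none(tmp_path):
    # monitor=None makes ModelCheckpoint save unconditionally, which exercises
    # the file_exists check on every save.
    ckpt = ModelCheckpoint(dirpath=tmp_path, monitor=None)
    trainer = Trainer(
        default_root_dir=tmp_path,
        accelerator="cpu",
        devices=2,
        strategy="ddp_spawn",
        max_epochs=2,
        limit_train_batches=2,
        limit_val_batches=0,
        logger=False,
        enable_progress_bar=False,
        callbacks=[ckpt],
    )
    trainer.fit(BoringModel())
    # At least one checkpoint file should have been written.
    assert len(list(tmp_path.glob("*.ckpt"))) >= 1
```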

📚 Documentation preview 📚: https://pytorch-lightning--21380.org.readthedocs.build/en/21380/

@github-actions github-actions bot added the pl label (Generic label for PyTorch Lightning package) Nov 24, 2025

codecov bot commented Nov 24, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82%. Comparing base (8f702b3) to head (58d8c50).
⚠️ Report is 3 commits behind head on master.
✅ All tests successful. No failed tests found.

❗ There is a different number of reports uploaded between BASE (8f702b3) and HEAD (58d8c50).

HEAD has 1,451 fewer uploads than BASE:
Flag                 BASE (8f702b3)   HEAD (58d8c50)
cpu                  355              29
lightning_fabric     89               0
pytest               179              0
python3.12           108              9
python3.12.7         106              9
python3.10           36               2
lightning            179              14
python3.11           72               6
python               33               3
pytorch2.2.2         18               3
pytest-full          176              29
pytorch2.4.1         17               3
pytorch2.3           18               3
pytorch2.1           34               5
pytorch2.9           18               3
pytorch_lightning    87               15
pytorch2.7           18               3
pytorch2.5.1         18               3
pytorch2.8           18               3
pytorch2.6           17               3
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #21380     +/-   ##
=========================================
- Coverage      89%      82%     -7%     
=========================================
  Files         269      266      -3     
  Lines       22050    22008     -42     
=========================================
- Hits        19727    18074   -1653     
- Misses       2323     3934   +1611     

@justusschock justusschock (Member) left a comment

Great job @littlebullGit,

One minor comment. Could you also please add a changelog entry?

@littlebullGit littlebullGit force-pushed the fix/19674-model-checkpoint-oom branch 2 times, most recently from 6c828f7 to 359669f on November 24, 2025 at 22:18
@littlebullGit littlebullGit force-pushed the fix/19674-model-checkpoint-oom branch from 359669f to f6e48c0 on November 25, 2025 at 02:03
@justusschock justusschock merged commit b09e96e into Lightning-AI:master Nov 25, 2025
85 checks passed


Development

Successfully merging this pull request may close these issues.

CUDA memory increase (caused CUDA OOM) when saving checkpoint at the train_epoch_end

3 participants