
Conversation


@littlebullGit littlebullGit commented Nov 24, 2025

What does this PR do?

  • Use `strategy.reduce_boolean_decision` instead of `broadcast` in `ModelCheckpoint.file_exists` (see the sketch after this list).
  • Ensure only global rank 0 touches the filesystem when checking for existing checkpoints.
  • Avoid `broadcast_object_list` for a simple boolean in DDP to reduce memory pressure in the checkpoint path.
  • Add a small DDP test with `monitor=None` to exercise this path.
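
A minimal sketch of the new check (the rank-0 guard and the `all=False` argument shown here are illustrative assumptions; the merged commit may differ in detail):

```python
# Sketch only, not the verbatim diff from this PR.
def file_exists(self, filepath, trainer) -> bool:
    """Check across ranks whether a checkpoint file already exists."""
    # Only global rank 0 touches the filesystem.
    exists = self._fs.exists(filepath) if trainer.is_global_zero else False
    # Share the result as a tiny boolean reduction instead of strategy.broadcast,
    # which in DDP goes through torch.distributed.broadcast_object_list.
    return trainer.strategy.reduce_boolean_decision(exists, all=False)
```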

Fixes #19674

Motivation and context

In DDP, `strategy.broadcast` is implemented via `torch.distributed.broadcast_object_list`, which serializes the Python object and can allocate unnecessary GPU memory even for a single boolean. For the "file exists" decision we only need a tiny boolean reduction, so `reduce_boolean_decision` is a better fit and addresses the CUDA OOM reported in #19674 while preserving behavior.
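
For illustration, here is a standalone sketch of the two collectives for a single boolean using raw `torch.distributed` calls (not Lightning's internals; the helper names are made up for this example):

```python
import torch
import torch.distributed as dist


def share_bool_via_broadcast_object_list(flag: bool) -> bool:
    # Pickles the Python object and, with the NCCL backend, moves the payload
    # through CUDA tensors, allocating device memory for size metadata and bytes.
    obj = [flag if dist.get_rank() == 0 else None]
    dist.broadcast_object_list(obj, src=0)
    return bool(obj[0])


def share_bool_via_all_reduce(flag: bool, device: torch.device) -> bool:
    # A one-element tensor reduction: no pickling, just a single scalar on device.
    decision = torch.tensor(int(flag), device=device)
    dist.all_reduce(decision, op=dist.ReduceOp.SUM)
    return bool(decision.item())
```

Lightning's `reduce_boolean_decision` follows the second pattern in its distributed strategies, which is why it avoids the extra allocations in the checkpoint path.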

Dependencies

  • No new runtime dependencies introduced by this PR.
  • Tests rely on pytorch_lightning_enterprise being available, as required by `tests/tests_pytorch/conftest.py`.

Tests

All test commands were run inside the project .venv:

  • python -m pytest tests/tests_pytorch/checkpointing/test_checkpoint_callback_frequency.py
  • python -m pytest tests/tests_pytorch/checkpointing -k "not legacy_checkpoints"
  • python -m pytest tests/tests_pytorch/callbacks/test_model_checkpoint_*.py tests/tests_pytorch/trainer/test_trainer.py
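
A sketch of what the new DDP test with `monitor=None` might look like (illustrative only, not the exact test added in this PR; the test name, `BoringModel` usage, and Trainer arguments are assumptions):

```python
from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import ModelCheckpoint
from lightning.pytorch.demos.boring_classes import BoringModel


def test_model_checkpoint_ddp_monitor_none(tmp_path):
    # monitor=None makes ModelCheckpoint save unconditionally, which exercises
    # the file_exists check on every save.
    ckpt = ModelCheckpoint(dirpath=tmp_path, monitor=None)
    trainer = Trainer(
        default_root_dir=tmp_path,
        accelerator="cpu",
        devices=2,
        strategy="ddp_spawn",
        max_epochs=2,
        limit_train_batches=2,
        limit_val_batches=0,
        logger=False,
        enable_progress_bar=False,
        callbacks=[ckpt],
    )
    trainer.fit(BoringModel())
    # At least one checkpoint file should have been written.
    assert len(list(tmp_path.glob("*.ckpt"))) >= 1
```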

📚 Documentation preview 📚: https://pytorch-lightning--21380.org.readthedocs.build/en/21380/

@github-actions github-actions bot added the pl label (Generic label for PyTorch Lightning package) Nov 24, 2025

codecov bot commented Nov 24, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82%. Comparing base (8f702b3) to head (58d8c50).
⚠️ Report is 3 commits behind head on master.
✅ All tests successful. No failed tests found.

❗ There is a different number of reports uploaded between BASE (8f702b3) and HEAD (58d8c50).

HEAD has 1,451 fewer uploads than BASE:
Flag                 BASE (8f702b3)   HEAD (58d8c50)
cpu                  355              29
lightning_fabric     89               0
pytest               179              0
python3.12           108              9
python3.12.7         106              9
python3.10           36               2
lightning            179              14
python3.11           72               6
python               33               3
pytorch2.2.2         18               3
pytest-full          176              29
pytorch2.4.1         17               3
pytorch2.3           18               3
pytorch2.1           34               5
pytorch2.9           18               3
pytorch_lightning    87               15
pytorch2.7           18               3
pytorch2.5.1         18               3
pytorch2.8           18               3
pytorch2.6           17               3
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #21380     +/-   ##
=========================================
- Coverage      89%      82%     -7%     
=========================================
  Files         269      266      -3     
  Lines       22050    22008     -42     
=========================================
- Hits        19727    18074   -1653     
- Misses       2323     3934   +1611     

@justusschock justusschock (Member) left a comment

Great job @littlebullGit,

One minor comment. Could you also please add a changelog entry?

@littlebullGit littlebullGit force-pushed the fix/19674-model-checkpoint-oom branch 2 times, most recently from 6c828f7 to 359669f on November 24, 2025 at 22:18
@littlebullGit littlebullGit force-pushed the fix/19674-model-checkpoint-oom branch from 359669f to f6e48c0 on November 25, 2025 at 02:03
@justusschock justusschock merged commit b09e96e into Lightning-AI:master Nov 25, 2025
85 checks passed


Development

Successfully merging this pull request may close these issues.

CUDA memory increase (caused CUDA OOM) when saving checkpoint at the train_epoch_end

3 participants