Fix ModelCheckpoint file_exists OOM in DDP #21380
Conversation
Codecov Report
✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

@@            Coverage Diff            @@
##           master   #21380      +/-   ##
==========================================
- Coverage      89%      82%       -7%
==========================================
  Files         269      266        -3
  Lines       22050    22008       -42
==========================================
- Hits        19727    18074     -1653
- Misses       2323     3934     +1611
justusschock
left a comment
Great job @littlebullGit,
One minor comment. Could you also please add a changelog entry?
Force-pushed from 6c828f7 to 359669f
Force-pushed from 359669f to f6e48c0
What does this PR do?
- Avoids `broadcast_object_list` for a simple boolean in DDP to reduce memory pressure in the checkpoint path (see the sketch after this list).
- Adds a test with `monitor=None` to exercise this path.
- Fixes #19674
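The core of the change can be sketched as follows. This is a hedged illustration based on the description above, not the verbatim diff: the class, `self._fs`, and the `all=False` ("any rank") semantics are assumptions that merely follow the shape of the existing `ModelCheckpoint.file_exists` helper.

```python
from typing import Any


class _FileExistsSketch:
    """Illustration only; stands in for the real ModelCheckpoint callback."""

    def __init__(self, fs: Any) -> None:
        self._fs = fs  # fsspec-style filesystem, as in the real callback

    # Before: broadcast a Python bool from rank 0. Under DDP this goes
    # through torch.distributed.broadcast_object_list, which pickles the
    # value and can allocate GPU memory even for a single boolean.
    def file_exists_before(self, filepath: str, trainer: Any) -> bool:
        exists = self._fs.exists(filepath)
        return trainer.strategy.broadcast(exists)

    # After: reduce a boolean decision across ranks instead of broadcasting
    # a Python object. The all=False choice here is an assumption.
    def file_exists_after(self, filepath: str, trainer: Any) -> bool:
        exists = self._fs.exists(filepath)
        return trainer.strategy.reduce_boolean_decision(exists, all=False)
```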
Motivation and context
In DDP, `strategy.broadcast` is implemented via
`torch.distributed.broadcast_object_list`, which serializes the Python object and can allocate unnecessary GPU memory even for a single boolean. For the "file exists" decision we only need a tiny boolean reduction, so `reduce_boolean_decision` is a better fit and addresses the CUDA OOM reported in #19674 while preserving behavior.
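To make the memory argument concrete, here is a rough, standalone illustration of the two collectives involved. It is a sketch under assumptions, not Lightning's implementation: it assumes an already-initialized `torch.distributed` process group, and the `MAX` reduction is only one plausible way to realize "any rank" semantics.

```python
import torch
import torch.distributed as dist


def exists_via_object_broadcast(flag: bool, src: int = 0) -> bool:
    # What strategy.broadcast does for arbitrary Python objects: pickle the
    # value and broadcast the byte buffer via broadcast_object_list, which
    # can allocate device memory when running on a CUDA backend.
    buf = [flag if dist.get_rank() == src else None]
    dist.broadcast_object_list(buf, src=src)
    return bool(buf[0])


def exists_via_boolean_reduction(flag: bool) -> bool:
    # A boolean reduction only needs a one-element tensor all-reduce.
    # MAX yields "True if any rank saw the file"; the exact reduction used
    # by Lightning's reduce_boolean_decision is not shown in this PR text.
    decision = torch.tensor(int(flag), dtype=torch.int32)
    dist.all_reduce(decision, op=dist.ReduceOp.MAX)
    return bool(decision.item())
```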
Requires `pytorch_lightning_enterprise` to be available, as required by `tests/tests_pytorch/conftest.py`.
Tests
All run inside the project `.venv`:
- `python -m pytest tests/tests_pytorch/checkpointing/test_checkpoint_callback_frequency.py`
- `python -m pytest tests/tests_pytorch/checkpointing -k "not legacy_checkpoints"`
- `python -m pytest tests/tests_pytorch/callbacks/test_model_checkpoint_*.py tests/tests_pytorch/trainer/test_trainer.py`

📚 Documentation preview 📚: https://pytorch-lightning--21380.org.readthedocs.build/en/21380/