MetricCollection 1.7.1 causing OOM error in DDP setup #20748
Unanswered
Vatsal-Malaviya
asked this question in
DDP / multi-GPU / multi-node
Replies: 0 comments
Setup
lightning - 2.4.0
torchmetrics - 1.7.1
torch - 2.4.0+cu124
Single node, 2x NVIDIA A30
Summary of the issue and the debugging process with torchmetrics.MetricCollection under DDP:
Context:
Problem:
Debugging & Findings:
Conclusion / Root Cause Hypothesis:
The evidence strongly indicates that the issue is not the size of the metric states stored on the GPU: the single-GPU test shows they fit easily. The problem appears to lie specifically in the DDP synchronization step triggered by MetricCollection.compute().
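To illustrate why synchronization could run out of memory even when the local states fit, here is a rough back-of-the-envelope sketch. It assumes the metric states are accumulated tensors that get all-gathered across ranks and then concatenated on every rank at compute() time. The function name `sync_memory_estimate`, the worst-case model (local state, gathered buffers, and the concatenated result all alive at once), and the example sizes are illustrative assumptions, not measurements from this setup.

```python
def sync_memory_estimate(local_elems: int, bytes_per_elem: int, world_size: int) -> dict:
    """Rough per-rank memory estimate for syncing an accumulated metric state.

    Hypothetical model (an assumption, not measured behaviour):
    - every rank holds a local state of `local_elems` elements;
    - an all-gather gives every rank one buffer per rank;
    - the gathered buffers are then concatenated into one tensor, and in the
      worst case all three (local, gathered, concatenated) coexist briefly.
    """
    local = local_elems * bytes_per_elem
    gathered = local * world_size      # one received buffer per rank
    concatenated = gathered            # concatenation of the gathered buffers
    return {
        "local_bytes": local,
        "gathered_bytes": gathered,
        "worst_case_bytes": local + gathered + concatenated,
    }


# Example with made-up sizes: 50M float32 values per rank on 2 GPUs.
estimate = sync_memory_estimate(local_elems=50_000_000, bytes_per_elem=4, world_size=2)
for name, value in estimate.items():
    print(f"{name}: {value / 1e9:.2f} GB")
```

Under this model the transient footprint during sync is several times the local state, and it grows with the number of ranks. That would be consistent with a state that fits comfortably on a single GPU but fails under DDP, although it does not confirm that this is what happens here.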
Model Script for Reference