Multi-Node DeviceStatsMonitor #19700
Unanswered
oabuhamdan asked this question in DDP / multi-GPU / multi-node
Replies: 0 comments
Greetings,
I run multi-node training with the torchrun command on CPU devices across 3 nodes.
I notice that DeviceStatsMonitor only captures stats for the node with node_rank 0.
Is there a way to capture this data for all nodes?
I am open to a solution that involves writing my own callbacks.
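
For reference, here is a rough sketch of the kind of custom callback that might work. It rests on several assumptions that are not confirmed in this thread: that the stats go missing because Lightning's built-in loggers only write from global rank 0, that torchrun exposes the node rank as the `GROUP_RANK` environment variable (true when one worker group runs per node), and that a per-node CSV file is an acceptable destination. `PerNodeStatsWriter`, `out_dir`, and the file naming are hypothetical names made up for illustration. The stats are limited to `os.getloadavg()` to keep the example stdlib-only (Unix only); something like `psutil` could supply richer CPU and memory figures.

```python
import csv
import os
import socket
import time

try:
    from lightning.pytorch.callbacks import Callback
except ImportError:
    # Fallback so the sketch can be exercised without Lightning installed.
    Callback = object


class PerNodeStatsWriter(Callback):
    """Hypothetical callback: each node appends its own stats to its own CSV,
    bypassing the trainer's logger entirely."""

    def __init__(self, out_dir="node_stats"):
        super().__init__()
        self.out_dir = out_dir

    def _collect(self):
        # 1- and 5-minute system load averages (Unix only).
        load_1m, load_5m, _ = os.getloadavg()
        return {
            "time": time.time(),
            "host": socket.gethostname(),
            "load_1m": load_1m,
            "load_5m": load_5m,
        }

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        # Write once per node: only the process with LOCAL_RANK 0 proceeds.
        if int(os.environ.get("LOCAL_RANK", "0")) != 0:
            return
        # Assumption: torchrun sets GROUP_RANK to the node rank.
        node_rank = os.environ.get("GROUP_RANK", "0")
        os.makedirs(self.out_dir, exist_ok=True)
        path = os.path.join(self.out_dir, f"node_{node_rank}.csv")
        row = {"batch_idx": batch_idx, **self._collect()}
        is_new_file = not os.path.exists(path)
        with open(path, "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(row))
            if is_new_file:
                writer.writeheader()
            writer.writerow(row)
```

It would be passed to the trainer like any other callback, e.g. `Trainer(callbacks=[PerNodeStatsWriter()])`. Each node writes to its local filesystem unless `out_dir` points at shared storage, so the files would need to be gathered afterwards.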