
safe_get_full_grad & safe_set_full_grad #7117

Open
ProjectDisR opened this issue Mar 9, 2025 · 3 comments

Comments


ProjectDisR commented Mar 9, 2025

DeepSpeed 0.15.3, ZeRO stage 3 is used.

For `safe_get_full_grad`, does it return the same gradient values on each process/rank?

As for `safe_set_full_grad`, should it be called on all the processes/ranks, or is calling it on just one of them enough?
If it is the former, do users need to ensure that the gradient values being set are the same on each process/rank?

Also, which float type should be used for `safe_set_full_grad`? Is there any way to check this?

@tjruwase (Contributor) commented:

> For `safe_get_full_grad`, does it return the same gradient values on each process/rank?

Yes. This API is called after gradient reduction, so all DP ranks have the same gradient value: https://deepspeed.readthedocs.io/en/latest/zero3.html#debugging
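As an illustration, here is a minimal sketch of reading the full, reduced gradients under ZeRO-3 (untested here; it assumes an engine returned by `deepspeed.initialize` and a model whose forward call returns the loss, which is an assumption about your training loop):

```python
# Sketch, not a drop-in implementation: inspect full (reduced) gradients
# under ZeRO stage 3. safe_get_full_grad must run on every DP rank, and
# after engine.backward() all ranks observe the same values.
from deepspeed.utils import safe_get_full_grad

def train_step(engine, batch):
    loss = engine(batch)      # hypothetical: forward returns the loss
    engine.backward(loss)     # gradients are reduced across DP ranks here
    for name, param in engine.module.named_parameters():
        grad = safe_get_full_grad(param)  # full gradient, same on every rank
        if grad is not None:
            print(f"{name}: grad norm {grad.norm().item():.4f}")
    engine.step()
```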

> As for `safe_set_full_grad`, should it be called on all the processes/ranks, or is calling it on just one of them enough?

These APIs must be called by all DP ranks to avoid hangs.

> If it is the former, do users need to ensure that the gradient values being set are the same on each process/rank?

Yes, users have the freedom and the responsibility to use these APIs correctly. Normally, gradient values are the same across DP ranks after reduction.
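For illustration, a sketch of a collective gradient edit (untested; `engine` is assumed to be a DeepSpeed ZeRO-3 engine, and the new value is a deterministic function of the rank-identical current gradient, so every rank computes and sets the same tensor):

```python
# Sketch: every DP rank calls safe_set_full_grad for the same parameters
# with identical values; skipping ranks, or setting different values,
# would cause hangs or divergent model replicas.
from deepspeed.utils import safe_get_full_grad, safe_set_full_grad

def clip_full_grads(engine, max_norm=1.0):
    for param in engine.module.parameters():
        grad = safe_get_full_grad(param)  # identical on all DP ranks
        if grad is None:
            continue
        norm = grad.norm()
        if norm > max_norm:
            # deterministic function of a rank-identical tensor, so the
            # value written is the same on every rank
            safe_set_full_grad(param, grad * (max_norm / norm))
```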

> Also, which float type should be used for `safe_set_full_grad`? Is there any way to check this?

The type should match the gradient accumulation data type configured by the user: https://deepspeed.readthedocs.io/en/latest/zero3.html#deepspeed.utils.safe_set_full_grad
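As a sketch, the gradient accumulation dtype can be pinned explicitly in the DeepSpeed config (key names per the DeepSpeed config schema; verify against your version):

```json
{
  "bf16": { "enabled": true },
  "data_types": { "grad_accum_dtype": "fp32" }
}
```

With such a config, a tensor passed to `safe_set_full_grad` would need to be fp32; in practice, inspecting `safe_get_full_grad(param).dtype` at runtime tells you the dtype to match.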

@jinghanjia commented:

@tjruwase Thank you for your explanation. I have an additional question regarding the `safe_get_local_grad` and `safe_set_local_grad` functions. Are these functions designed to return the gradient values specific to each process/rank (i.e., the gradients computed on that particular rank before reduction)? If not, could you please explain the differences between the local gradients and the aggregated gradients? Thank you!

@tjruwase (Contributor) commented:

@jinghanjia, please see motivation of the local APIs here: #4681
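In short, the local APIs operate on the rank-local partition of the gradient that ZeRO-3 keeps on this rank, rather than the full gathered tensor. A sketch of their use (untested; assumes a ZeRO-3 engine):

```python
# Sketch: the local variants read/write only this rank's shard of the
# gradient, so no cross-rank gather is needed and the values may
# legitimately differ from rank to rank.
from deepspeed.utils import safe_get_local_grad, safe_set_local_grad

def scale_local_grads(engine, factor):
    for param in engine.module.parameters():
        shard = safe_get_local_grad(param)  # this rank's partition only
        if shard is not None:
            safe_set_local_grad(param, shard * factor)
```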
