
safe_get_full_grad & safe_set_full_grad #7117

Open
ProjectDisR opened this issue Mar 9, 2025 · 3 comments

Comments


ProjectDisR commented Mar 9, 2025

DeepSpeed 0.15.3, ZeRO stage 3 is used.

For `safe_get_full_grad`, does it return the same gradient values on each process/rank?

As for `safe_set_full_grad`, should it be called on all the processes/ranks, or is calling it on just one of them enough?
If it is the former, do users need to ensure that the gradient values being set are the same on each process/rank?

Also, which float type should be used for `safe_set_full_grad`? Is there any way to check this?

@tjruwase (Contributor) commented:

> For `safe_get_full_grad`, does it return the same gradient values on each process/rank?

Yes. This API is called after gradient reduction, so all DP ranks have the same gradient value: https://deepspeed.readthedocs.io/en/latest/zero3.html#debugging
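As an illustration, here is a minimal sketch of reading the full, reduced gradients under ZeRO-3 (untested here; it assumes an engine returned by `deepspeed.initialize` and a model whose forward call returns the loss, which is an assumption about your training loop):

```python
# Sketch, not a drop-in implementation: inspect full (reduced) gradients
# under ZeRO stage 3. safe_get_full_grad must run on every DP rank, and
# after engine.backward() all ranks observe the same values.
from deepspeed.utils import safe_get_full_grad

def train_step(engine, batch):
    loss = engine(batch)      # hypothetical: forward returns the loss
    engine.backward(loss)     # gradients are reduced across DP ranks here
    for name, param in engine.module.named_parameters():
        grad = safe_get_full_grad(param)  # full gradient, same on every rank
        if grad is not None:
            print(f"{name}: grad norm {grad.norm().item():.4f}")
    engine.step()
```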

> As for `safe_set_full_grad`, should it be called on all the processes/ranks, or is calling it on just one of them enough?

These APIs must be called by all DP ranks to avoid hangs.

> If it is the former, do users need to ensure that the gradient values being set are the same on each process/rank?

Yes, users have the freedom and the responsibility to use these APIs correctly. Normally, gradient values are the same across DP ranks after reduction.
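For illustration, a sketch of a collective gradient edit (untested; `engine` is assumed to be a DeepSpeed ZeRO-3 engine, and the new value is a deterministic function of the rank-identical current gradient, so every rank computes and sets the same tensor):

```python
# Sketch: every DP rank calls safe_set_full_grad for the same parameters
# with identical values; skipping ranks, or setting different values,
# would cause hangs or divergent model replicas.
from deepspeed.utils import safe_get_full_grad, safe_set_full_grad

def clip_full_grads(engine, max_norm=1.0):
    for param in engine.module.parameters():
        grad = safe_get_full_grad(param)  # identical on all DP ranks
        if grad is None:
            continue
        norm = grad.norm()
        if norm > max_norm:
            # deterministic function of a rank-identical tensor, so the
            # value written is the same on every rank
            safe_set_full_grad(param, grad * (max_norm / norm))
```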

> Also, which float type should be used for `safe_set_full_grad`? Is there any way to check this?

The type should match the gradient accumulation data type configured by the user: https://deepspeed.readthedocs.io/en/latest/zero3.html#deepspeed.utils.safe_set_full_grad
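As a sketch, the gradient accumulation dtype can be pinned explicitly in the DeepSpeed config (key names per the DeepSpeed config schema; verify against your version):

```json
{
  "bf16": { "enabled": true },
  "data_types": { "grad_accum_dtype": "fp32" }
}
```

With such a config, a tensor passed to `safe_set_full_grad` would need to be fp32; in practice, inspecting `safe_get_full_grad(param).dtype` at runtime tells you the dtype to match.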

@jinghanjia commented:

@tjruwase Thank you for your explanation. I have an additional question regarding the `safe_get_local_grad` and `safe_set_local_grad` functions. Are these functions designed to return the gradient values specific to each process/rank (i.e., the gradients computed on that particular rank before reduction)? If not, could you please explain the differences between the local gradients and the aggregated gradients? Thank you!

@tjruwase (Contributor) commented:

@jinghanjia, please see motivation of the local APIs here: #4681
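In short, the local APIs operate on the rank-local partition of the gradient that ZeRO-3 keeps on this rank, rather than the full gathered tensor. A sketch of their use (untested; assumes a ZeRO-3 engine):

```python
# Sketch: the local variants read/write only this rank's shard of the
# gradient, so no cross-rank gather is needed and the values may
# legitimately differ from rank to rank.
from deepspeed.utils import safe_get_local_grad, safe_set_local_grad

def scale_local_grads(engine, factor):
    for param in engine.module.parameters():
        shard = safe_get_local_grad(param)  # this rank's partition only
        if shard is not None:
            safe_set_local_grad(param, shard * factor)
```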
