Fix issue #5242 grad_norm and loss is nan #7171
Conversation
@Glaceon-Hyy, thanks for this PR. Is it possible to convert the repro into a unit test somewhere here?
@Glaceon-Hyy, also do you know if setting
I noticed that in commit 61daaa1, even when total_norm produced a NaN instead of the expected -1, the clip calculation (total_norm / self.loss_scale + 1e-6) / self.clip_grad still resulted in NaN. The condition nan > 1 evaluates to False, which coincidentally handled the invalid value. In commit 1ef9b02, however, torch.clamp(clip, min=1.0) introduced a new issue: when clip is NaN, torch.clamp() returns NaN unchanged. That NaN then propagates into combined_scale, so the subsequent gradient scaling grad.data.mul_(1. / combined_scale) also produces NaN. My latest commit addresses this by explicitly converting a NaN clip to 1.0 before applying the clamp. This prevents NaN propagation while preserving the intended gradient-scaling behavior, keeping the computation numerically stable when total_norm becomes invalid.
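For illustration, here is a minimal sketch of the behavior described above; the concrete values and the torch.where-based NaN guard are illustrative assumptions, not the exact DeepSpeed code:

```python
import torch

total_norm = torch.tensor(float('nan'))  # invalid norm, e.g. after an overflow step
loss_scale = 1024.0
clip_grad = 1.0

clip = (total_norm / loss_scale + 1e-6) / clip_grad

# Pre-1ef9b02 style: "clip > 1" is False for NaN, so the invalid value
# coincidentally never gets applied as a scaling factor.
print(clip > 1)                        # tensor(False)

# 1ef9b02 style: torch.clamp passes NaN through unchanged, so combined_scale
# becomes NaN and the later grad.mul_(1. / combined_scale) poisons every gradient.
print(torch.clamp(clip, min=1.0))      # tensor(nan)

# The guard described here (sketched): map a NaN clip to 1.0 before clamping.
clip = torch.where(torch.isnan(clip), torch.ones_like(clip), clip)
combined_scale = torch.clamp(clip, min=1.0) * loss_scale
grad = torch.randn(3)
print(grad.mul(1.0 / combined_scale))  # finite values
```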
Signed-off-by: yueyang.hyy <[email protected]>
Signed-off-by: yueyang.hyy <[email protected]>
Signed-off-by: yueyang.hyy <[email protected]>
Force-pushed from 972c8c0 to 632fefe
@loadams I just force-pushed to fix DCO (Developer Certificate of Origin) issues in the commits. I noticed that my development environment using Magit did not have Signed-off-by configured by default.
Force-pushed from 632fefe to 49f38a1
@nelyahu, FYI for any perf impact.
…edai#7171)" This reverts commit 1f70662. Signed-off-by: Nadav Elyahu <[email protected]>
@tjruwase @Glaceon-Hyy @loadams @hwchen2017 can you please review the fix in #7184?
This PR addresses a regression introduced in commit 61daaa1 that affects gradient clipping when handling infinite values.
The modified NaN/Inf handling logic in the total_norm calculation leads to unexpected behavior:
Original logic (v0.10.3): converted both NaN and Inf to -1 before entering unscale_and_clip_grads.
Post-commit behavior: when total_norm is Inf, inf_or_nan.logical_not() * total_norm produces NaN instead of 0 (since 0 * inf is NaN in floating-point arithmetic), causing gradient clipping to fail.
Here is a minimal reproducible example comparing gradient clipping behavior across implementations.
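A sketch of what such a comparison could look like; the helper functions below are illustrative reconstructions of the two code paths under discussion, not the original repro or the actual DeepSpeed implementations:

```python
import math
import torch

def total_norm_v0_10_3(total_norm):
    # Original logic: any NaN/Inf norm is mapped to the sentinel -1
    # before unscale_and_clip_grads sees it.
    if math.isinf(total_norm) or math.isnan(total_norm):
        return torch.tensor(-1.0)
    return torch.tensor(total_norm)

def total_norm_post_61daaa1(total_norm):
    # Post-commit logic: mask invalid values by multiplication.
    total_norm = torch.tensor(total_norm)
    inf_or_nan = torch.isinf(total_norm) | torch.isnan(total_norm)
    # For an Inf norm, logical_not() is 0 and 0 * inf == nan,
    # so the "masked" norm ends up NaN rather than 0.
    return inf_or_nan.logical_not() * total_norm

for norm in (2.5, float('inf'), float('nan')):
    print(norm, total_norm_v0_10_3(norm).item(), total_norm_post_61daaa1(norm).item())
```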
Result:
