
Fix issue #5242 grad_norm and loss is nan #7171

Merged
merged 4 commits into deepspeedai:master on Mar 29, 2025

Conversation

Glaceon-Hyy
Contributor

This PR addresses a regression introduced in commit 61daaa1 that breaks gradient clipping when the total gradient norm is Inf or NaN.

The modified NaN/Inf handling logic in total_norm calculation leads to unexpected behavior:

Original logic (v0.10.3): converted both NaN and Inf to -1 before entering unscale_and_clip_grads.
Post-commit behavior: when total_norm is Inf (or NaN), inf_or_nan.logical_not() * total_norm produces NaN instead of 0, so total_norm stays NaN and gradient clipping fails (see the sketch below).
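
For reference, here is a minimal sketch (not part of the PR) of why the masked multiplication goes wrong:

import torch

total_norm = torch.tensor(float('inf'))
inf_or_nan = total_norm.isnan().logical_or(total_norm.isinf())

# The mask for the "keep the norm" term is False here, but 0 * inf is NaN under
# IEEE-754, so the masked-out term poisons the sum instead of vanishing.
print(inf_or_nan.logical_not() * total_norm)  # tensor(nan)
print(inf_or_nan * torch.tensor(-1.0) + inf_or_nan.logical_not() * total_norm)  # tensor(nan), expected tensor(-1.)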

Here is a minimal reproducible example comparing gradient clipping behavior across implementations.

import torch
import copy

def test(total_norm):
    test_old_deepspeed(total_norm)
    test_deepspeed(total_norm)
    test_torch(total_norm)
    test_deepspeed_fix(total_norm)

def test_old_deepspeed(total_norm_tensor):
    total_norm = copy.deepcopy(total_norm_tensor)
    # https://github.com/deepspeedai/DeepSpeed/blob/v0.10.3/deepspeed/runtime/zero/stage_1_and_2.py#L1233
    if total_norm == float('inf') or total_norm == -float('inf') or total_norm != total_norm:
        total_norm = torch.tensor(float(-1))
        
    # https://github.com/deepspeedai/DeepSpeed/blob/v0.10.3/deepspeed/runtime/zero/stage_1_and_2.py#L1848
    clip_grad = float(1.0)
    loss_scale = float(1.0)
    combined_scale = loss_scale
    clip = ((total_norm / loss_scale) + 1e-6) / clip_grad
    if clip > 1:
        combined_scale = clip * loss_scale
    print(f"old_deepspeed: {1. / combined_scale}")

def test_deepspeed(total_norm_tensor):
    total_norm = copy.deepcopy(total_norm_tensor)
    # https://github.com/deepspeedai/DeepSpeed/blob/v0.16.4/deepspeed/runtime/zero/stage_1_and_2.py#L1710
    norm_is_inf = total_norm.isinf()
    norm_is_nan = total_norm.isnan()
    inf_or_nan = norm_is_nan.logical_or(norm_is_inf)

    err = torch.tensor(-1.0, dtype=torch.float)
    total_norm = inf_or_nan * err + inf_or_nan.logical_not() * total_norm

    # https://github.com/deepspeedai/DeepSpeed/blob/v0.16.4/deepspeed/runtime/zero/stage_1_and_2.py#L1970
    clip_grad = float(1.0)
    loss_scale = float(1.0)
    clip = ((total_norm / loss_scale) + 1e-6) / clip_grad
    clip = torch.clamp(clip, min=1.0)
    combined_scale = clip * loss_scale
    print(f"test_deepspeed: {1. / combined_scale}")
    
def test_torch(total_norm_tensor):
    # https://github.com/pytorch/pytorch/blob/v2.6.0/torch/nn/utils/clip_grad.py#L155
    total_norm = copy.deepcopy(total_norm_tensor)
    max_norm = float(1.0)
    clip_coef = max_norm / (total_norm + 1e-6)
    clip_coef_clamped = torch.clamp(clip_coef, max=1.0)
    print(f"torch: {clip_coef_clamped}")

def test_deepspeed_fix(total_norm_tensor):
    total_norm = copy.deepcopy(total_norm_tensor)
    if total_norm.isinf() or total_norm.isnan():
        total_norm = torch.tensor(-1.0, dtype=torch.float)

    # https://github.com/deepspeedai/DeepSpeed/blob/v0.16.4/deepspeed/runtime/zero/stage_1_and_2.py#L1970
    clip_grad = float(1.0)
    loss_scale = float(1.0)
    clip = ((total_norm / loss_scale) + 1e-6) / clip_grad
    clip = torch.clamp(clip, min=1.0)
    combined_scale = clip * loss_scale
    print(f"test_deepspeed_fix: {1. / combined_scale}")
    
if __name__ == '__main__':
    print("*****NAN*****")
    test(torch.tensor(float('nan')))
    print("*****INF*****")
    test(torch.tensor(float('inf')))
    print("*****positive*****")
    test(torch.tensor(float(2.0)))

Result: (screenshot of the script output in the original PR)

@tjruwase
Contributor

@Glaceon-Hyy, thanks for this PR. Is it possible to convert the repro into a unit test somewhere here?
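
One possible shape for such a test, condensed from the repro above (the helper and test names here are placeholders, not existing DeepSpeed APIs):

import pytest
import torch

def _combined_scale(total_norm, clip_grad=1.0, loss_scale=1.0):
    # Mirrors the fixed unscale_and_clip_grads math from the repro:
    # invalid norms are mapped to -1 before clamping, keeping the scale finite.
    if total_norm.isinf() or total_norm.isnan():
        total_norm = torch.tensor(-1.0)
    clip = torch.clamp(((total_norm / loss_scale) + 1e-6) / clip_grad, min=1.0)
    return clip * loss_scale

@pytest.mark.parametrize("norm", [float('nan'), float('inf'), 2.0])
def test_unscale_and_clip_handles_invalid_norms(norm):
    scale = _combined_scale(torch.tensor(norm))
    assert torch.isfinite(scale)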

@tjruwase
Contributor

@Glaceon-Hyy, also do you know if setting overlap_comm to False has any effect on this?

@Glaceon-Hyy
Contributor Author

I noticed that in commit 61daaa1, even though total_norm produced NaN instead of the expected -1, the clip calculation (total_norm / self.loss_scale + 1e-6)/self.clip_grad also resulted in NaN; because the condition nan > 1 evaluates to False, that invalid value was coincidentally handled.

In commit 1ef9b02, however, torch.clamp(clip, min=1.0) introduced a new issue: when clip is NaN, torch.clamp() returns NaN unchanged. The NaN then propagates to combined_scale, so the subsequent gradient scaling grad.data.mul_(1. / combined_scale) produces NaN.

My latest commit addresses this by explicitly converting a NaN clip to 1.0 before applying the clamp. This prevents NaN propagation while preserving the intended gradient scaling, keeping the update numerically stable when total_norm becomes invalid; a sketch of this check follows below.
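
A minimal sketch of the clamp behavior and of one way to express that check (illustrative only, not the exact PR diff):

import torch

clip = torch.tensor(float('nan'))
print(torch.clamp(clip, min=1.0))  # tensor(nan): clamp leaves NaN untouched

# Convert a NaN clip to 1.0 before clamping, so combined_scale (and hence
# 1. / combined_scale) stays finite.
clip = torch.where(clip.isnan(), torch.tensor(1.0), clip)
print(torch.clamp(clip, min=1.0))  # tensor(1.)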

@Glaceon-Hyy
Contributor Author

@loadams I just force-pushed to fix DCO (Developer Certificate of Origin) issues in the commits; my development environment (Magit) did not add a Signed-off-by trailer by default.

@tjruwase
Contributor

@nelyahu, FYI for any perf impact.

@tjruwase tjruwase added this pull request to the merge queue Mar 28, 2025
@loadams loadams removed this pull request from the merge queue due to a manual request Mar 28, 2025
@hwchen2017 hwchen2017 added this pull request to the merge queue Mar 29, 2025
Merged via the queue into deepspeedai:master with commit 1f70662 Mar 29, 2025
12 checks passed
nelyahu added a commit to nelyahu/DeepSpeed that referenced this pull request Mar 30, 2025
nelyahu added a commit to nelyahu/DeepSpeed that referenced this pull request Mar 30, 2025
@nelyahu
Contributor

nelyahu commented Mar 30, 2025

@tjruwase @Glaceon-Hyy @loadams @hwchen2017, can you please review the fix in #7184? I replaced the logical equation with torch.where; a rough sketch of that formulation follows below.
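
A rough sketch of that torch.where formulation, applied to the total_norm handling from the repro above (illustrative, not the exact #7184 diff):

import torch

total_norm = torch.tensor(float('inf'))
err = torch.tensor(-1.0, dtype=torch.float)

# torch.where selects between the two values instead of multiplying by a mask,
# so the 0 * inf = NaN case never arises.
total_norm = torch.where(total_norm.isinf() | total_norm.isnan(), err, total_norm)
print(total_norm)  # tensor(-1.)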
