
Fix issue #5242 grad_norm and loss is nan #7171

Merged
merged 4 commits into deepspeedai:master on Mar 29, 2025

Conversation

Glaceon-Hyy
Contributor

This PR addresses a regression introduced in commit 61daaa1 that breaks gradient clipping when the total gradient norm is Inf or NaN.

The modified NaN/Inf handling logic in total_norm calculation leads to unexpected behavior:

Original logic (v0.10.3): converted both NaN and Inf to -1 before entering unscale_and_clip_grads.
Post-commit behavior: when total_norm is Inf (or NaN), inf_or_nan.logical_not() * total_norm produces NaN instead of 0, so total_norm stays NaN and gradient clipping fails (see the sketch below).
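
For reference, here is a minimal sketch (not part of the PR) of why the masked multiplication goes wrong:

import torch

total_norm = torch.tensor(float('inf'))
inf_or_nan = total_norm.isnan().logical_or(total_norm.isinf())

# The mask for the "keep the norm" term is False here, but 0 * inf is NaN under
# IEEE-754, so the masked-out term poisons the sum instead of vanishing.
print(inf_or_nan.logical_not() * total_norm)  # tensor(nan)
print(inf_or_nan * torch.tensor(-1.0) + inf_or_nan.logical_not() * total_norm)  # tensor(nan), expected tensor(-1.)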

Here is a minimal reproducible example comparing gradient clipping behavior across implementations.

import torch
import copy

def test(total_norm):
    test_old_deepspeed(total_norm)
    test_deepspeed(total_norm)
    test_torch(total_norm)
    test_deepspeed_fix(total_norm)

def test_old_deepspeed(total_norm_tensor):
    total_norm = copy.deepcopy(total_norm_tensor)
    # https://github.com/deepspeedai/DeepSpeed/blob/v0.10.3/deepspeed/runtime/zero/stage_1_and_2.py#L1233
    if total_norm == float('inf') or total_norm == -float('inf') or total_norm != total_norm:
        total_norm = torch.tensor(float(-1))
        
    # https://github.com/deepspeedai/DeepSpeed/blob/v0.10.3/deepspeed/runtime/zero/stage_1_and_2.py#L1848
    clip_grad = float(1.0)
    loss_scale = float(1.0)
    combined_scale = loss_scale
    clip = ((total_norm / loss_scale) + 1e-6) / clip_grad
    if clip > 1:
        combined_scale = clip * loss_scale
    print(f"old_deepspeed: {1. / combined_scale}")

def test_deepspeed(total_norm_tensor):
    total_norm = copy.deepcopy(total_norm_tensor)
    # https://github.com/deepspeedai/DeepSpeed/blob/v0.16.4/deepspeed/runtime/zero/stage_1_and_2.py#L1710
    norm_is_inf = total_norm.isinf()
    norm_is_nan = total_norm.isnan()
    inf_or_nan = norm_is_nan.logical_or(norm_is_inf)

    err = torch.tensor(-1.0, dtype=torch.float)
    total_norm = inf_or_nan * err + inf_or_nan.logical_not() * total_norm

    # https://github.com/deepspeedai/DeepSpeed/blob/v0.16.4/deepspeed/runtime/zero/stage_1_and_2.py#L1970
    clip_grad = float(1.0)
    loss_scale = float(1.0)
    clip = ((total_norm / loss_scale) + 1e-6) / clip_grad
    clip = torch.clamp(clip, min=1.0)
    combined_scale = clip * loss_scale
    print(f"test_deepspeed: {1. / combined_scale}")
    
def test_torch(total_norm_tensor):
    # https://github.com/pytorch/pytorch/blob/v2.6.0/torch/nn/utils/clip_grad.py#L155
    total_norm = copy.deepcopy(total_norm_tensor)
    max_norm = float(1.0)
    clip_coef = max_norm / (total_norm + 1e-6)
    clip_coef_clamped = torch.clamp(clip_coef, max=1.0)
    print(f"torch: {clip_coef_clamped}")

def test_deepspeed_fix(total_norm_tensor):
    total_norm = copy.deepcopy(total_norm_tensor)
    if total_norm.isinf() or total_norm.isnan():
        total_norm = torch.tensor(-1.0, dtype=torch.float)

    # https://github.com/deepspeedai/DeepSpeed/blob/v0.16.4/deepspeed/runtime/zero/stage_1_and_2.py#L1970
    clip_grad = float(1.0)
    loss_scale = float(1.0)
    clip = ((total_norm / loss_scale) + 1e-6) / clip_grad
    clip = torch.clamp(clip, min=1.0)
    combined_scale = clip * loss_scale
    print(f"test_deepspeed_fix: {1. / combined_scale}")
    
if __name__ == '__main__':
    print("*****NAN*****")
    test(torch.tensor(float('nan')))
    print("*****INF*****")
    test(torch.tensor(float('inf')))
    print("*****positive*****")
    test(torch.tensor(float(2.0)))

Result: (screenshot of the script output in the original PR)

@tjruwase
Contributor

@Glaceon-Hyy, thanks for this PR. Is it possible to convert the repro into a unit test somewhere here?
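
One possible shape for such a test, condensed from the repro above (the helper and test names here are placeholders, not existing DeepSpeed APIs):

import pytest
import torch

def _combined_scale(total_norm, clip_grad=1.0, loss_scale=1.0):
    # Mirrors the fixed unscale_and_clip_grads math from the repro:
    # invalid norms are mapped to -1 before clamping, keeping the scale finite.
    if total_norm.isinf() or total_norm.isnan():
        total_norm = torch.tensor(-1.0)
    clip = torch.clamp(((total_norm / loss_scale) + 1e-6) / clip_grad, min=1.0)
    return clip * loss_scale

@pytest.mark.parametrize("norm", [float('nan'), float('inf'), 2.0])
def test_unscale_and_clip_handles_invalid_norms(norm):
    scale = _combined_scale(torch.tensor(norm))
    assert torch.isfinite(scale)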

@tjruwase
Contributor

@Glaceon-Hyy, also do you know if setting overlap_comm to False has any effect on this?

@Glaceon-Hyy
Contributor Author

I noticed that in commit 61daaa1, even though total_norm produced NaN instead of the expected -1, the clip calculation (total_norm / self.loss_scale + 1e-6)/self.clip_grad also resulted in NaN; because the condition nan > 1 evaluates to False, that invalid value was coincidentally handled.

In commit 1ef9b02, however, torch.clamp(clip, min=1.0) introduced a new issue: when clip is NaN, torch.clamp() returns NaN unchanged. The NaN then propagates to combined_scale, so the subsequent gradient scaling grad.data.mul_(1. / combined_scale) produces NaN.

My latest commit addresses this by explicitly converting a NaN clip to 1.0 before applying the clamp. This prevents NaN propagation while preserving the intended gradient scaling, keeping the update numerically stable when total_norm becomes invalid; a sketch of this check follows below.
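
A minimal sketch of the clamp behavior and of one way to express that check (illustrative only, not the exact PR diff):

import torch

clip = torch.tensor(float('nan'))
print(torch.clamp(clip, min=1.0))  # tensor(nan): clamp leaves NaN untouched

# Convert a NaN clip to 1.0 before clamping, so combined_scale (and hence
# 1. / combined_scale) stays finite.
clip = torch.where(clip.isnan(), torch.tensor(1.0), clip)
print(torch.clamp(clip, min=1.0))  # tensor(1.)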

@Glaceon-Hyy
Contributor Author

@loadams I just force-pushed to fix DCO (Developer Certificate of Origin) issues in the commits; my development environment (Magit) did not add a Signed-off-by trailer by default.

@tjruwase
Contributor

@nelyahu, FYI for any perf impact.

@tjruwase tjruwase added this pull request to the merge queue Mar 28, 2025
@loadams loadams removed this pull request from the merge queue due to a manual request Mar 28, 2025
@hwchen2017 hwchen2017 added this pull request to the merge queue Mar 29, 2025
Merged via the queue into deepspeedai:master with commit 1f70662 Mar 29, 2025
12 checks passed
nelyahu added a commit to nelyahu/DeepSpeed that referenced this pull request Mar 30, 2025
nelyahu added a commit to nelyahu/DeepSpeed that referenced this pull request Mar 30, 2025
@nelyahu
Contributor

nelyahu commented Mar 30, 2025

@tjruwase @Glaceon-Hyy @loadams @hwchen2017, can you please review the fix in #7184? I replaced the logical equation with torch.where; a rough sketch of that formulation follows below.
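
A rough sketch of that torch.where formulation, applied to the total_norm handling from the repro above (illustrative, not the exact #7184 diff):

import torch

total_norm = torch.tensor(float('inf'))
err = torch.tensor(-1.0, dtype=torch.float)

# torch.where selects between the two values instead of multiplying by a mask,
# so the 0 * inf = NaN case never arises.
total_norm = torch.where(total_norm.isinf() | total_norm.isnan(), err, total_norm)
print(total_norm)  # tensor(-1.)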
