
[BUG] Enabling hpZ causes an abnormally large loss. #7164

Open · alex-ht opened this issue Mar 21, 2025 · 0 comments
Labels: bug (Something isn't working), training

alex-ht commented Mar 21, 2025

Describe the bug
Enabling hpZ (ZeRO++ hierarchical weight partitioning, set via zero_hpz_partition_size) causes an abnormally large training loss.

To Reproduce
Steps to reproduce the behavior:

  1. Add "zero_hpz_partition_size": 8, to deepspeed config file.
  2. Run any SFT training using Axolotl.
    Here is my setting:
base_model: mistralai/Mistral-Small-24B-Instruct-2501

plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: true

datasets:
  - path: yentinglin/s1K-1.1-trl-format
    type: chat_template
    chat_template: tokenizer_default
    field_messages: messages
    message_field_role: role
    message_field_content: content
dataset_prepared_path:
val_set_size: 0.01
output_dir: ./placeholder/

sequence_len: 2048
sample_packing: false
eval_sample_packing: False
pad_to_sequence_len: true
eval_steps: 1

gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 5
optimizer: adamw_torch_fused
lr_scheduler: cosine
learning_rate: 2e-5
max_grad_norm: 1.0

train_on_inputs: false
group_by_length: false
bf16: false
fp16: true
tf32: false
bfloat16: false
float16: true

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: true
logging_steps: 1
flash_attention: false
xformers_attention: false
sdp_attention: false

warmup_ratio: 0.1
saves_per_epoch: 2
weight_decay: 0.0
deepspeed: deepspeed_configs/zero3_pp.json

The first iteration will result in a loss of ~11, whereas it should normally be around 1.3.
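
For reference, deepspeed_configs/zero3_pp.json is not reproduced above. As a minimal sketch only (the actual file may differ in other fields), a ZeRO-3 config with hpZ enabled places zero_hpz_partition_size inside the zero_optimization block:

{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "zero_hpz_partition_size": 8,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "fp16": {
    "enabled": "auto"
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}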

Expected behavior
The loss should start at around 1.3 and gradually decrease.
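
To help isolate this from Axolotl, here is a minimal standalone sketch (the toy model and the script name repro.py are placeholders, not from my actual run) that compares the first-step loss with and without hpZ. Launch with something like: deepspeed --num_gpus 8 repro.py

# Hypothetical standalone repro sketch, not part of the original Axolotl run.
import torch
import deepspeed

USE_HPZ = True  # flip to False and compare the printed first-step loss

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 2e-5}},
    "zero_optimization": {"stage": 3},
}
if USE_HPZ:
    # hpZ keeps a secondary (intra-node) copy of the ZeRO-3 weight shards
    ds_config["zero_optimization"]["zero_hpz_partition_size"] = 8

# Toy model standing in for Mistral-Small-24B
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
)

engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

x = torch.randn(1, 1024, device=engine.device, dtype=torch.half)
y = torch.randn(1, 1024, device=engine.device, dtype=torch.half)

loss = torch.nn.functional.mse_loss(engine(x).float(), y.float())
engine.backward(loss)
engine.step()
if engine.global_rank == 0:
    print("first-step loss:", loss.item())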

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
 [WARNING]  NVIDIA Inference is only supported on Ampere and newer architectures
 [WARNING]  FP Quantizer is using an untested triton version (3.2.0), only 2.3.(0, 1) and 3.0.0 are known to be compatible with these kernels
fp_quantizer ........... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
INFO:root:x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -c /tmp/tmpb26lo_n0/test.c -o /tmp/tmpb26lo_n0/test.o
INFO:root:x86_64-linux-gnu-gcc /tmp/tmpb26lo_n0/test.o -L/usr/local/cuda -L/usr/local/cuda/lib64 -lcufile -o /tmp/tmpb26lo_n0/a.out
gds .................... [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.5
 [WARNING]  using untested triton version (3.2.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.10/dist-packages/torch']
torch version .................... 2.5.0a0+e000cf0ad9.nv24.10
deepspeed install path ........... ['/usr/local/lib/python3.10/dist-packages/deepspeed']
deepspeed info ................... 0.16.4, unknown, unknown
torch cuda version ............... 12.6
torch hip version ................ None
nvcc version ..................... 12.6
deepspeed wheel compiled w. ...... torch 2.5, cuda 12.6
shared memory (/dev/shm) size .... 377.15 GB

System info (please complete the following information):

  • Ubuntu 22.04
  • Four machines with 8x V100 GPUs each
  • Python 3.10

Launcher context

#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --gpus-per-node=8
...
export TORCH_CUDA_ARCH_LIST="7.0"
export OMP_NUM_THREADS=1
export MAX_JOBS=16
export PYTHONUNBUFFERED=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export TOKENIZERS_PARALLELISM=false
export NCCL_SOCKET_IFNAME=ib0
export NCCL_IB_HCA=mlx5_0
export GPUS_PER_NODE=8
export NNODES=$SLURM_NNODES

export CMD="torchrun \
  --nnodes $NNODES \
  --nproc_per_node 8 \
  --rdzv_id $SLURM_JOB_ID \
  --rdzv_backend c10d \
  --rdzv_endpoint "$MASTER_ADDR:$MASTER_PORT" \
  -m axolotl.cli.train train.yml \
  "
....
srun $SRUN_ARGS --jobid $SLURM_JOBID $SINGULARITY_RUN bash -c '$CMD'

Docker context
Image: alexht/axolotl:24.10-py3

FROM nvcr.io/nvidia/pytorch:24.10-py3
ENV TORCH_CUDA_ARCH_LIST="7.0"
ARG AXOLOTL_EXTRAS=""
ARG AXOLOTL_ARGS=""

RUN apt-get update \
    && apt-get install -y --allow-change-held-packages \
       wget git build-essential ninja-build git-lfs libaio-dev pkg-config \
       vim curl nano libnccl2 libnccl-dev rsync s3fs \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /workspace

RUN MAX_JOBS=8 pip install -v -U git+https://github.com/facebookresearch/[email protected]#egg=xformers 
RUN DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_QUANTIZER=1 \
    pip install deepspeed --global-option="build_ext" --global-option="-j8"

RUN git clone https://github.com/alex-ht/axolotl.git

WORKDIR /workspace/axolotl

# If AXOLOTL_EXTRAS is set, append it in brackets
RUN if [ "$AXOLOTL_EXTRAS" != "" ] ; then \
        pip install --no-build-isolation -e .[optimizers,ray,$AXOLOTL_EXTRAS] $AXOLOTL_ARGS; \
    else \
        pip install --no-build-isolation -e .[optimizers,ray] $AXOLOTL_ARGS; \
    fi

RUN python scripts/unsloth_install.py | sh
RUN python scripts/cutcrossentropy_install.py | sh

# So we can test the Docker image
#RUN pip install pytest

# fix so that git fetch/pull from remote works
RUN git config remote.origin.fetch "+refs/heads/*:refs/remotes/origin/*" && \
    git config --get remote.origin.fetch

# helper for huggingface-login cli
RUN git config --global credential.helper store