
[BUG] Enabling hpZ causes an abnormally large loss. #7164

Open · alex-ht opened this issue Mar 21, 2025 · 0 comments
Labels: bug (Something isn't working), training

alex-ht commented Mar 21, 2025

Describe the bug
Enabling hpZ (ZeRO++ hierarchical weight partitioning, set via zero_hpz_partition_size) causes an abnormally large training loss.

To Reproduce
Steps to reproduce the behavior:

  1. Add "zero_hpz_partition_size": 8, to deepspeed config file.
  2. Run any SFT training using Axolotl.
    Here is my setting:
base_model: mistralai/Mistral-Small-24B-Instruct-2501

plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: true

datasets:
  - path: yentinglin/s1K-1.1-trl-format
    type: chat_template
    chat_template: tokenizer_default
    field_messages: messages
    message_field_role: role
    message_field_content: content
dataset_prepared_path:
val_set_size: 0.01
output_dir: ./placeholder/

sequence_len: 2048
sample_packing: false
eval_sample_packing: False
pad_to_sequence_len: true
eval_steps: 1

gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 5
optimizer: adamw_torch_fused
lr_scheduler: cosine
learning_rate: 2e-5
max_grad_norm: 1.0

train_on_inputs: false
group_by_length: false
bf16: false
fp16: true
tf32: false
bfloat16: false
float16: true

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: true
logging_steps: 1
flash_attention: false
xformers_attention: false
sdp_attention: false

warmup_ratio: 0.1
saves_per_epoch: 2
weight_decay: 0.0
deepspeed: deepspeed_configs/zero3_pp.json

The first iteration will result in a loss of ~11, whereas it should normally be around 1.3.
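
For reference, deepspeed_configs/zero3_pp.json is not reproduced above. As a minimal sketch only (the actual file may differ in other fields), a ZeRO-3 config with hpZ enabled places zero_hpz_partition_size inside the zero_optimization block:

{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "zero_hpz_partition_size": 8,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "fp16": {
    "enabled": "auto"
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}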

Expected behavior
The loss should start at around 1.3 and gradually decrease.
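
To help isolate this from Axolotl, here is a minimal standalone sketch (the toy model and the script name repro.py are placeholders, not from my actual run) that compares the first-step loss with and without hpZ. Launch with something like: deepspeed --num_gpus 8 repro.py

# Hypothetical standalone repro sketch, not part of the original Axolotl run.
import torch
import deepspeed

USE_HPZ = True  # flip to False and compare the printed first-step loss

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 2e-5}},
    "zero_optimization": {"stage": 3},
}
if USE_HPZ:
    # hpZ keeps a secondary (intra-node) copy of the ZeRO-3 weight shards
    ds_config["zero_optimization"]["zero_hpz_partition_size"] = 8

# Toy model standing in for Mistral-Small-24B
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
)

engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

x = torch.randn(1, 1024, device=engine.device, dtype=torch.half)
y = torch.randn(1, 1024, device=engine.device, dtype=torch.half)

loss = torch.nn.functional.mse_loss(engine(x).float(), y.float())
engine.backward(loss)
engine.step()
if engine.global_rank == 0:
    print("first-step loss:", loss.item())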

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
 [WARNING]  NVIDIA Inference is only supported on Ampere and newer architectures
 [WARNING]  FP Quantizer is using an untested triton version (3.2.0), only 2.3.(0, 1) and 3.0.0 are known to be compatible with these kernels
fp_quantizer ........... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
INFO:root:x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -c /tmp/tmpb26lo_n0/test.c -o /tmp/tmpb26lo_n0/test.o
INFO:root:x86_64-linux-gnu-gcc /tmp/tmpb26lo_n0/test.o -L/usr/local/cuda -L/usr/local/cuda/lib64 -lcufile -o /tmp/tmpb26lo_n0/a.out
gds .................... [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.5
 [WARNING]  using untested triton version (3.2.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.10/dist-packages/torch']
torch version .................... 2.5.0a0+e000cf0ad9.nv24.10
deepspeed install path ........... ['/usr/local/lib/python3.10/dist-packages/deepspeed']
deepspeed info ................... 0.16.4, unknown, unknown
torch cuda version ............... 12.6
torch hip version ................ None
nvcc version ..................... 12.6
deepspeed wheel compiled w. ...... torch 2.5, cuda 12.6
shared memory (/dev/shm) size .... 377.15 GB

System info (please complete the following information):

  • Ubuntu 22.04
  • Four machines with 8x V100 GPUs each
  • Python 3.10

Launcher context

#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --gpus-per-node=8
...
export TORCH_CUDA_ARCH_LIST="7.0"
export OMP_NUM_THREADS=1
export MAX_JOBS=16
export PYTHONUNBUFFERED=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export TOKENIZERS_PARALLELISM=false
export NCCL_SOCKET_IFNAME=ib0
export NCCL_IB_HCA=mlx5_0
export GPUS_PER_NODE=8
export NNODES=$SLURM_NNODES

export CMD="torchrun \
  --nnodes $NNODES \
  --nproc_per_node 8 \
  --rdzv_id $SLURM_JOB_ID \
  --rdzv_backend c10d \
  --rdzv_endpoint "$MASTER_ADDR:$MASTER_PORT" \
  -m axolotl.cli.train train.yml \
  "
....
srun $SRUN_ARGS --jobid $SLURM_JOBID $SINGULARITY_RUN bash -c '$CMD'

Docker context
Image: alexht/axolotl:24.10-py3

FROM nvcr.io/nvidia/pytorch:24.10-py3
ENV TORCH_CUDA_ARCH_LIST="7.0"
ARG AXOLOTL_EXTRAS=""
ARG AXOLOTL_ARGS=""

RUN apt-get update \
    && apt-get install -y --allow-change-held-packages \
       wget git build-essential ninja-build git-lfs libaio-dev pkg-config \
       vim curl nano libnccl2 libnccl-dev rsync s3fs \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /workspace

RUN MAX_JOBS=8 pip install -v -U git+https://github.com/facebookresearch/[email protected]#egg=xformers 
RUN DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_QUANTIZER=1 \
    pip install deepspeed --global-option="build_ext" --global-option="-j8"

RUN git clone https://github.com/alex-ht/axolotl.git

WORKDIR /workspace/axolotl

# If AXOLOTL_EXTRAS is set, append it in brackets
RUN if [ "$AXOLOTL_EXTRAS" != "" ] ; then \
        pip install --no-build-isolation -e .[optimizers,ray,$AXOLOTL_EXTRAS] $AXOLOTL_ARGS; \
    else \
        pip install --no-build-isolation -e .[optimizers,ray] $AXOLOTL_ARGS; \
    fi

RUN python scripts/unsloth_install.py | sh
RUN python scripts/cutcrossentropy_install.py | sh

# So we can test the Docker image
#RUN pip install pytest

# fix so that git fetch/pull from remote works
RUN git config remote.origin.fetch "+refs/heads/*:refs/remotes/origin/*" && \
    git config --get remote.origin.fetch

# helper for huggingface-login cli
RUN git config --global credential.helper store