Describe the bug
Enabling hpZ (zero_hpz_partition_size in the ZeRO config) causes an abnormally large training loss from the very first iteration.
To Reproduce
Steps to reproduce the behavior:
- Add
"zero_hpz_partition_size": 8,
to the DeepSpeed config file.
- Run any SFT training using Axolotl.
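For reference, here is a minimal sketch of what the relevant part of the DeepSpeed config might look like with hpZ enabled. The actual contents of deepspeed_configs/zero3_pp.json are not shown in this report, so everything other than "zero_hpz_partition_size": 8 is an assumption: the fp16 and batch values are copied from the Axolotl settings in this report, and the helper name is hypothetical.

```python
import json


def build_zero3_hpz_config(hpz_partition_size: int = 8) -> dict:
    """Build a minimal ZeRO-3 config dict with hpZ enabled (illustrative only)."""
    return {
        "zero_optimization": {
            "stage": 3,
            # The only key taken from this report; it enables the hpZ secondary partition.
            "zero_hpz_partition_size": hpz_partition_size,
        },
        # Assumed: mirrors "fp16: true" in the Axolotl config.
        "fp16": {"enabled": True},
        # Assumed: mirrors micro_batch_size and gradient_accumulation_steps.
        "train_micro_batch_size_per_gpu": 1,
        "gradient_accumulation_steps": 1,
    }


config = build_zero3_hpz_config()
config_text = json.dumps(config, indent=2)
print(config_text)
```

Rebuilding the config with a partition size of 1 (which, as far as I know, is the DeepSpeed default and leaves hpZ off) would give a baseline to compare the first-step loss against.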
Here is my Axolotl config:
base_model: mistralai/Mistral-Small-24B-Instruct-2501
plugins:
- axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: true
datasets:
  - path: yentinglin/s1K-1.1-trl-format
    type: chat_template
    chat_template: tokenizer_default
    field_messages: messages
    message_field_role: role
    message_field_content: content
dataset_prepared_path:
val_set_size: 0.01
output_dir: ./placeholder/
sequence_len: 2048
sample_packing: false
eval_sample_packing: False
pad_to_sequence_len: true
eval_steps: 1
gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 5
optimizer: adamw_torch_fused
lr_scheduler: cosine
learning_rate: 2e-5
max_grad_norm: 1.0
train_on_inputs: false
group_by_length: false
bf16: false
fp16: true
tf32: false
bfloat16: false
float16: true
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: true
logging_steps: 1
flash_attention: false
xformers_attention: false
sdp_attention: false
warmup_ratio: 0.1
saves_per_epoch: 2
weight_decay: 0.0
deepspeed: deepspeed_configs/zero3_pp.json
With hpZ enabled, the first iteration reports a loss of ~11, whereas it is normally around 1.3.
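As a rough sanity check on how bad a loss of ~11 is, it can be compared with the cross-entropy of a model that guesses uniformly over the vocabulary, which is ln(V) for a vocabulary of V tokens. The sketch below assumes a vocabulary size of 131072 for Mistral-Small-24B-Instruct-2501; that number is not stated in this report and should be checked against the model's config.json. The function name is hypothetical.

```python
import math


def uniform_guess_loss(vocab_size: int) -> float:
    """Cross-entropy (in nats) of a uniform prediction over vocab_size tokens."""
    return math.log(vocab_size)


# Assumption: vocabulary size of the base model; verify against its config.json.
ASSUMED_VOCAB_SIZE = 131072

chance_loss = uniform_guess_loss(ASSUMED_VOCAB_SIZE)
print(f"uniform-guess loss: {chance_loss:.2f}")
# Average probability assigned to the correct token at the normal starting loss.
print(f"per-token probability at loss 1.3: {math.exp(-1.3):.3f}")
```

With the assumed vocabulary size this gives about 11.78, so a first-step loss of ~11 is close to what a model predicting at chance would produce. That would be consistent with the weights seen in the forward pass being wrong rather than with an ordinary optimization problem, although this check alone does not prove it.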
Expected behavior
The loss should start at around 1.3 and decrease gradually.
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
[WARNING] NVIDIA Inference is only supported on Ampere and newer architectures
[WARNING] FP Quantizer is using an untested triton version (3.2.0), only 2.3.(0, 1) and 3.0.0 are known to be compatible with these kernels
fp_quantizer ........... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
INFO:root:x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -c /tmp/tmpb26lo_n0/test.c -o /tmp/tmpb26lo_n0/test.o
INFO:root:x86_64-linux-gnu-gcc /tmp/tmpb26lo_n0/test.o -L/usr/local/cuda -L/usr/local/cuda/lib64 -lcufile -o /tmp/tmpb26lo_n0/a.out
gds .................... [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.5
[WARNING] using untested triton version (3.2.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.10/dist-packages/torch']
torch version .................... 2.5.0a0+e000cf0ad9.nv24.10
deepspeed install path ........... ['/usr/local/lib/python3.10/dist-packages/deepspeed']
deepspeed info ................... 0.16.4, unknown, unknown
torch cuda version ............... 12.6
torch hip version ................ None
nvcc version ..................... 12.6
deepspeed wheel compiled w. ...... torch 2.5, cuda 12.6
shared memory (/dev/shm) size .... 377.15 GB
System info (please complete the following information):
- Ubuntu 22.04
- Four machines with 8x V100 GPUs each
- Python 3.10
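To put the partition size in context: 4 machines with 8 GPUs each give a world size of 32, so a zero_hpz_partition_size of 8 corresponds to one secondary partition group per machine. The sketch below checks that arithmetic. It assumes ranks are numbered contiguously within each machine and that hpZ groups are contiguous blocks of ranks; both are assumptions about the launcher and DeepSpeed rather than facts from this report, and the function names are hypothetical.

```python
def hpz_rank_groups(world_size: int, hpz_partition_size: int) -> list[list[int]]:
    """Split ranks 0..world_size-1 into contiguous groups of hpz_partition_size."""
    if hpz_partition_size <= 0 or world_size % hpz_partition_size != 0:
        raise ValueError("hpz_partition_size must evenly divide world_size")
    return [
        list(range(start, start + hpz_partition_size))
        for start in range(0, world_size, hpz_partition_size)
    ]


def groups_fit_on_node(groups: list[list[int]], gpus_per_node: int) -> bool:
    """True if every group's first and last rank live on the same node."""
    return all(g[0] // gpus_per_node == g[-1] // gpus_per_node for g in groups)


NODES = 4          # from the system info in this report
GPUS_PER_NODE = 8  # from the system info in this report

groups = hpz_rank_groups(NODES * GPUS_PER_NODE, hpz_partition_size=8)
print(f"{len(groups)} hpZ groups, intra-node only: {groups_fit_on_node(groups, GPUS_PER_NODE)}")
```

Under these assumptions the setting used here keeps each secondary partition inside a single machine, so the partition size itself does not look misconfigured for this cluster.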
Launcher context
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --gpus-per-node=8
...
export TORCH_CUDA_ARCH_LIST="7.0"
export OMP_NUM_THREADS=1
export MAX_JOBS=16
export PYTHONUNBUFFERED=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export TOKENIZERS_PARALLELISM=false
export NCCL_SOCKET_IFNAME=ib0
export NCCL_IB_HCA=mlx5_0
export GPUS_PER_NODE=8
export NNODES=$SLURM_NNODES
export CMD="torchrun \
--nnodes $NNODES \
--nproc_per_node 8 \
--rdzv_id $SLURM_JOB_ID \
--rdzv_backend c10d \
--rdzv_endpoint "$MASTER_ADDR:$MASTER_PORT" \
-m axolotl.cli.train train.yml \
"
....
srun $SRUN_ARGS --jobid $SLURM_JOBID $SINGULARITY_RUN bash -c '$CMD'
Docker context
Image: alexht/axolotl:24.10-py3
FROM nvcr.io/nvidia/pytorch:24.10-py3
ENV TORCH_CUDA_ARCH_LIST="7.0"
ARG AXOLOTL_EXTRAS=""
ARG AXOLOTL_ARGS=""
RUN apt-get update \
&& apt-get install -y --allow-change-held-packages \
wget git build-essential ninja-build git-lfs libaio-dev pkg-config \
vim curl nano libnccl2 libnccl-dev rsync s3fs \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /workspace
RUN MAX_JOBS=8 pip install -v -U git+https://github.com/facebookresearch/[email protected]#egg=xformers
RUN DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_QUANTIZER=1 \
pip install deepspeed --global-option="build_ext" --global-option="-j8"
RUN git clone https://github.com/alex-ht/axolotl.git
WORKDIR /workspace/axolotl
# If AXOLOTL_EXTRAS is set, append it in brackets
RUN if [ "$AXOLOTL_EXTRAS" != "" ] ; then \
pip install --no-build-isolation -e .[optimizers,ray,$AXOLOTL_EXTRAS] $AXOLOTL_ARGS; \
else \
pip install --no-build-isolation -e .[optimizers,ray] $AXOLOTL_ARGS; \
fi
RUN python scripts/unsloth_install.py | sh
RUN python scripts/cutcrossentropy_install.py | sh
# So we can test the Docker image
#RUN pip install pytest
# fix so that git fetch/pull from remote works
RUN git config remote.origin.fetch "+refs/heads/*:refs/remotes/origin/*" && \
git config --get remote.origin.fetch
# helper for huggingface-login cli
RUN git config --global credential.helper store