**Describe the bug**
Enabling hpZ (ZeRO++ hierarchical partitioning) causes an abnormally large loss.
**To Reproduce**
Steps to reproduce the behavior: add

```json
"zero_hpz_partition_size": 8,
```

to the DeepSpeed config file. Here is my setting:

```yaml
base_model: mistralai/Mistral-Small-24B-Instruct-2501
plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: true

datasets:
  - path: yentinglin/s1K-1.1-trl-format
    type: chat_template
    chat_template: tokenizer_default
    field_messages: messages
    message_field_role: role
    message_field_content: content
dataset_prepared_path:
val_set_size: 0.01
output_dir: ./placeholder/

sequence_len: 2048
sample_packing: false
eval_sample_packing: False
pad_to_sequence_len: true

eval_steps: 1
gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 5
optimizer: adamw_torch_fused
lr_scheduler: cosine
learning_rate: 2e-5
max_grad_norm: 1.0

train_on_inputs: false
group_by_length: false
bf16: false
fp16: true
tf32: false
bfloat16: false
float16: true

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: true

logging_steps: 1
flash_attention: false
xformers_attention: false
sdp_attention: false

warmup_ratio: 0.1
saves_per_epoch: 2
weight_decay: 0.0
deepspeed: deepspeed_configs/zero3_pp.json
```
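For reference, the `deepspeed_configs/zero3_pp.json` file itself is not included in this report. The sketch below shows a minimal ZeRO-3 config with the hpZ key placed where DeepSpeed expects it (inside `zero_optimization`); every field other than `"zero_hpz_partition_size": 8` is an assumption about what that file might contain, not its actual contents:

```json
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_gather_16bit_weights_on_model_save": true,
    "zero_hpz_partition_size": 8
  },
  "bf16": { "enabled": "auto" },
  "fp16": { "enabled": "auto" },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto",
  "wall_clock_breakdown": false
}
```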
The first iteration results in a loss of ~11, whereas it should normally be around 1.3.
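As a quick A/B check, hpZ can be switched back off by setting the key to its documented DeepSpeed default of 1 (a secondary partition group of size 1 effectively disables hpZ), leaving the rest of the config untouched. The excerpt below is a sketch of that toggle, not the actual `zero3_pp.json`:

```json
{
  "zero_optimization": {
    "stage": 3,
    "zero_hpz_partition_size": 1
  }
}
```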
**Expected behavior**
The loss should gradually decrease from around 1.3.
**ds_report output**
```
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
[WARNING] NVIDIA Inference is only supported on Ampere and newer architectures
[WARNING] FP Quantizer is using an untested triton version (3.2.0), only 2.3.(0, 1) and 3.0.0 are known to be compatible with these kernels
fp_quantizer ........... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
INFO:root:x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -c /tmp/tmpb26lo_n0/test.c -o /tmp/tmpb26lo_n0/test.o
INFO:root:x86_64-linux-gnu-gcc /tmp/tmpb26lo_n0/test.o -L/usr/local/cuda -L/usr/local/cuda/lib64 -lcufile -o /tmp/tmpb26lo_n0/a.out
gds .................... [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.5
[WARNING] using untested triton version (3.2.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.10/dist-packages/torch']
torch version .................... 2.5.0a0+e000cf0ad9.nv24.10
deepspeed install path ........... ['/usr/local/lib/python3.10/dist-packages/deepspeed']
deepspeed info ................... 0.16.4, unknown, unknown
torch cuda version ............... 12.6
torch hip version ................ None
nvcc version ..................... 12.6
deepspeed wheel compiled w. ...... torch 2.5, cuda 12.6
shared memory (/dev/shm) size .... 377.15 GB
```
**System info (please complete the following information):**
**Launcher context**
```bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --gpus-per-node=8
...
export TORCH_CUDA_ARCH_LIST="7.0"
export OMP_NUM_THREADS=1
export MAX_JOBS=16
export PYTHONUNBUFFERED=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export TOKENIZERS_PARALLELISM=false
export NCCL_SOCKET_IFNAME=ib0
export NCCL_IB_HCA=mlx5_0
export GPUS_PER_NODE=8
export NNODES=$SLURM_NNODES
export CMD="torchrun \
    --nnodes $NNODES \
    --nproc_per_node 8 \
    --rdzv_id $SLURM_JOB_ID \
    --rdzv_backend c10d \
    --rdzv_endpoint "$MASTER_ADDR:$MASTER_PORT" \
    -m axolotl.cli.train train.yml \
    "
....
srun $SRUN_ARGS --jobid $SLURM_JOBID $SINGULARITY_RUN bash -c '$CMD'
```
**Docker context**
Image: `alexht/axolotl:24.10-py3`
```dockerfile
FROM nvcr.io/nvidia/pytorch:24.10-py3

ENV TORCH_CUDA_ARCH_LIST="7.0"
ARG AXOLOTL_EXTRAS=""
ARG AXOLOTL_ARGS=""

RUN apt-get update \
    && apt-get install -y --allow-change-held-packages \
    wget git build-essential ninja-build git-lfs libaio-dev pkg-config \
    vim curl nano libnccl2 libnccl-dev rsync s3fs \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /workspace

RUN MAX_JOBS=8 pip install -v -U git+https://github.com/facebookresearch/[email protected]#egg=xformers
RUN DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_QUANTIZER=1 \
    pip install deepspeed --global-option="build_ext" --global-option="-j8"

RUN git clone https://github.com/alex-ht/axolotl.git
WORKDIR /workspace/axolotl

# If AXOLOTL_EXTRAS is set, append it in brackets
RUN if [ "$AXOLOTL_EXTRAS" != "" ] ; then \
        pip install --no-build-isolation -e .[optimizers,ray,$AXOLOTL_EXTRAS] $AXOLOTL_ARGS; \
    else \
        pip install --no-build-isolation -e .[optimizers,ray] $AXOLOTL_ARGS; \
    fi

RUN python scripts/unsloth_install.py | sh
RUN python scripts/cutcrossentropy_install.py | sh

# So we can test the Docker image
#RUN pip install pytest

# fix so that git fetch/pull from remote works
RUN git config remote.origin.fetch "+refs/heads/*:refs/remotes/origin/*" && \
    git config --get remote.origin.fetch

# helper for huggingface-login cli
RUN git config --global credential.helper store
```