
[rank0]: RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 1694 but got size 1404 for tensor number 2 in the list. #1862

@mukherjeesougata-eros

Description


I encountered a runtime error while running a modified version of run.sh for CosyVoice3 training. The error occurs during the flow model training stage, after llm training had been resumed from a checkpoint.

Error log

[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/data0/Sougata/TTS/CosyVoice_l/CosyVoice/examples/libritts/cosyvoice3/../../../cosyvoice/bin/train.py", line 195, in <module>
[rank0]:     main()
[rank0]:   File "/mnt/data0/anaconda_dir/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
[rank0]:     return f(*args, **kwargs)
[rank0]:   File "/mnt/data0/Sougata/TTS/CosyVoice_l/CosyVoice/examples/libritts/cosyvoice3/../../../cosyvoice/bin/train.py", line 190, in main
[rank0]:     executor.train_one_epoc(model, optimizer, scheduler, train_data_loader, cv_data_loader, writer, info_dict, scaler, group_join, ref_model=ref_model)
[rank0]:   File "/mnt/data0/Sougata/TTS/CosyVoice_l/CosyVoice/cosyvoice/utils/executor.py", line 72, in train_one_epoc
[rank0]:     info_dict = batch_forward(model, batch_dict, scaler, info_dict, ref_model=self.ref_model, dpo_loss=self.dpo_loss)
[rank0]:   File "/mnt/data0/Sougata/TTS/CosyVoice_l/CosyVoice/cosyvoice/utils/train_utils.py", line 255, in batch_forward
[rank0]:     info_dict['loss_dict'] = model(batch, device)
[rank0]:   File "/mnt/data0/anaconda_dir/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/mnt/data0/anaconda_dir/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/mnt/data0/anaconda_dir/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1593, in forward
[rank0]:     else self._run_ddp_forward(*inputs, **kwargs)
[rank0]:   File "/mnt/data0/anaconda_dir/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1411, in _run_ddp_forward
[rank0]:     return self.module(*inputs, **kwargs)  # type: ignore[index]
[rank0]:   File "/mnt/data0/anaconda_dir/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/mnt/data0/anaconda_dir/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/mnt/data0/Sougata/TTS/CosyVoice_l/CosyVoice/cosyvoice/flow/flow.py", line 359, in forward
[rank0]:     loss, _ = self.decoder.compute_loss(
[rank0]:   File "/mnt/data0/Sougata/TTS/CosyVoice_l/CosyVoice/cosyvoice/flow/flow_matching.py", line 191, in compute_loss
[rank0]:     pred = self.estimator(y, mask, mu, t.squeeze(), spks, cond, streaming=streaming)
[rank0]:   File "/mnt/data0/anaconda_dir/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/mnt/data0/anaconda_dir/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/mnt/data0/Sougata/TTS/CosyVoice_l/CosyVoice/cosyvoice/flow/DiT/dit.py", line 158, in forward
[rank0]:     x = self.input_embed(x, cond, mu, spks.squeeze(1))
[rank0]:   File "/mnt/data0/anaconda_dir/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/mnt/data0/anaconda_dir/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/mnt/data0/Sougata/TTS/CosyVoice_l/CosyVoice/cosyvoice/flow/DiT/dit.py", line 98, in forward
[rank0]:     x = self.proj(torch.cat(to_cat, dim=-1))
[rank0]: RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 1694 but got size 1404 for tensor number 2 in the list.
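For context, this RuntimeError comes from torch.cat: all tensors being concatenated must have the same size in every dimension except the concatenation dimension. A minimal standalone reproduction (shapes chosen only to mirror the numbers in the log; dim 1 here stands in for the time axis):

```python
import torch

# Three tensors concatenated along the last axis (dim 2), as in
# input_embed's torch.cat(to_cat, dim=-1). The third tensor ("tensor
# number 2 in the list", 0-indexed) has a shorter time axis (dim 1),
# which raises the same error as in the traceback above.
x = torch.randn(1, 1694, 80)
cond = torch.randn(1, 1694, 80)
mu = torch.randn(1, 1404, 80)  # mismatched time length

try:
    torch.cat([x, cond, mu], dim=-1)
except RuntimeError as e:
    print(e)
```

This suggests that, in the failing batch, one of the inputs to input_embed (x, cond, mu, or spks) has a different time length than the others.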

Changes Made and Training Sequence

  • llm training was running but was stopped and restarted several times.

  • I resumed llm training with:

    --checkpoint path/to/epoch_{epoch}_whole.pt

    where epoch_{epoch}_whole.pt is the last checkpoint saved during llm training.

    Example:

    --checkpoint /mnt/.../epoch_199_whole.pt

  • Training was initially started with:

    max_frames_in_batch = 2000

    Later I changed it to:

    max_frames_in_batch = 84000

    and then resumed llm training from checkpoint epoch_199_whole.pt.

  • After llm training finished, flow training started and then failed with the error above.

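One way to narrow this down is to log the time-axis length of each tensor just before the failing torch.cat in dit.py. The helper below (check_time_dims is a hypothetical debug function, not part of CosyVoice) raises a descriptive error instead of the opaque cat failure:

```python
import torch

def check_time_dims(named_tensors):
    """Debug helper (hypothetical): collect the time-axis (dim 1) length
    of each tensor and raise a descriptive error on any mismatch, to be
    called just before a torch.cat along dim=-1."""
    lengths = {name: t.shape[1] for name, t in named_tensors.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"time-dim mismatch before cat: {lengths}")
    return lengths

# Example usage with consistent shapes:
check_time_dims({"x": torch.randn(1, 1694, 80),
                 "cond": torch.randn(1, 1694, 80)})
```

Calling this with the actual x, cond, and mu inside input_embed would show which input carries the 1404-frame length and help trace it back to the data pipeline or feature extraction.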
Modification in run.sh (Stage 5)

if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
  echo "Run train. We only support llm training for now"
  if [ $train_engine == 'deepspeed' ]; then
    echo "Notice deepspeed has its own optimizer config. Modify conf/ds_stage2.json if necessary"
  fi
  cat data/{train-clean-100,train-clean-360,train-other-500}/parquet/data.list > data/train.data.list
  cat data/{dev-clean,dev-other}/parquet/data.list > data/dev.data.list
  #for model in llm flow hifigan; do
  torchrun --nnodes=1 --nproc_per_node=$num_gpus \
      --rdzv_id=$job_id --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
    ../../../cosyvoice/bin/train.py \
    --train_engine $train_engine \
    --config conf/cosyvoice3.yaml \
    --train_data data/train.data.list \
    --cv_data data/dev.data.list \
    --qwen_pretrain_path $pretrained_model_dir/CosyVoice-BlankEN \
    --onnx_path $pretrained_model_dir \
    --model llm \
    --checkpoint /mnt/data0/Sougata/TTS/CosyVoice_l/CosyVoice/examples/libritts/cosyvoice3/exp/cosyvoice3/llm/torch_ddp/epoch_199_whole.pt \
    --model_dir `pwd`/exp/cosyvoice3/llm/$train_engine \
    --tensorboard_dir `pwd`/tensorboard/cosyvoice3/llm/$train_engine \
    --ddp.dist_backend $dist_backend \
    --num_workers ${num_workers} \
    --prefetch ${prefetch} \
    --pin_memory \
    --use_amp \
    --deepspeed_config ./conf/ds_stage2.json \
    --deepspeed.save_states model+optimizer

  torchrun --nnodes=1 --nproc_per_node=$num_gpus \
      --rdzv_id=$job_id --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
    ../../../cosyvoice/bin/train.py \
    --train_engine $train_engine \
    --config conf/cosyvoice3.yaml \
    --train_data data/train.data.list \
    --cv_data data/dev.data.list \
    --qwen_pretrain_path $pretrained_model_dir/CosyVoice-BlankEN \
    --onnx_path $pretrained_model_dir \
    --model flow \
    --checkpoint $pretrained_model_dir/flow.pt \
    --model_dir `pwd`/exp/cosyvoice3/flow/$train_engine \
    --tensorboard_dir `pwd`/tensorboard/cosyvoice3/flow/$train_engine \
    --ddp.dist_backend $dist_backend \
    --num_workers ${num_workers} \
    --prefetch ${prefetch} \
    --pin_memory \
    --use_amp \
    --deepspeed_config ./conf/ds_stage2.json \
    --deepspeed.save_states model+optimizer

  torchrun --nnodes=1 --nproc_per_node=$num_gpus \
      --rdzv_id=$job_id --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
    ../../../cosyvoice/bin/train.py \
    --train_engine $train_engine \
    --config conf/cosyvoice3.yaml \
    --train_data data/train.data.list \
    --cv_data data/dev.data.list \
    --qwen_pretrain_path $pretrained_model_dir/CosyVoice-BlankEN \
    --onnx_path $pretrained_model_dir \
    --model hifigan \
    --checkpoint $pretrained_model_dir/hifigan.pt \
    --model_dir `pwd`/exp/cosyvoice3/hifigan/$train_engine \
    --tensorboard_dir `pwd`/tensorboard/cosyvoice3/hifigan/$train_engine \
    --ddp.dist_backend $dist_backend \
    --num_workers ${num_workers} \
    --prefetch ${prefetch} \
    --pin_memory \
    --use_amp \
    --deepspeed_config ./conf/ds_stage2.json \
    --deepspeed.save_states model+optimizer 
  #done
fi

The modified bash script used for running the training is attached as run_custom.sh.

Please help resolve this error.
