Description
I encountered a runtime error while running a modified version of run.sh for CosyVoice3 training. The error occurs during the flow model training stage after resuming llm training from a checkpoint.
Error log
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/data0/Sougata/TTS/CosyVoice_l/CosyVoice/examples/libritts/cosyvoice3/../../../cosyvoice/bin/train.py", line 195, in <module>
[rank0]: main()
[rank0]: File "/mnt/data0/anaconda_dir/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
[rank0]: return f(*args, **kwargs)
[rank0]: File "/mnt/data0/Sougata/TTS/CosyVoice_l/CosyVoice/examples/libritts/cosyvoice3/../../../cosyvoice/bin/train.py", line 190, in main
[rank0]: executor.train_one_epoc(model, optimizer, scheduler, train_data_loader, cv_data_loader, writer, info_dict, scaler, group_join, ref_model=ref_model)
[rank0]: File "/mnt/data0/Sougata/TTS/CosyVoice_l/CosyVoice/cosyvoice/utils/executor.py", line 72, in train_one_epoc
[rank0]: info_dict = batch_forward(model, batch_dict, scaler, info_dict, ref_model=self.ref_model, dpo_loss=self.dpo_loss)
[rank0]: File "/mnt/data0/Sougata/TTS/CosyVoice_l/CosyVoice/cosyvoice/utils/train_utils.py", line 255, in batch_forward
[rank0]: info_dict['loss_dict'] = model(batch, device)
[rank0]: File "/mnt/data0/anaconda_dir/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/mnt/data0/anaconda_dir/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/mnt/data0/anaconda_dir/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1593, in forward
[rank0]: else self._run_ddp_forward(*inputs, **kwargs)
[rank0]: File "/mnt/data0/anaconda_dir/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1411, in _run_ddp_forward
[rank0]: return self.module(*inputs, **kwargs) # type: ignore[index]
[rank0]: File "/mnt/data0/anaconda_dir/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/mnt/data0/anaconda_dir/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/mnt/data0/Sougata/TTS/CosyVoice_l/CosyVoice/cosyvoice/flow/flow.py", line 359, in forward
[rank0]: loss, _ = self.decoder.compute_loss(
[rank0]: File "/mnt/data0/Sougata/TTS/CosyVoice_l/CosyVoice/cosyvoice/flow/flow_matching.py", line 191, in compute_loss
[rank0]: pred = self.estimator(y, mask, mu, t.squeeze(), spks, cond, streaming=streaming)
[rank0]: File "/mnt/data0/anaconda_dir/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/mnt/data0/anaconda_dir/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/mnt/data0/Sougata/TTS/CosyVoice_l/CosyVoice/cosyvoice/flow/DiT/dit.py", line 158, in forward
[rank0]: x = self.input_embed(x, cond, mu, spks.squeeze(1))
[rank0]: File "/mnt/data0/anaconda_dir/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/mnt/data0/anaconda_dir/miniconda3/envs/cosyvoice/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/mnt/data0/Sougata/TTS/CosyVoice_l/CosyVoice/cosyvoice/flow/DiT/dit.py", line 98, in forward
[rank0]: x = self.proj(torch.cat(to_cat, dim=-1))
[rank0]: RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 1694 but got size 1404 for tensor number 2 in the list.
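For clarity, this is a tensor-shape mismatch at the concatenation inside input_embed: "tensor number 2" is zero-indexed, so the third tensor passed to torch.cat (plausibly mu, the encoder output, given the call order in dit.py) has time length 1404 while the others have 1694. Below is a minimal sketch that reproduces the same RuntimeError; the shapes are illustrative, and the 80-channel width and tensor ordering are my assumptions, not values from this run:

import torch

# Illustrative shapes only: (batch, time, channels); the 80-channel width is an assumption.
x = torch.randn(1, 1694, 80)     # noisy sample y at mel length 1694
cond = torch.randn(1, 1694, 80)  # conditioning features, same mel length
mu = torch.randn(1, 1404, 80)    # encoder output, 290 frames shorter

# torch.cat along dim=-1 (dimension 2) requires dimensions 0 and 1 to match,
# so the 1694-vs-1404 time mismatch raises the same RuntimeError as above.
torch.cat([x, cond, mu], dim=-1)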
Changes Made and Training Sequence
- llm training was running but stopped multiple times.
- I resumed llm training using --checkpoint path/to/epoch_{epoch}_whole.pt, where epoch_{epoch}_whole.pt indicates the last saved checkpoint of llm training. Example: --checkpoint /mnt/.../epoch_199_whole.pt
- Training was initially started with max_frames_in_batch = 2000. Later I changed this to max_frames_in_batch = 84000 and then resumed llm training from checkpoint epoch_199_whole.pt (the sketch after this list illustrates what this setting controls).
- After llm training, flow training was started, and it is here that the error above occurs.
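For context on the max_frames_in_batch change: the name suggests frame-budget dynamic batching, i.e. utterances are packed into one batch until their total frame count would exceed the budget, so raising it from 2000 to 84000 drastically increases the effective batch size. A minimal sketch of that logic, as my own illustration rather than the actual CosyVoice dataloader:

def dynamic_batches(utterances, max_frames_in_batch):
    """Group (num_frames, utt_id) pairs so each batch stays within the frame budget."""
    batch, frames = [], 0
    for num_frames, utt in utterances:
        if batch and frames + num_frames > max_frames_in_batch:
            yield batch
            batch, frames = [], 0
        batch.append(utt)
        frames += num_frames
    if batch:
        yield batch

# Raising the budget packs far more utterances per batch:
utts = [(700, "utt_a"), (900, "utt_b"), (500, "utt_c")]
print(list(dynamic_batches(utts, 2000)))  # [['utt_a', 'utt_b'], ['utt_c']]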
Modification in run.sh (Stage 5)
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
echo "Run train. We only support llm traning for now"
if [ $train_engine == 'deepspeed' ]; then
echo "Notice deepspeed has its own optimizer config. Modify conf/ds_stage2.json if necessary"
fi
cat data/{train-clean-100,train-clean-360,train-other-500}/parquet/data.list > data/train.data.list
cat data/{dev-clean,dev-other}/parquet/data.list > data/dev.data.list
#for model in llm flow hifigan; do
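# llm training, resumed from the last saved checkpoint epoch_199_whole.pt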
torchrun --nnodes=1 --nproc_per_node=$num_gpus \
--rdzv_id=$job_id --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
../../../cosyvoice/bin/train.py \
--train_engine $train_engine \
--config conf/cosyvoice3.yaml \
--train_data data/train.data.list \
--cv_data data/dev.data.list \
--qwen_pretrain_path $pretrained_model_dir/CosyVoice-BlankEN \
--onnx_path $pretrained_model_dir \
--model llm \
--checkpoint /mnt/data0/Sougata/TTS/CosyVoice_l/CosyVoice/examples/libritts/cosyvoice3/exp/cosyvoice3/llm/torch_ddp/epoch_199_whole.pt \
--model_dir `pwd`/exp/cosyvoice3/llm/$train_engine \
--tensorboard_dir `pwd`/tensorboard/cosyvoice3/llm/$train_engine \
--ddp.dist_backend $dist_backend \
--num_workers ${num_workers} \
--prefetch ${prefetch} \
--pin_memory \
--use_amp \
--deepspeed_config ./conf/ds_stage2.json \
--deepspeed.save_states model+optimizer
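# flow training, initialised from the pretrained flow.pt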
torchrun --nnodes=1 --nproc_per_node=$num_gpus \
--rdzv_id=$job_id --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
../../../cosyvoice/bin/train.py \
--train_engine $train_engine \
--config conf/cosyvoice3.yaml \
--train_data data/train.data.list \
--cv_data data/dev.data.list \
--qwen_pretrain_path $pretrained_model_dir/CosyVoice-BlankEN \
--onnx_path $pretrained_model_dir \
--model flow \
--checkpoint $pretrained_model_dir/flow.pt \
--model_dir `pwd`/exp/cosyvoice3/flow/$train_engine \
--tensorboard_dir `pwd`/tensorboard/cosyvoice3/flow/$train_engine \
--ddp.dist_backend $dist_backend \
--num_workers ${num_workers} \
--prefetch ${prefetch} \
--pin_memory \
--use_amp \
--deepspeed_config ./conf/ds_stage2.json \
--deepspeed.save_states model+optimizer
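# hifigan training, initialised from the pretrained hifigan.pt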
torchrun --nnodes=1 --nproc_per_node=$num_gpus \
--rdzv_id=$job_id --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
../../../cosyvoice/bin/train.py \
--train_engine $train_engine \
--config conf/cosyvoice3.yaml \
--train_data data/train.data.list \
--cv_data data/dev.data.list \
--qwen_pretrain_path $pretrained_model_dir/CosyVoice-BlankEN \
--onnx_path $pretrained_model_dir \
--model hifigan \
--checkpoint $pretrained_model_dir/hifigan.pt \
--model_dir `pwd`/exp/cosyvoice3/hifigan/$train_engine \
--tensorboard_dir `pwd`/tensorboard/cosyvoice3/hifigan/$train_engine \
--ddp.dist_backend $dist_backend \
--num_workers ${num_workers} \
--prefetch ${prefetch} \
--pin_memory \
--use_amp \
--deepspeed_config ./conf/ds_stage2.json \
--deepspeed.save_states model+optimizer
#done
fi
The modified bash script used for running the training is attached as run_custom.sh.
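While debugging, it may help to check whether the token-rate and mel-rate lengths of a sample line up, since 1404 = 702 x 2: if the flow encoder upsamples speech tokens by a token_mel_ratio of 2 (an assumption based on the CosyVoice 2 config; check conf/cosyvoice3.yaml), the mu side would correspond to 702 tokens while the mel side has 1694 frames. A hypothetical check along these lines, with illustrative values rather than code from the repo:

# Hypothetical diagnostic, not part of the repo: compare the upsampled token
# length against the mel length for one sample.
token_mel_ratio = 2        # assumption; see conf/cosyvoice3.yaml
speech_token_len = 702     # illustrative: token length of one sample
speech_feat_len = 1694     # illustrative: mel frames of the same sample

mu_len = speech_token_len * token_mel_ratio
print(f"mu: {mu_len} frames, mel: {speech_feat_len} frames, "
      f"diff: {speech_feat_len - mu_len}")  # a nonzero diff matches the error above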
Kindly help in resolving this error.