#2782 broke tensor parallelism (TP); the previous commit was fine. The error messages look like:
[rank2]: File "/home/songhappy/git/torchtune/recipes/full_finetune_distributed.py", line 956, in train
[rank2]: current_loss.backward()
[rank2]: File "/home/songhappy/miniforge3/envs/guoqiong-pt-2/lib/python3.10/site-packages/torch/_tensor.py", line 648, in backward
[rank2]: torch.autograd.backward(
[rank2]: File "/home/songhappy/miniforge3/envs/guoqiong-pt-2/lib/python3.10/site-packages/torch/autograd/__init__.py", line 353, in backward
[rank2]: _engine_run_backward(
[rank2]: File "/home/songhappy/miniforge3/envs/guoqiong-pt-2/lib/python3.10/site-packages/torch/autograd/graph.py", line 824, in _engine_run_backward
[rank2]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank2]: RuntimeError: Function SliceBackward0 returned an invalid gradient at index 0 - got [753, 4096] but expected shape compatible with [763, 4096]
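For context, a minimal sketch of the engine check that fails here (not the torchtune code path; the BadSlice Function below is hypothetical and only reproduces the same class of failure with the shapes from the traceback): autograd raises "returned an invalid gradient" when a backward node hands back a gradient whose shape does not match the metadata of that node's input.

import torch

# Hypothetical toy Function: its backward returns the incoming gradient
# unchanged instead of mapping it back to the input's shape, which is the
# same class of mismatch the engine reports above.
class BadSlice(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.in_shape = x.shape          # [763, 4096]
        return x[:753].clone()          # output is [753, 4096]

    @staticmethod
    def backward(ctx, grad_out):
        # Correct code would scatter grad_out into a zeros tensor of
        # ctx.in_shape; returning it as-is trips the shape validation.
        return grad_out

x = torch.randn(763, 4096, requires_grad=True)
BadSlice.apply(x).sum().backward()
# RuntimeError: Function BadSliceBackward returned an invalid gradient at
# index 0 - got [753, 4096] but expected shape compatible with [763, 4096]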
Steps to reproduce:
git checkout 9983bbc1dd137f17f951d080f5638050ced432af
tune run --nproc_per_node 4 full_finetune_distributed --config $YOUR_CONFIG \
seed=123 dataset.packed=True log_level=DEBUG \
tokenizer.path=$MODEL_PATH/original/tokenizer.model \
checkpointer.checkpoint_dir=$MODEL_PATH \
tensor_parallel_dim=2 \
output_dir=$OUTPUT_PATH/llama3_1/8b_full_tp2 \
max_steps_per_epoch=$max_steps_per_epoch tokenizer.max_seq_len=512 batch_size=2
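To confirm the regression, the same command can presumably be rerun after git checkout 9983bbc1dd137f17f951d080f5638050ced432af~1 (the parent of the commit above), which is the last-good state mentioned earlier.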