TP broken #2880

https://github.com/pytorch/torchtune/pull/2782 broke tensor parallelism (TP); the commit just before it works fine. The backward pass now fails with an error like this:

```
[rank2]:   File "/home/songhappy/git/torchtune/recipes/full_finetune_distributed.py", line 956, in train
[rank2]:     current_loss.backward()
[rank2]:   File "/home/songhappy/miniforge3/envs/guoqiong-pt-2/lib/python3.10/site-packages/torch/_tensor.py", line 648, in backward
[rank2]:     torch.autograd.backward(
[rank2]:   File "/home/songhappy/miniforge3/envs/guoqiong-pt-2/lib/python3.10/site-packages/torch/autograd/__init__.py", line 353, in backward
[rank2]:     _engine_run_backward(
[rank2]:   File "/home/songhappy/miniforge3/envs/guoqiong-pt-2/lib/python3.10/site-packages/torch/autograd/graph.py", line 824, in _engine_run_backward
[rank2]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank2]: RuntimeError: Function SliceBackward0 returned an invalid gradient at index 0 - got [753, 4096] but expected shape compatible with [763, 4096] 
```
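For anyone reading the message above: autograd checks that the gradient a backward node returns has the same shape as that node's input. The snippet below is a minimal single-process sketch of that invariant for a slice op. It does not reproduce the TP failure, and the slice `x[:753]` is a hypothetical stand-in; only the two shapes are taken from the error message.

```python
import torch

# Shapes borrowed from the error message; the slice itself is an assumption
# made for illustration and is not taken from the torchtune code.
x = torch.randn(763, 4096, requires_grad=True)
y = x[:753]  # slicing records a slice node in the autograd graph

y.sum().backward()

# The gradient that flows back to x must match x's shape, not y's.
print(type(y.grad_fn).__name__)
print(tuple(y.shape), tuple(x.grad.shape))
```

In the failing run, the gradient coming out of `SliceBackward0` had the sliced length (753 rows) where the length of the slice's input (763 rows) was expected.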

Steps to reproduce:

```bash
git checkout 9983bbc1dd137f17f951d080f5638050ced432af
tune run --nproc_per_node 4 full_finetune_distributed --config $YOUR_CONFIG \
    seed=123 dataset.packed=True log_level=DEBUG \
    tokenizer.path=$MODEL_PATH/original/tokenizer.model \
    checkpointer.checkpoint_dir=$MODEL_PATH \
    tensor_parallel_dim=2 \
    output_dir=$OUTPUT_PATH/llama3_1/8b_full_tp2 \
    max_steps_per_epoch=$max_steps_per_epoch tokenizer.max_seq_len=512 batch_size=2
```
