RuntimeError in _group_tensors_by_device_and_dtype
(torch/optim/optimizer.py) when training with FSDP on N>1 GPUs.
#34730
System Info

transformers version: 4.46.2

Who can help?
@muellerzr @sunm
Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Output Error

Original Code

Command
Command that triggers the error (considering the previous code is in a file called bug.py)

Expected behavior
I'm trying to fine-tune a model using the Trainer class. I am using torchrun with FSDP to distribute the training over multiple GPUs. If I run the provided code with a single process, it works fine. However, if I increase nproc_per_node, I get the error provided with the example.

This error first seemed to be a PyTorch error, for which I created an issue here. However, as pointed out by @JoyceZhangSS and @jiaqiw09, it is an issue with transformers version 4.46.2: 4.46.1 does not have this bug, and training happens as expected.
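For context, a minimal sketch of this kind of setup (this is not the original bug.py attached to the issue; the model name, dummy data, and FSDP options below are placeholders):

```python
# Minimal sketch of a Trainer + FSDP fine-tuning script.
# Placeholder model and data only; NOT the original bug.py from this issue.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "facebook/opt-125m"  # assumption: the report says both Llama and OPT models reproduce it
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Dummy dataset: a few short padded examples, labels identical to input_ids.
enc = tokenizer(["hello world"] * 64, padding="max_length", max_length=32, truncation=True)
enc["labels"] = enc["input_ids"].copy()
train_ds = Dataset.from_dict(dict(enc))

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    report_to="none",
    fsdp="full_shard auto_wrap",  # FSDP enabled through the Trainer
    fsdp_config={"transformer_layer_cls_to_wrap": ["OPTDecoderLayer"]},
)

Trainer(model=model, args=args, train_dataset=train_ds).train()
```

Launched, per the description above, with something like `torchrun --nproc_per_node=2 bug.py`; with a single process it runs fine, with more than one it hits the RuntimeError from the title.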
I reproduced the error in a standalone file with a dummy dataset, provided in this issue. However, it occurs with any dataset and with the standard loss: the default Alpaca training code leads to the same error, with both Llama and OPT models. I did some investigation into the issue that might be helpful:
The error is raised from the optimizer step at torch/optim/adamw.py:480. With transformers==4.46.1, all groups are float32 at that point; under 4.46.2 that no longer seems to be the case, which would explain why _group_tensors_by_device_and_dtype fails.
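A quick way to check this (a debugging sketch I used, not code from torch or transformers; the helper name is made up):

```python
# Debugging sketch: print the (device, dtype) combinations the optimizer will
# step over. Call it right before optimizer.step(), e.g. from a breakpoint
# near torch/optim/adamw.py:480 or from a TrainerCallback.
def summarize_param_groups(optimizer):
    for i, group in enumerate(optimizer.param_groups):
        seen = set()
        for p in group["params"]:
            seen.add(("param", str(p.device), str(p.dtype)))
            if p.grad is not None:
                seen.add(("grad", str(p.grad.device), str(p.grad.dtype)))
            for name, t in optimizer.state.get(p, {}).items():
                if hasattr(t, "dtype"):
                    seen.add((name, str(t.device), str(t.dtype)))
        print(f"group {i}: {sorted(seen)}")
```

With 4.46.1 I would expect a single float32 entry per kind of tensor, matching the observation above.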
In trainer.py, you made the following change. This context seems responsible for syncing the gradients across devices, so I tried reverting the change, and the error stops happening. I don't know enough about this to understand what the context does precisely, or why you no longer want to rely on it, but removing it seems to be what broke the code.
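I don't know the exact context that was removed, but for readers like me, the general pattern of a gradient-sync context in Accelerate-based training looks roughly like this (a generic, self-contained sketch under gradient accumulation; not the actual trainer.py code or the 4.46.2 diff):

```python
# Generic sketch of an Accelerate no_sync gradient-accumulation loop.
# Illustrative only; NOT the change that was made in transformers' trainer.py.
import contextlib
import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loader = torch.utils.data.DataLoader(torch.randn(64, 8), batch_size=8)
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

grad_accum_steps = 4
for step, batch in enumerate(loader):
    sync_now = (step + 1) % grad_accum_steps == 0
    # Inside no_sync, each rank accumulates gradients locally and skips the
    # cross-device reduction; gradients are only synced on the micro-step
    # where the context is NOT used (the last one of the accumulation window).
    ctx = contextlib.nullcontext() if sync_now else accelerator.no_sync(model)
    with ctx:
        loss = model(batch).pow(2).mean() / grad_accum_steps
        accelerator.backward(loss)
    if sync_now:
        optimizer.step()
        optimizer.zero_grad()
```

Under DDP/FSDP, exactly where that context wraps the backward pass determines when gradients get reduced across ranks, which is why removing or moving it can change what the optimizer sees at step time.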