[P0] Initiate the support of FSDP training (#205) #204

Merged
merged 9 commits into from
Feb 3, 2025

frankaging
Collaborator

@frankaging frankaging commented Feb 2, 2025

Description

The base intervenable class does not support DDP or FSDP. One reason is that the interventions are stored in a Python-native dict rather than a torch-native container, so the DDP/FSDP wrappers are unaware of the interventions' parameters.
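
The registration problem described above can be illustrated with a minimal sketch. It is not pyvene's actual code: the class names `DictHolder` and `ModuleDictHolder` and the key `layer_0` are hypothetical, and `torch.nn.ModuleDict` is assumed here as one possible torch-native container. The point is only that modules kept in a plain dict never show up in `parameters()` or `state_dict()`, which is what wrappers such as DDP and FSDP walk.

```python
import torch
from torch import nn


class DictHolder(nn.Module):
    """Hypothetical holder that keeps sub-modules in a plain Python dict."""

    def __init__(self):
        super().__init__()
        # nn.Module only registers attributes that are Modules or Parameters;
        # a plain dict is stored as an ordinary attribute and is not tracked.
        self.interventions = {"layer_0": nn.Linear(4, 4)}


class ModuleDictHolder(nn.Module):
    """Hypothetical holder that keeps sub-modules in an nn.ModuleDict."""

    def __init__(self):
        super().__init__()
        # nn.ModuleDict is itself a Module, so its entries are registered.
        self.interventions = nn.ModuleDict({"layer_0": nn.Linear(4, 4)})


plain = DictHolder()
registered = ModuleDictHolder()

# Count the parameters each holder exposes to optimizers and wrappers.
n_plain = sum(p.numel() for p in plain.parameters())
n_registered = sum(p.numel() for p in registered.parameters())

print("plain dict:", n_plain, "parameters,", len(plain.state_dict()), "state entries")
print("ModuleDict:", n_registered, "parameters,", sorted(registered.state_dict()))
```

One design caveat if a container like this is used: `nn.ModuleDict` keys cannot contain a `.` character, so keys derived from dotted component names would need to be sanitized first.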

This is the first change toward implementing and validating that the pyvene backbone is compatible with DDP and FSDP. It is crucial for the pyreft library to work with larger models, especially as the number of inference-time tokens is scaled up.

Testing Done

The first step is to pass the current tests.

Checklist:

  • My PR title strictly follows the format: [Your Priority] Your Title
  • I have attached the testing log above
  • I provide enough comments to my code
  • I have changed the documentation
  • I have added tests for my changes

@frankaging frankaging self-assigned this Feb 2, 2025
@frankaging frankaging added the enhancement New feature or request label Feb 2, 2025
@frankaging frankaging changed the title [P0] Initiate the support of FSDP training [P0] Initiate the support of FSDP training (#205) Feb 2, 2025
@frankaging
Collaborator Author

parent ticket: #205

@frankaging
Collaborator Author

DDP

/u/nlp/anaconda/main/anaconda3/envs/wuzhengx-310/lib/python3.10/site-packages/transformers/data/data_collator.py:656: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:274.)
  batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64)
{'loss': 1.3422, 'grad_norm': 0.0, 'learning_rate': 2e-05, 'epoch': 1.0}                                                                                                                                            
{'loss': 0.227, 'grad_norm': 0.0, 'learning_rate': 0.0, 'epoch': 1.14}                                                                                                                                              
{'train_runtime': 52.0409, 'train_samples_per_second': 3.843, 'train_steps_per_second': 0.038, 'train_loss': 0.7845921963453293, 'epoch': 1.14}                                                                     
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:17<00:00,  8.54s/it]
Saving

@frankaging
Collaborator Author

FSDP

forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
{'loss': 1.5969, 'grad_norm': 0.022196302190423012, 'learning_rate': 2e-05, 'epoch': 0.62}                                                                                           
{'loss': 1.5111, 'grad_norm': 0.02029985934495926, 'learning_rate': 1e-05, 'epoch': 1.23}                                                                                            
{'loss': 1.4569, 'grad_norm': 0.01986486278474331, 'learning_rate': 0.0, 'epoch': 1.85}                                                                                              
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00,  1.02it/s]Directory './test_multi_gpu/checkpoint-3/intervenable_model' created successfully.
{'train_runtime': 23.1573, 'train_samples_per_second': 12.955, 'train_steps_per_second': 0.13, 'train_loss': 1.5216662883758545, 'epoch': 1.85}                                      
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00,  1.11s/it]
Directory './test_multi_gpu//intervenable_model' created successfully.

@frankaging frankaging merged commit 781cd02 into main Feb 3, 2025
3 checks passed
Labels
enhancement New feature or request
2 participants