[P0] Initiate the support of FSDP training (#205) #204

frankaging · 2025-02-02T20:58:35Z

Description

The base intervenable class does not support DDP or FSDP. One reason is that the interventions are stored in a Python-native dict rather than a torch-native dict, so the data-parallel wrappers are unaware of the interventions' parameters.

This is the first change to implement and validate that the pyvene backbone is compatible with DDP and FSDP. It is crucial for the pyreft library to work with larger models, especially as the number of inference-time tokens is scaled up.
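To illustrate the registration issue described above, here is a minimal sketch in plain PyTorch. It does not use pyvene's actual classes: `PlainDictHolder`, `ModuleDictHolder`, and the key `layer_0` are hypothetical names chosen for the example. It only shows that submodules kept in a Python dict are invisible to `parameters()` and `state_dict()`, which is what wrappers such as DDP and FSDP rely on, while the same submodules in a `torch.nn.ModuleDict` are registered.

```python
import torch.nn as nn


class PlainDictHolder(nn.Module):
    """Hypothetical holder: interventions live in a Python-native dict."""

    def __init__(self):
        super().__init__()
        # A plain dict is stored as an ordinary attribute, so the Linear
        # layer inside it is never registered as a submodule.
        self.interventions = {"layer_0": nn.Linear(4, 4)}


class ModuleDictHolder(nn.Module):
    """Hypothetical holder: interventions live in a torch.nn.ModuleDict."""

    def __init__(self):
        super().__init__()
        # ModuleDict registers every value as a submodule.
        # Note: ModuleDict keys must be strings and cannot contain ".".
        self.interventions = nn.ModuleDict({"layer_0": nn.Linear(4, 4)})


plain = PlainDictHolder()
registered = ModuleDictHolder()

# Count the parameters each holder exposes to wrappers and optimizers.
plain_count = sum(p.numel() for p in plain.parameters())
registered_count = sum(p.numel() for p in registered.parameters())
registered_keys = sorted(registered.state_dict().keys())

print(plain_count)       # parameters visible through the plain dict
print(registered_count)  # parameters visible through the ModuleDict
print(registered_keys)   # state_dict entries contributed by the ModuleDict
```

With the plain dict, an optimizer built from `model.parameters()` would also miss the intervention weights, so the same registration matters outside of distributed training.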

Testing Done

The first step is to pass the existing tests.

Checklist:

My PR title strictly follows the format: [Your Priority] Your Title
I have attached the testing log above
I have provided enough comments in my code
I have updated the documentation
I have added tests for my changes

frankaging · 2025-02-02T21:04:13Z

parent ticket: #205

frankaging · 2025-02-02T22:26:16Z

DDP

/u/nlp/anaconda/main/anaconda3/envs/wuzhengx-310/lib/python3.10/site-packages/transformers/data/data_collator.py:656: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:274.)
  batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64)
{'loss': 1.3422, 'grad_norm': 0.0, 'learning_rate': 2e-05, 'epoch': 1.0}                                                                                                                                            
{'loss': 0.227, 'grad_norm': 0.0, 'learning_rate': 0.0, 'epoch': 1.14}                                                                                                                                              
{'train_runtime': 52.0409, 'train_samples_per_second': 3.843, 'train_steps_per_second': 0.038, 'train_loss': 0.7845921963453293, 'epoch': 1.14}                                                                     
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:17<00:00,  8.54s/it]
Saving

frankaging · 2025-02-03T08:33:06Z

FSDP

forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
{'loss': 1.5969, 'grad_norm': 0.022196302190423012, 'learning_rate': 2e-05, 'epoch': 0.62}                                                                                           
{'loss': 1.5111, 'grad_norm': 0.02029985934495926, 'learning_rate': 1e-05, 'epoch': 1.23}                                                                                            
{'loss': 1.4569, 'grad_norm': 0.01986486278474331, 'learning_rate': 0.0, 'epoch': 1.85}                                                                                              
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00,  1.02it/s]Directory './test_multi_gpu/checkpoint-3/intervenable_model' created successfully.
{'train_runtime': 23.1573, 'train_samples_per_second': 12.955, 'train_steps_per_second': 0.13, 'train_loss': 1.5216662883758545, 'epoch': 1.85}                                      
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00,  1.11s/it]
Directory './test_multi_gpu//intervenable_model' created successfully.

[P0] Initiate the support of FSDP training

0d90a44

frankaging requested review from aryamanarora and ZhengPeterWang February 2, 2025 20:58

frankaging self-assigned this Feb 2, 2025

frankaging added the enhancement label Feb 2, 2025

frankaging changed the title ~~[P0] Initiate the support of FSDP training~~ [P0] Initiate the support of FSDP training (#205) Feb 2, 2025

frankaging added 3 commits February 2, 2025 13:06

minor fix

f3c01fe

fix tutorial

8895ed5

update all tutorials (no testing)

b2aa098

ZhengPeterWang approved these changes Feb 2, 2025

frankaging added 3 commits February 2, 2025 17:20

update the dict to torch module dict

8aa6d35

potential breaking change: naming convention change

7610e3c

fix key

820553f

frankaging added 2 commits February 3, 2025 00:36

finish up

bc07940

update tutorials (no test)

262ed44

frankaging mentioned this pull request Feb 3, 2025

[P0] Enabling FSDP with pyvene backbone change stanfordnlp/pyreft#152

Merged

frankaging merged commit 781cd02 into main Feb 3, 2025
3 checks passed
