Description & Motivation
pytorch/pytorch#104810 adds the recommendation that the save APIs should be called on a single node (shard_group).
pytorch/pytorch#102904 (comment) also discusses this.
Our logic doesn't do this and runs the save code on all ranks.
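For illustration, here is a minimal sketch (not Lightning's actual implementation) of restricting the save to a single shard group. It assumes ShardingStrategy.HYBRID_SHARD with one shard group per node, that torchrun sets LOCAL_WORLD_SIZE, and that state_dict is already an FSDP sharded state dict; the helper name is hypothetical.

```python
import os

import torch.distributed as dist
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import FileSystemWriter


def save_on_first_shard_group(state_dict: dict, path: str) -> None:
    # Hypothetical helper. With HYBRID_SHARD, each node holds a complete
    # replica sharded across its local ranks, so ranks 0..LOCAL_WORLD_SIZE-1
    # form one full shard group (assumption: one shard group per node).
    shard_group_size = int(os.environ["LOCAL_WORLD_SIZE"])

    # new_group() is collective: every rank must call it, even ranks that
    # will not participate in the save.
    save_group = dist.new_group(ranks=list(range(shard_group_size)))

    if dist.get_rank() < shard_group_size:
        # Only the first shard group writes. The remaining replicas hold
        # identical shards, so they skip the save entirely.
        dcp.save_state_dict(
            state_dict=state_dict,
            storage_writer=FileSystemWriter(path),
            process_group=save_group,
        )
```

The key point is that dcp.save_state_dict is a collective over the process group it is given, so passing the shard group (rather than the default world group) is what would let the other replicas skip the call.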
Additional context
Lit-gpt uses hybrid sharding in pretrain/tinyllama.py but full checkpointing. I believe this feature request is only relevant for sharded checkpointing. @awaelchli Did you try it? Does sharded hybrid checkpointing work?