
FSDP hybrid shard should checkpoint in a single node #19494

@carmocca

Description & Motivation

pytorch/pytorch#104810 adds the recommendation that the save APIs should only be called on a single node (shard_group).

pytorch/pytorch#102904 (comment) also discusses this.

Our logic doesn't do this and runs this code on all ranks.
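
Roughly, the recommendation amounts to something like the sketch below. This is not our current implementation; it assumes the intra-node shard group and inter-node replicate group used for hybrid sharding are available, and that one shard group corresponds to one node:

```python
# Sketch only: `shard_group` (intra-node) and `replicate_group` (inter-node) are
# assumed to be the process groups used for HYBRID_SHARD.
import torch.distributed as dist
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import FileSystemWriter
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType


def save_hybrid_sharded(
    model: FSDP, path: str, shard_group: dist.ProcessGroup, replicate_group: dist.ProcessGroup
) -> None:
    # Each node holds an identical copy of the shards, so only one node needs to write.
    with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
        state_dict = {"model": model.state_dict()}
    if dist.get_rank(replicate_group) == 0:
        # Only the ranks of the first replica (e.g. node 0) save, coordinating within their shard group.
        dcp.save(state_dict, storage_writer=FileSystemWriter(path), process_group=shard_group)
    dist.barrier()  # keep the non-saving nodes in sync before training continues
```

With the guard in place, one copy of each shard gets written instead of every node writing identical data.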

Additional context

Lit-GPT uses hybrid sharding in pretrain/tinyllama.py but with full checkpointing. I believe this feature request is only relevant for sharded checkpointing. @awaelchli Did you try it? Does sharded hybrid checkpointing work?
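
For reference, a rough sketch of the two configurations being contrasted, expressed with Lightning's FSDPStrategy flags (the exact Fabric setup in pretrain/tinyllama.py may differ):

```python
from lightning.fabric import Fabric
from lightning.fabric.strategies import FSDPStrategy

# What pretrain/tinyllama.py does today: hybrid sharding with a consolidated
# ("full") checkpoint; that path is not affected by this request.
full_ckpt = FSDPStrategy(sharding_strategy="HYBRID_SHARD", state_dict_type="full")

# The case this request targets: hybrid sharding with sharded checkpoints, where
# saving should happen on a single shard group (node) only.
sharded_ckpt = FSDPStrategy(sharding_strategy="HYBRID_SHARD", state_dict_type="sharded")

fabric = Fabric(strategy=sharded_ckpt, devices=8, num_nodes=2)
```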

cc @Borda @awaelchli @carmocca

Labels

checkpointing (Related to checkpointing), feature (Is an improvement or enhancement), strategy: fsdp (Fully Sharded Data Parallel)
