FSDP2 root level parameter management #1091

@dingqingy

Hi,

I am curious about the design decision to manage both the token embeddings and the final output layer at the root FSDP level, instead of wrapping each of them separately the way the transformer blocks are wrapped.

For example, this coupled management seems to unshard the final output layer too early and reshard the token embeddings too late in forward.
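To make the concern concrete, here is a toy sketch in plain Python. It is not the actual FSDP2 implementation and calls no PyTorch APIs. It assumes that a group of parameters managed together stays unsharded from the first step at which any of its modules runs until the last such step, and it ignores prefetching. The module names (`tok_embeddings`, `layers.0`, `layers.1`, `norm`, `output`) and the helper `forward_lifetimes` are illustrative assumptions.

```python
# Toy model of forward-pass parameter lifetimes; NOT real FSDP2 code.
# Assumption: parameters managed as one group are unsharded right before the
# first module of the group runs and resharded right after the last one runs.

def forward_lifetimes(exec_order, groups):
    """Return {module: (unshard_step, reshard_step)} for one forward pass.

    exec_order: module names in the order they run in forward.
    groups: list of module-name lists; each inner list is managed together.
    """
    step = {name: i for i, name in enumerate(exec_order)}
    lifetimes = {}
    for group in groups:
        first = min(step[m] for m in group)
        last = max(step[m] for m in group)
        for m in group:
            lifetimes[m] = (first, last)
    return lifetimes


# Hypothetical two-block model, listed in forward execution order.
exec_order = ["tok_embeddings", "layers.0", "layers.1", "norm", "output"]

# Coupled: embeddings, final norm, and output layer share the root-level group.
coupled = forward_lifetimes(
    exec_order,
    [["tok_embeddings", "norm", "output"], ["layers.0"], ["layers.1"]],
)

# Separate: embeddings get their own group; norm and output share another.
separate = forward_lifetimes(
    exec_order,
    [["tok_embeddings"], ["layers.0"], ["layers.1"], ["norm", "output"]],
)

for name in exec_order:
    print(f"{name:15s} coupled={coupled[name]} separate={separate[name]}")
```

Under these assumptions, the coupled grouping keeps both the embeddings and the output layer unsharded for the whole forward (steps 0 to 4), while the separate grouping holds the embeddings only at step 0 and the output layer only at steps 3 to 4.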

Also, for the optimization (see here) that disables reshard_after_forward for the last transformer block, would it be more appropriate to apply this optimization to the final linear layer instead?

Thanks!
