FSDP2 root level parameter management

Hi,

I am curious about the design decision to manage both the token embeddings and the final output layer at the root FSDP level, rather than wrapping each of them separately like the transformer blocks.

Because the two are managed together, it seems that in forward the final output layer is unsharded too early and the token embeddings are resharded too late.

Also, regarding the optimization (see [here](https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama3/parallelize_llama.py#L369)) that disables `reshard_after_forward` for the last transformer block: would it be more appropriate to apply this optimization to the final linear layer instead?
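For illustration only, here is a minimal sketch of the alternative grouping described above, written as a plain wrapping plan rather than real FSDP calls. `build_wrap_plan` is a hypothetical helper, and the module names `tok_embeddings`, `layers.N`, and `output` are assumptions about the model layout. Each pair is meant to stand for one `fully_shard(module, reshard_after_forward=flag)` call; this is not the current torchtitan code.

```python
from typing import List, Tuple


def build_wrap_plan(n_layers: int) -> List[Tuple[str, bool]]:
    """Return (module name, reshard_after_forward) pairs in wrap order.

    Hypothetical helper: each pair stands for one
    fully_shard(module, reshard_after_forward=flag) call.
    """
    # Token embeddings get their own group, so they could be
    # resharded right after the embedding lookup in forward.
    plan = [("tok_embeddings", True)]

    # Every transformer block is wrapped on its own and reshards as usual.
    for i in range(n_layers):
        plan.append((f"layers.{i}", True))

    # The final output layer gets its own group. It runs last in forward
    # and first in backward, so this is the group left unsharded here.
    plan.append(("output", False))
    return plan


plan = build_wrap_plan(n_layers=4)
for name, reshard in plan:
    print(f"{name}: reshard_after_forward={reshard}")
```

In this plan the only group that skips resharding is the final output layer, instead of the last transformer block.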

Thanks!

FSDP2 root level parameter management #1091
