How to train sharded model using multi-host TPUs? #3318
Unanswered
Quasar-Kim asked this question in Q&A
Hello FLAX community,

I've been experimenting with the parallel training features in JAX/Flax. I was able to utilize a TPU v3-8 by annotating parameters with `nn.with_partitioning` and activations with `jax.lax.with_sharding_constraint`, as described in the parallel training guide.
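For reference, here is roughly the single-host setup I have working on the v3-8 (a stripped-down sketch; the `MLP` module, the 1x8 `data`/`model` mesh, and the sizes are just placeholders from my experiments):

```python
import jax
import jax.numpy as jnp
import flax.linen as nn
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P
from jax.experimental import mesh_utils

# 1x8 mesh over the 8 local devices of a single v3-8 host.
devices = mesh_utils.create_device_mesh((1, 8))
mesh = Mesh(devices, axis_names=('data', 'model'))


class MLP(nn.Module):
    hidden: int = 8192

    @nn.compact
    def __call__(self, x):
        # The parameter carries partitioning metadata via nn.with_partitioning.
        w1 = self.param(
            'w1',
            nn.with_partitioning(nn.initializers.lecun_normal(), (None, 'model')),
            (x.shape[-1], self.hidden),
        )
        y = x @ w1
        # The activation's sharding is pinned with a sharding constraint.
        y = jax.lax.with_sharding_constraint(
            y, NamedSharding(mesh, P('data', 'model')))
        return y


x = jnp.ones((16, 1024))
variables = MLP().init(jax.random.PRNGKey(0), x)
```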
Now I want to scale the model to multi-host TPUs (e.g. v3-32), but I wasn't able to find any guide or example covering this. So my question is: is passing shardings to `jit()` via its `in_shardings` and `out_shardings` parameters sufficient? I'm concerned that this could cause the parameters to be replicated, since each host might try to place the parameters on its own.