remove scatter_add in MoE implementation #1974
base: main
Conversation
tests/unit_tests/test_moe.py (Outdated)

```python
print(f"Speedup: {moe_old_time_ms/moe_time_ms=}")
# ...
if __name__ == "__main__":
```
On an H100 I'm getting these results for this script:
❯ python tests/unit_tests/test_moe.py
ACCURACY VS FFN: 10 iterations
mean_moe_old_rel_err.mean()=tensor(0.0097), mean_moe_old_rel_err.std()=tensor(4.2804e-05)
mean_moe_rel_err.mean()=tensor(0.0025), mean_moe_rel_err.std()=tensor(1.4838e-05)
mean_moe_old_rel_err.mean()/mean_moe_rel_err.mean()=tensor(3.8742)
TRITON BENCH: perf_seqlen=4096 perf_bsz=4 warmups=100 repeats=1000
moe_old_time_ms=19.720215759277345
moe_time_ms=19.050707193521355
Speedup: moe_old_time_ms/moe_time_ms=1.0351435019684556
MoEOld AND MoE CLOSE: score_before_experts=True
MoEOld AND MoE CLOSE: score_before_experts=False
wait where is this error coming from for the new implementation, if it's run-to-run deterministic (I'm assuming the feed forward one is also deterministic)?
By "err" do you mean the standard deviation across runs in the ACCURACY VS FFN: 10 iterations section? If so, that comes from me running ten iterations with a different random seed each time, just to collect better statistics than a single run; it isn't directly testing run-to-run determinism here.
Each run I computed the relative error between an FFN layer and the should-be-equivalent MoE layer and appended the result to the mean_moe_rel_err tensor. The printout above shows the mean and std across those ten runs.
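A rough sketch of what one such iteration looks like in spirit (the layers and shapes below are placeholders, not the actual test code):

```python
import torch

# Placeholder layers and shapes; the real test builds an FFN layer and a
# should-be-equivalent MoE layer. This only illustrates the error bookkeeping.
torch.manual_seed(0)
dim = 16
ffn = torch.nn.Linear(dim, dim)
moe = torch.nn.Linear(dim, dim)  # stand-in for the equivalent MoE layer
x = torch.randn(64, dim)

with torch.no_grad():
    ref, out = ffn(x), moe(x)
    rel_err = (out - ref).norm() / ref.norm()  # one entry appended to mean_moe_rel_err
```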
I added a determinism test locally:

```python
def test_determinism(self):
    torch.manual_seed(42)
    moe_old, moe = self._get_moe_old_and_moe_layers(score_before_experts=False)
    inputs = torch.randn(
        self.bsz,
        self.seqlen,
        self.dim,
        device=self.device,
        dtype=torch.bfloat16,
    )
    out_moe_old = []
    out_moe = []
    with torch.no_grad():
        for _ in range(100):
            out_moe_old.append(moe_old(inputs))
            out_moe.append(moe(inputs))
    out_moe_old = torch.stack(out_moe_old, dim=0)
    out_moe = torch.stack(out_moe, dim=0)
    print(f"{out_moe_old.std(dim=0).mean()=}")
    print(f"{out_moe.std(dim=0).mean()=}")
```

The printout shows the new impl is deterministic, while the old is not:
⬢ [podman] ❯ pytest -rsfP tests/llama3_moe/test_moe.py -k determinism
===================================================================================================== test session starts =====================================================================================================
platform linux -- Python 3.11.9, pytest-8.4.2, pluggy-1.6.0
rootdir: /app/torchtitan
configfile: pyproject.toml
plugins: cov-7.0.0, dtest-0.0.0, anyio-4.11.0, typeguard-4.4.4
collected 4 items / 3 deselected / 1 selected
tests/llama3_moe/test_moe.py . [100%]
=========================================================================================================== PASSES ============================================================================================================
_________________________________________________________________________________________________ TestModel.test_determinism __________________________________________________________________________________________________
---------------------------------------------------------------------------------------------------- Captured stdout call -----------------------------------------------------------------------------------------------------
out_moe_old.std(dim=0).mean()=tensor(0.0659, device='cuda:0', dtype=torch.bfloat16)
out_moe.std(dim=0).mean()=tensor(0., device='cuda:0', dtype=torch.bfloat16)
=============================================================================================== 1 passed, 3 deselected in 8.99s ===============================================================================================
CC @rakkit who I recall discussed some MoE non-determinism findings in another issue/PR.
@garrett361 Actually I was thinking about doing this, too. Plan sounds good to me. Thanks a lot!

Cleaned it up and removed the testing code @tianyu-l, fyi.
```python
# and outputs in top_scores and routed_input, respectively.
top_scores = top_scores.flatten()
top_scores[token_indices_experts_sorted] = top_scores_experts_sorted
routed_input[token_indices_experts_sorted] = routed_output
```
> Also save some allocations: store unsorted scores and outputs in top_scores and routed_input, respectively.

Hmm, please educate me more here:
- From the Python level this might save memory, but wouldn't the routed_input activations be saved by the autograd engine anyways?
- Are you observing meaningful savings? If not, I'd rather we separate the concerns and focus on run-to-run determinism in this PR.
The comment is probably overly terse, and also maybe I'm just wrong?
The comment is in reference to an alternative implementation where we do something like:

```python
# new alloc from empty_like, avoided in the current code
routed_output_sorted = torch.empty_like(routed_output)
routed_output_sorted[token_indices_experts_sorted] = routed_output
```

or

```python
# new alloc from clone, avoided in the current code
routed_output[token_indices_experts_sorted] = routed_output.clone()
```

The clone is needed in the second case, because otherwise routed_output[token_indices_experts_sorted] = routed_output leads to

```
RuntimeError: unsupported operation: some elements of the input tensor and the written-to tensor refer to a single memory location. Please clone() the tensor before performing the operation.
```

which is what I initially tried.
So the point here is about avoiding a clone or empty_like or other similar call that would cause a new allocation.
For training, both routed_input and routed_output are surely still in the computational graph: this doesn't avoid that.
LMK if I'm wrong in any of the above, and whether you want code or comment changes or any profiling of the different options here.
@tianyu-l let me know if I can answer more questions here or elsewhere on the PR, thanks!
torchtitan/models/moe/moe.py (Outdated)

```python
top_scores = top_scores.flatten()
top_scores[token_indices_experts_sorted] = top_scores_experts_sorted
routed_input[token_indices_experts_sorted] = routed_output
routed_input = routed_input.reshape(bs * slen, -1, dim)
```
I'm worried that when EP > 1 and ETP = 1, routed_input no longer has length bs * slen * top_k, given the code here https://github.com/pytorch/torchtitan/blob/main/torchtitan/distributed/expert_parallel.py#L254
Assuming an even split, which is the only split supported now, maybe we can do reshape(-1, top_k, dim)? But in the long term, our goal is to make this part DTensor rather than a plain tensor so that model code doesn't need to worry too much about parallelism.
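A quick shape check with toy numbers (illustrative only) of why the explicit top_k reshape stays valid on a TP-sharded input:

```python
import torch

# Toy numbers only: with EP > 1 and ETP = 1, each TP rank holds
# bs * slen * top_k // tp_degree routed rows, so reshaping with an explicit
# top_k works on the shard without referencing bs * slen directly.
bs, slen, top_k, dim, tp_degree = 2, 8, 4, 16, 2
local_rows = bs * slen * top_k // tp_degree          # 32 rows on this rank
routed_input_shard = torch.randn(local_rows, dim)

reshaped = routed_input_shard.reshape(-1, top_k, dim)
print(reshaped.shape)  # torch.Size([8, 4, 16]) == (bs * slen // tp_degree, top_k, dim)
```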
I wasn't aware of this code path at all (and can't say I fully understand it, from a quick glance). But in any case, if we're ever doing some leading-dim splitting inside the MoE layer, then yes, it's definitely preferable to reshape with an explicit top_k rather than bs and slen factors.
I'm a little confused about the shapes everywhere in this path, though. Seems like:
- All of the reorderer's outputs are split on dim0.
- The above causes routed_input to be split on dim0.
- And also out_experts to be split on dim0?

If the last one is true, then I think I also have problems in the case where there is a shared expert, since it seems like (out + out_experts).reshape(bs, slen, dim) will fail on shapes. Is that right? I see that this step wouldn't fail on upstream due to the scatter_add, but I don't yet understand why it is correct.
tests/unit_tests/test_moe.py (Outdated)

```python
print(f"Speedup: {moe_old_time_ms/moe_time_ms=}")
# ...
if __name__ == "__main__":
```
> wait where is this error coming from for the new implementation, if it's run-to-run deterministic (I'm assuming the feed forward one is also deterministic)?

I responded to this, but it's not showing up in the main PR UI, probably because the file is outdated, so copying below:

By "err" do you mean the standard deviation across runs in the ACCURACY VS FFN: 10 iterations section? If so, that comes from me running ten iterations with a different random seed each time, just to collect better statistics than a single run; it isn't directly testing run-to-run determinism here. Each run I computed the relative error between an FFN layer and the should-be-equivalent MoE layer and appended the result to the mean_moe_rel_err tensor. The printout above shows the mean and std across those ten runs.

I added a determinism test locally:

```python
def test_determinism(self):
    torch.manual_seed(42)
    moe_old, moe = self._get_moe_old_and_moe_layers(score_before_experts=False)
    inputs = torch.randn(
        self.bsz,
        self.seqlen,
        self.dim,
        device=self.device,
        dtype=torch.bfloat16,
    )
    out_moe_old = []
    out_moe = []
    with torch.no_grad():
        for _ in range(100):
            out_moe_old.append(moe_old(inputs))
            out_moe.append(moe(inputs))
    out_moe_old = torch.stack(out_moe_old, dim=0)
    out_moe = torch.stack(out_moe, dim=0)
    print(f"{out_moe_old.std(dim=0).mean()=}")
    print(f"{out_moe.std(dim=0).mean()=}")
```

The printout shows the new impl is deterministic, while the old is not.
@garrett361 will get back to you as soon as I get time to look into this more...
Made a branch for ablating this bit of code here, with two alternative impls. The diff can be visualized in this PR.
The memory savings can be fairly significant for a DSv3 16B layer fwd + bwd pass: the cloning strategy uses ~3% more peak memory. The memory differences should primarily be from avoiding creating a new allocation. Cloning is also ~16% slower than the impl in this PR.
Let me know if there's a similarly memory-optimized strategy you can think of, though, or a way to make the current impl more clear; it could certainly be improved.
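For reference, a minimal sketch of how such numbers can be gathered (the layer, input, and warmup/repeat values are placeholders, not the actual ablation script):

```python
import torch
import triton.testing


def measure(layer: torch.nn.Module, inputs: torch.Tensor) -> tuple[float, float]:
    """Return (mean time in ms, peak memory in GiB) for a fwd + bwd step."""
    torch.cuda.reset_peak_memory_stats()

    def step() -> None:
        layer(inputs).sum().backward()

    time_ms = triton.testing.do_bench(step, warmup=100, rep=1000)
    peak_gib = torch.cuda.max_memory_allocated() / 2**30
    return time_ms, peak_gib
```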
tianyu-l left a comment
Please see if my comments make sense to you.
Also, please show in the PR summary that before vs. after gives the same / better numerical results, with and without TP & ETP=1 enabled.
```python
# Unsort scores and routed outputs. Also save some allocations: store unsorted scores
# and outputs in top_scores and routed_input, respectively.
top_scores = top_scores.flatten()
top_scores[token_indices_experts_sorted] = top_scores_experts_sorted
```
why do we need to modify the scores? IIUC, the scores themselves never changed; it's just the ordering that changed.
This is just undoing the shuffling (avoiding the scatter_add) and being memory efficient by saving the result in a tensor that already exists and that we don't need anymore.
Does the below version make sense to you?

```python
top_scores_experts_unsorted = torch.empty_like(top_scores_experts_sorted)
top_scores_experts_unsorted[token_indices_experts_sorted] = (
    top_scores_experts_sorted
)
routed_output_unsorted = torch.empty_like(routed_output)
routed_output_unsorted[token_indices_experts_sorted] = routed_output
routed_output_unsorted = routed_output_unsorted.reshape(
    -1, self.router.top_k, dim
)
top_scores_experts_unsorted = top_scores_experts_unsorted.reshape(
    -1, 1, self.router.top_k
)
out_experts = (
    torch.bmm(top_scores_experts_unsorted, routed_output_unsorted.float())
    .to(x.dtype)
    .squeeze(1)
)
```

The present version is equivalent, and more memory efficient.
In any case, I'll replace the current version with something more like what's above and what you suggested elsewhere, since we need changes for TP anyway and the present version is certainly confusion-prone.
My question was: can you directly use top_scores (with shape (bs * seq_len, top_k)) which is the output of self.router?
Oh, I see! Yeah we should be able to, good call.
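In that case the combine step could presumably look something like the following sketch (toy shapes and hypothetical variable names, not the final code):

```python
import torch

# Toy shapes; top_scores stands for the router output (already token-ordered) and
# routed_output_unsorted for the expert outputs after unsorting.
bs_slen, top_k, dim = 16, 2, 8
top_scores = torch.rand(bs_slen, top_k)
routed_output_unsorted = torch.randn(bs_slen * top_k, dim)

out_experts = torch.bmm(
    top_scores.reshape(-1, 1, top_k),                # (bs_slen, 1, top_k)
    routed_output_unsorted.reshape(-1, top_k, dim),  # (bs_slen, top_k, dim)
).squeeze(1)                                         # (bs_slen, dim)
```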
```python
    top_scores_experts_sorted,
    token_indices_experts_sorted,
    num_tokens_per_expert,
) = self.reorderer(top_scores, selected_experts_indices)
```
When TP is enabled and ETP=1, we'll have the following:

before: top_scores, selected_experts_indices both having shape (bs * seq_len, top_k)

after: top_scores_experts_sorted, token_indices_experts_sorted both having shape ((bs * seq_len // tp_degree) * top_k,)

Moreover, the indices in token_indices_experts_sorted refer to the original indices. So you can think of both as the no-TP case sharded along the TP dimension.

> If the last one is true, then I think I also have problems in the case where there is a shared expert since it seems like (out + out_experts).reshape(bs, slen, dim) will fail on shapes. Is that right? I see that this step wouldn't fail on upstream due to the scatter_add, but I don't yet understand why it is correct.

It is correct because scatter_add can use token_indices_experts_sorted as the index arg to add src to the correct places. Note that out still has bs * seq_len as its first dimension, but routed_output and routed_input both have (bs * seq_len // tp_degree) * top_k as their first dim.
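A toy illustration of that point (not the actual torchtitan code, and with the score weighting omitted): with global token indices, scatter_add writes a TP-sharded src into a full-size accumulator, so the shapes line up even though src only covers part of the routed rows.

```python
import torch

bs_slen, top_k, dim, tp_degree = 8, 2, 4, 2
shard_rows = bs_slen * top_k // tp_degree  # routed rows held by this TP rank

# Global token id for each routed row on this rank (toy values).
token_idx = torch.arange(bs_slen).repeat_interleave(top_k)[:shard_rows]
routed_output = torch.randn(shard_rows, dim)  # this rank's expert outputs

out_experts = torch.zeros(bs_slen, dim)  # full-size (bs * seq_len, dim) accumulator
out_experts.scatter_add_(0, token_idx.unsqueeze(-1).expand(-1, dim), routed_output)
```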
In order to make (out + out_experts).reshape(bs, slen, dim) work, we need out_experts to also have bs * seq_len as its first dim. That means it's impossible to just reuse routed_input when TP > 1 and ETP == 1. Concretely I would suggest (rough sketch below):
- make a new variable, e.g. routed_output_tokens_sorted, holding zeros of shape (bs, slen, top_k, dim)
- routed_output_tokens_sorted[token_indices_experts_sorted] = routed_output
- bmm between top_scores and token_indices_experts_sorted (after the necessary shape change on top_scores)
- finally do (out + out_experts).reshape(bs, slen, dim)
WDYT?
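Here is one reading of that suggestion as a self-contained sketch (toy shapes; the bmm is taken to be between top_scores and routed_output_tokens_sorted, and the zeros buffer is allocated flat rather than as (bs, slen, top_k, dim)):

```python
import torch

bs, slen, top_k, dim, tp_degree = 2, 8, 2, 16, 2
num_local = bs * slen * top_k // tp_degree  # routed rows on this TP rank

# Toy stand-ins: local expert outputs, their global (token, expert-slot) positions,
# the unsharded router scores, and the shared-expert output.
routed_output = torch.randn(num_local, dim)
token_indices_experts_sorted = torch.randperm(bs * slen * top_k)[:num_local]
top_scores = torch.rand(bs * slen, top_k)
out = torch.randn(bs * slen, dim)

# 1. zeros buffer covering the full (bs * slen * top_k, dim) extent
routed_output_tokens_sorted = torch.zeros(bs * slen * top_k, dim)
# 2. place each local expert output at its original position
routed_output_tokens_sorted[token_indices_experts_sorted] = routed_output
# 3. weight by scores: (bs*slen, 1, top_k) @ (bs*slen, top_k, dim) -> (bs*slen, 1, dim)
out_experts = torch.bmm(
    top_scores.unsqueeze(1),
    routed_output_tokens_sorted.reshape(bs * slen, top_k, dim),
).squeeze(1)
# 4. combine with the shared-expert output and restore (bs, slen, dim)
result = (out + out_experts).reshape(bs, slen, dim)
```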
Ok, I still need to understand this code path in a bit more detail, but I believe it makes sense.
Understanding check: is there a small typo here?
> Note that out still have bs * seq_len as first dimension, but routed_output and routed_input both have bs * seq_len as first dim. [...] That means it's impossible to just reuse routed_input, when TP > 1 and ETP==1
Should routed_{input,output} have bs * seq_len // tp_degree as their first dim? That's the obstruction?
it's a typo -- it should be (bs * seq_len // tp_degree) * top_k
PR for removing scatter_add in the MoE implementation. scatter_add is somewhat problematic as it is non-deterministic due to the necessity of atomic adds for correctness.
The current draft PR replaces the scatter_add_ with batched matmuls and has multiple correctness tests. This replacement has the effect of making the implementation ~3x more accurate, by some measures, and ~3% more performant per triton.testing.do_bench, presumably due to avoiding atomic adds. Though for yet-unknown reasons this doesn't seem to result in e2e speedups. These numbers are for a DSv3 16B-like config.
The current form is not suitable for merging (hence, draft status). Future plan before merging, if this is a direction we want:
- Remove the test_moe.py, README_DEV.md, and Makefile files.
- Remove the MoEOld class.