remove scatter_add in MoE implementation #1974
base: main
Conversation
tests/unit_tests/test_moe.py (Outdated)

```python
print(f"Speedup: {moe_old_time_ms/moe_time_ms=}")
# ...
if __name__ == "__main__":
```
On an H100 I'm getting these results for this script:
❯ python tests/unit_tests/test_moe.py
ACCURACY VS FFN: 10 iterations
mean_moe_old_rel_err.mean()=tensor(0.0097), mean_moe_old_rel_err.std()=tensor(4.2804e-05)
mean_moe_rel_err.mean()=tensor(0.0025), mean_moe_rel_err.std()=tensor(1.4838e-05)
mean_moe_old_rel_err.mean()/mean_moe_rel_err.mean()=tensor(3.8742)
TRITON BENCH: perf_seqlen=4096 perf_bsz=4 warmups=100 repeats=1000
moe_old_time_ms=19.720215759277345
moe_time_ms=19.050707193521355
Speedup: moe_old_time_ms/moe_time_ms=1.0351435019684556
MoEOld AND MoE CLOSE: score_before_experts=True
MoEOld AND MoE CLOSE: score_before_experts=False
wait where is this error coming from for the new implementation, if it's run-to-run deterministic (I'm assuming the feed forward one is also deterministic)?
By "err" do you mean the standard deviation across runs in the ACCURACY VS FFN: 10 iterations section? If so, that comes from me running ten iterations with a different random seed each time, just to collect better statistics than a single run; it isn't directly testing run-to-run determinism here.
Each run I computed the relative error between an FFN layer and the should-be-equivalent MoE layer and appended the result to the mean_moe_rel_err tensor. The printout above shows the mean and std across those ten runs.
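A rough sketch of what one such iteration looks like in spirit (the layers and shapes below are placeholders, not the actual test code):

```python
import torch

# Placeholder layers and shapes; the real test builds an FFN layer and a
# should-be-equivalent MoE layer. This only illustrates the error bookkeeping.
torch.manual_seed(0)
dim = 16
ffn = torch.nn.Linear(dim, dim)
moe = torch.nn.Linear(dim, dim)  # stand-in for the equivalent MoE layer
x = torch.randn(64, dim)

with torch.no_grad():
    ref, out = ffn(x), moe(x)
    rel_err = (out - ref).norm() / ref.norm()  # one entry appended to mean_moe_rel_err
```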
I added a determinism test locally:

```python
def test_determinism(self):
    torch.manual_seed(42)
    moe_old, moe = self._get_moe_old_and_moe_layers(score_before_experts=False)
    inputs = torch.randn(
        self.bsz,
        self.seqlen,
        self.dim,
        device=self.device,
        dtype=torch.bfloat16,
    )
    out_moe_old = []
    out_moe = []
    with torch.no_grad():
        for _ in range(100):
            out_moe_old.append(moe_old(inputs))
            out_moe.append(moe(inputs))
    out_moe_old = torch.stack(out_moe_old, dim=0)
    out_moe = torch.stack(out_moe, dim=0)
    print(f"{out_moe_old.std(dim=0).mean()=}")
    print(f"{out_moe.std(dim=0).mean()=}")
```

The printout shows the new impl is deterministic, while the old is not:
⬢ [podman] ❯ pytest -rsfP tests/llama3_moe/test_moe.py -k determinism
===================================================================================================== test session starts =====================================================================================================
platform linux -- Python 3.11.9, pytest-8.4.2, pluggy-1.6.0
rootdir: /app/torchtitan
configfile: pyproject.toml
plugins: cov-7.0.0, dtest-0.0.0, anyio-4.11.0, typeguard-4.4.4
collected 4 items / 3 deselected / 1 selected
tests/llama3_moe/test_moe.py . [100%]
=========================================================================================================== PASSES ============================================================================================================
_________________________________________________________________________________________________ TestModel.test_determinism __________________________________________________________________________________________________
---------------------------------------------------------------------------------------------------- Captured stdout call -----------------------------------------------------------------------------------------------------
out_moe_old.std(dim=0).mean()=tensor(0.0659, device='cuda:0', dtype=torch.bfloat16)
out_moe.std(dim=0).mean()=tensor(0., device='cuda:0', dtype=torch.bfloat16)
=============================================================================================== 1 passed, 3 deselected in 8.99s ===============================================================================================
CC @rakkit who I recall discussed some MoE non-determinism findings in another issue/PR.
@garrett361 Actually I was thinking about doing this, too. Plan sounds good to me. Thanks a lot!

Cleaned it up and removed the testing code @tianyu-l, fyi.
```python
# and outputs in top_scores and routed_input, respectively.
top_scores = top_scores.flatten()
top_scores[token_indices_experts_sorted] = top_scores_experts_sorted
routed_input[token_indices_experts_sorted] = routed_output
```
> Also save some allocations: store unsorted scores and outputs in top_scores and routed_input, respectively.

Hmm, please educate me more here:
- From the Python level this might save memory, but wouldn't the routed_input activations be saved by the autograd engine anyways?
- Are you observing meaningful savings? If not, I'd rather we separate the concerns and focus on run-to-run determinism in this PR.
The comment is probably overly terse, and also maybe I'm just wrong?
The comment is in reference to an alternative implementation where we do something like:

```python
# new alloc from empty_like, avoided in the current code
routed_output_sorted = torch.empty_like(routed_output)
routed_output_sorted[token_indices_experts_sorted] = routed_output
```

or

```python
# new alloc from clone, avoided in the current code
routed_output[token_indices_experts_sorted] = routed_output.clone()
```

The clone is needed in the second case, because otherwise routed_output[token_indices_experts_sorted] = routed_output leads to

```
RuntimeError: unsupported operation: some elements of the input tensor and the written-to tensor refer to a single memory location. Please clone() the tensor before performing the operation.
```

which is what I initially tried.
So the point here is about avoiding a clone or empty_like or other similar call that would cause a new allocation.
For training, both routed_input and routed_output are surely still in the computational graph: this doesn't avoid that.
LMK if I'm wrong in any of the above, and whether you want code or comment changes or any profiling of the different options here.
@tianyu-l let me know if I can answer more questions here or elsewhere on the PR, thanks!
torchtitan/models/moe/moe.py (Outdated)

```python
top_scores = top_scores.flatten()
top_scores[token_indices_experts_sorted] = top_scores_experts_sorted
routed_input[token_indices_experts_sorted] = routed_output
routed_input = routed_input.reshape(bs * slen, -1, dim)
```
I'm worried that when EP > 1 and ETP = 1, routed_input no longer has length bs * slen * top_k, given the code here https://github.com/pytorch/torchtitan/blob/main/torchtitan/distributed/expert_parallel.py#L254
Assuming an even split, which is the only split supported now, maybe we can do reshape(-1, top_k, dim)? But in the long term, our goal is to make this part DTensor rather than a plain tensor so that model code doesn't need to worry too much about parallelism.
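A quick shape check with toy numbers (illustrative only) of why the explicit top_k reshape stays valid on a TP-sharded input:

```python
import torch

# Toy numbers only: with EP > 1 and ETP = 1, each TP rank holds
# bs * slen * top_k // tp_degree routed rows, so reshaping with an explicit
# top_k works on the shard without referencing bs * slen directly.
bs, slen, top_k, dim, tp_degree = 2, 8, 4, 16, 2
local_rows = bs * slen * top_k // tp_degree          # 32 rows on this rank
routed_input_shard = torch.randn(local_rows, dim)

reshaped = routed_input_shard.reshape(-1, top_k, dim)
print(reshaped.shape)  # torch.Size([8, 4, 16]) == (bs * slen // tp_degree, top_k, dim)
```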
I wasn't aware of this code path at all (and can't say I fully understand it, from a quick glance). But in any case, if we're ever doing some leading-dim splitting inside the MoE layer, then yes, it's definitely preferable to reshape with an explicit top_k rather than bs and slen factors.
I'm a little confused about the shapes everywhere in this path, though. Seems like:
- All of the reorderer's outputs are split on dim0.
- The above causes routed_input to be split on dim0.
- And also out_experts to be split on dim0?

If the last one is true, then I think I also have problems in the case where there is a shared expert, since it seems like (out + out_experts).reshape(bs, slen, dim) will fail on shapes. Is that right? I see that this step wouldn't fail on upstream due to the scatter_add, but I don't yet understand why it is correct.
tests/unit_tests/test_moe.py (Outdated)

```python
print(f"Speedup: {moe_old_time_ms/moe_time_ms=}")
# ...
if __name__ == "__main__":
```
> wait where is this error coming from for the new implementation, if it's run-to-run deterministic (I'm assuming the feed forward one is also deterministic)?

I responded to this, but it's not showing up in the main PR UI, probably because the file is outdated, so copying below:

By "err" do you mean the standard deviation across runs in the ACCURACY VS FFN: 10 iterations section? If so, that comes from me running ten iterations with a different random seed each time, just to collect better statistics than a single run; it isn't directly testing run-to-run determinism here. Each run I computed the relative error between an FFN layer and the should-be-equivalent MoE layer and appended the result to the mean_moe_rel_err tensor. The printout above shows the mean and std across those ten runs.

I added a determinism test locally:

```python
def test_determinism(self):
    torch.manual_seed(42)
    moe_old, moe = self._get_moe_old_and_moe_layers(score_before_experts=False)
    inputs = torch.randn(
        self.bsz,
        self.seqlen,
        self.dim,
        device=self.device,
        dtype=torch.bfloat16,
    )
    out_moe_old = []
    out_moe = []
    with torch.no_grad():
        for _ in range(100):
            out_moe_old.append(moe_old(inputs))
            out_moe.append(moe(inputs))
    out_moe_old = torch.stack(out_moe_old, dim=0)
    out_moe = torch.stack(out_moe, dim=0)
    print(f"{out_moe_old.std(dim=0).mean()=}")
    print(f"{out_moe.std(dim=0).mean()=}")
```

The printout shows the new impl is deterministic, while the old is not.
@garrett361 will get back to you as soon as I get time to look into this more...
Made a branch for ablating this bit of code here, with two alternative impls. The diff can be visualized in this PR.
The memory savings can be fairly significant for a DSv3 16B layer fwd + bwd pass: the cloning strategy uses ~3% more peak memory. The memory differences should primarily be from avoiding creating a new allocation. Cloning is also ~16% slower than the impl in this PR.
Let me know if there's a similarly memory-optimized strategy you can think of, though, or a way to make the current impl more clear; it could certainly be improved.
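For reference, a minimal sketch of how such numbers can be gathered (the layer, input, and warmup/repeat values are placeholders, not the actual ablation script):

```python
import torch
import triton.testing


def measure(layer: torch.nn.Module, inputs: torch.Tensor) -> tuple[float, float]:
    """Return (mean time in ms, peak memory in GiB) for a fwd + bwd step."""
    torch.cuda.reset_peak_memory_stats()

    def step() -> None:
        layer(inputs).sum().backward()

    time_ms = triton.testing.do_bench(step, warmup=100, rep=1000)
    peak_gib = torch.cuda.max_memory_allocated() / 2**30
    return time_ms, peak_gib
```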
tianyu-l left a comment
Please see if my comments make sense to you.
Also, please show in the PR summary that before vs. after gives the same / better numerical results, with and without TP & ETP=1 enabled.
```python
# Unsort scores and routed outputs. Also save some allocations: store unsorted scores
# and outputs in top_scores and routed_input, respectively.
top_scores = top_scores.flatten()
top_scores[token_indices_experts_sorted] = top_scores_experts_sorted
```
why do we need to modify the scores? IIUC, the scores themselves never changed; it's just the ordering that changed.
This is just undoing the shuffling (avoiding the scatter_add) and being memory efficient by saving the result in a tensor that already exists and that we don't need anymore.
Does the below version make sense to you?

```python
top_scores_experts_unsorted = torch.empty_like(top_scores_experts_sorted)
top_scores_experts_unsorted[token_indices_experts_sorted] = (
    top_scores_experts_sorted
)
routed_output_unsorted = torch.empty_like(routed_output)
routed_output_unsorted[token_indices_experts_sorted] = routed_output
routed_output_unsorted = routed_output_unsorted.reshape(
    -1, self.router.top_k, dim
)
top_scores_experts_unsorted = top_scores_experts_unsorted.reshape(
    -1, 1, self.router.top_k
)
out_experts = (
    torch.bmm(top_scores_experts_unsorted, routed_output_unsorted.float())
    .to(x.dtype)
    .squeeze(1)
)
```

The present version is equivalent, and more memory efficient.
In any case, I'll replace the current version with something more like what's above and what you suggested elsewhere, since we need changes for TP anyway and the present version is certainly confusion-prone.
My question was: can you directly use top_scores (with shape (bs * seq_len, top_k)) which is the output of self.router?
Oh, I see! Yeah we should be able to, good call.
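In that case the combine step could presumably look something like the following sketch (toy shapes and hypothetical variable names, not the final code):

```python
import torch

# Toy shapes; top_scores stands for the router output (already token-ordered) and
# routed_output_unsorted for the expert outputs after unsorting.
bs_slen, top_k, dim = 16, 2, 8
top_scores = torch.rand(bs_slen, top_k)
routed_output_unsorted = torch.randn(bs_slen * top_k, dim)

out_experts = torch.bmm(
    top_scores.reshape(-1, 1, top_k),                # (bs_slen, 1, top_k)
    routed_output_unsorted.reshape(-1, top_k, dim),  # (bs_slen, top_k, dim)
).squeeze(1)                                         # (bs_slen, dim)
```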
```python
    top_scores_experts_sorted,
    token_indices_experts_sorted,
    num_tokens_per_expert,
) = self.reorderer(top_scores, selected_experts_indices)
```
When TP is enabled and ETP=1, we'll have the following:

before: top_scores, selected_experts_indices both having shape (bs * seq_len, top_k)

after: top_scores_experts_sorted, token_indices_experts_sorted both having shape ((bs * seq_len // tp_degree) * top_k,)

Moreover, the indices in token_indices_experts_sorted refer to the original indices. So you can think of both as the no-TP case sharded along the TP dimension.

> If the last one is true, then I think I also have problems in the case where there is a shared expert since it seems like (out + out_experts).reshape(bs, slen, dim) will fail on shapes. Is that right? I see that this step wouldn't fail on upstream due to the scatter_add, but I don't yet understand why it is correct.

It is correct because scatter_add can use token_indices_experts_sorted as the index arg to add src to the correct places. Note that out still has bs * seq_len as its first dimension, but routed_output and routed_input both have (bs * seq_len // tp_degree) * top_k as their first dim.
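A toy illustration of that point (not the actual torchtitan code, and with the score weighting omitted): with global token indices, scatter_add writes a TP-sharded src into a full-size accumulator, so the shapes line up even though src only covers part of the routed rows.

```python
import torch

bs_slen, top_k, dim, tp_degree = 8, 2, 4, 2
shard_rows = bs_slen * top_k // tp_degree  # routed rows held by this TP rank

# Global token id for each routed row on this rank (toy values).
token_idx = torch.arange(bs_slen).repeat_interleave(top_k)[:shard_rows]
routed_output = torch.randn(shard_rows, dim)  # this rank's expert outputs

out_experts = torch.zeros(bs_slen, dim)  # full-size (bs * seq_len, dim) accumulator
out_experts.scatter_add_(0, token_idx.unsqueeze(-1).expand(-1, dim), routed_output)
```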
In order to make (out + out_experts).reshape(bs, slen, dim) work, we need out_experts to also have bs * seq_len as its first dim. That means it's impossible to just reuse routed_input when TP > 1 and ETP == 1. Concretely I would suggest (rough sketch below):
- make a new variable, e.g. routed_output_tokens_sorted, holding zeros of shape (bs, slen, top_k, dim)
- routed_output_tokens_sorted[token_indices_experts_sorted] = routed_output
- bmm between top_scores and token_indices_experts_sorted (after the necessary shape change on top_scores)
- finally do (out + out_experts).reshape(bs, slen, dim)
WDYT?
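Here is one reading of that suggestion as a self-contained sketch (toy shapes; the bmm is taken to be between top_scores and routed_output_tokens_sorted, and the zeros buffer is allocated flat rather than as (bs, slen, top_k, dim)):

```python
import torch

bs, slen, top_k, dim, tp_degree = 2, 8, 2, 16, 2
num_local = bs * slen * top_k // tp_degree  # routed rows on this TP rank

# Toy stand-ins: local expert outputs, their global (token, expert-slot) positions,
# the unsharded router scores, and the shared-expert output.
routed_output = torch.randn(num_local, dim)
token_indices_experts_sorted = torch.randperm(bs * slen * top_k)[:num_local]
top_scores = torch.rand(bs * slen, top_k)
out = torch.randn(bs * slen, dim)

# 1. zeros buffer covering the full (bs * slen * top_k, dim) extent
routed_output_tokens_sorted = torch.zeros(bs * slen * top_k, dim)
# 2. place each local expert output at its original position
routed_output_tokens_sorted[token_indices_experts_sorted] = routed_output
# 3. weight by scores: (bs*slen, 1, top_k) @ (bs*slen, top_k, dim) -> (bs*slen, 1, dim)
out_experts = torch.bmm(
    top_scores.unsqueeze(1),
    routed_output_tokens_sorted.reshape(bs * slen, top_k, dim),
).squeeze(1)
# 4. combine with the shared-expert output and restore (bs, slen, dim)
result = (out + out_experts).reshape(bs, slen, dim)
```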
Ok, I still need to understand this code path in a bit more detail, but I believe it makes sense.
Understanding check: is there a small typo here?
> Note that out still have bs * seq_len as first dimension, but routed_output and routed_input both have bs * seq_len as first dim. [...] That means it's impossible to just reuse routed_input, when TP > 1 and ETP==1
Should routed_{input,output} have bs * seq_len // tp_degree as their first dim? That's the obstruction?
it's a typo -- it should be (bs * seq_len // tp_degree) * top_k
PR for removing scatter_add in the MoE implementation. scatter_add is somewhat problematic as it is non-deterministic due to the necessity of atomic adds for correctness.
The current draft PR replaces the scatter_add_ with batched matmuls and has multiple correctness tests. This replacement has the effect of making the implementation ~3x more accurate, by some measures, and ~3% more performant per triton.testing.do_bench, presumably due to avoiding atomic adds. Though for yet-unknown reasons this doesn't seem to result in e2e speedups. These numbers are for a DSv3 16B-like config.
The current form is not suitable for merging (hence, draft status). Future plan before merging, if this is a direction we want:
- Remove the test_moe.py, README_DEV.md, and Makefile files.
- Remove the MoEOld class.