Expert Parallelism #373

xrsrke · 2025-06-11T11:40:43Z

Run nanotron's MoE implementation:

CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=8 run_train.py --config-file /fsx/phuc/new_workspace/snippets/experiment_configs/qwen_moe/exp20b1_3A30B_qwen_moe_and_num_experts_128_and_topk12_nanotron_imple.yaml

Run nanotron's TE implementation (config file to pass via --config-file):

/fsx/phuc/new_workspace/snippets/experiment_configs/qwen_moe/exp20b0_3A30B_qwen_moe_and_num_experts_128_and_topk12_te_imple.yaml


NouamaneTazi and others added 30 commits April 14, 2025 14:16

can only merge to main from dev

1ca42e3

Fix UnBoundLocalError in clm_collator.py (#339)

0dbf24d

* Update clm_collator.py
* can only merge to main from dev (#348)

Co-authored-by: Nouamane Tazi

InitScalingMethod

d4e9daf

InitScalingMethod

6e7f0fa

eval

24d07e5

try adding lightevalrunner to trainer

438257a

amend

4f8a350

amend

c9c479d

amend

190a6b9

amend

004a89c

amend

b4cbb55

amend

d39872b

.

feb818a

amend

025f314

amend

abe75af

.

bd50c66

qos to low

2227432

add nanotron_path

b62cacd

some fix: logs, and config

802fad6

cp instead of sync

895354a

eval_interval

55a5d3e

serialize sanity checks

298492e

add output dir and s3_save path in the config

4219ec8

add output dir and s3_save path in the config

f1780ec

fix s3 only if define

016760e

fixes

85138ca

Merge branch 'nouamane/lighteval' of https://github.com/huggingface/nanotron into nouamane/lighteval

0390de2

add requeue

fefb560

move moe from qwen modeling to src/nn

f1160f1

add groupedmlp

bb8ac96

NouamaneTazi and others added 28 commits May 12, 2025 11:33

Revert "rmsnorm"

83e28d5

This reverts commit 17dad0a.

rope_seq_len_interpolation_factor

b4142da

add moe ep

d99aa4b

add saving moe checkpoints

aa4e754

add resume moe checkpoints

f0d410f

adapt edp to te moe

e04f6be

fix edp+ep saving/resume checkpoint

d36905b

use all_gather_into_tensor in token dispatcher

d24e603

use fused permute, megablock's grouped gemm, and fuse multiplying expert output in unpermute. tested correctness with ep + edp

559d532

check init, fwd in case of ep=1

5d36aef

add all2all

df8cd20

loss goes down after fixing data labels

135c715

fix bias_activation_fusion and set to True by default

98df8cc

disable collect_env

88c2ee0

preprocess_data

acad09b

fix bug

1a8fb35

assert in case of top_k=1

d678ae7

add SimpleTokenDataset

e233a7d

add timer for debug

7499256

fix all-to-all

55fb95f

scripts/scaling_moe_benchmark.py

aea7910

fix no gradients, and expert device has no tokens

d30ae0f

add compute moe params

88d4140

fix sync_tied_weights_gradients without gradient accumulator

9e687c4

fix sync_tied_weights_gradients

06faa91

fix cannot import name 'Qwen2Config'

ce5f61c

resolve merge conflicts

af69364

resolve merge conflcits, fix timer in training loop

cc09277

xrsrke requested review from NouamaneTazi and zzhhjjj June 11, 2025 13:12


7 participants
