[WIP][DeepSeek] DeepSeek training and component integration with Titan main components #1183
base: main
Conversation
Maybe let's put this under the deepseek folder, given its dependency on transformers. We can consider upstreaming it later.
sounds good - thanks for the feedback.
expert_parallel_degree: int = 1
"""Expert parallelism degree. 1 means disabled."""
Since this is only for MoE-based models, how about we use https://github.com/pytorch/torchtitan/blob/main/docs/extension.md#extending-jobconfig and create a separate .py config file in the deepseek folder? Later we can see if we can reuse them for Llama 4 and DeepSeek.
Mostly publishing this for status updates, but this PR:

a - starts integrating the DeepSeek training loop with Torch Titan main components
b - refactors DeepSeek modeling to start using Titan components such as .toml for model config and job config (see the sketch after this list)
c - modularizes DeepSeek modeling overall
d - moves all group_gemm components into kernels so that they can be leveraged by other models (including DSGemm)
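As a minimal sketch of the direction in item b, model args could be populated from a .toml file roughly as follows. The dataclass fields, default values, and the `[model]` table name are illustrative assumptions, not the actual config shipped in this PR.

```python
# Hypothetical sketch: populate DeepSeek model args from a job config .toml.
import tomllib  # stdlib TOML parser, Python 3.11+
from dataclasses import dataclass


@dataclass
class DeepSeekModelArgs:
    # Illustrative defaults only; the real model args live in the PR.
    dim: int = 2048
    n_layers: int = 27
    n_routed_experts: int = 64
    expert_parallel_degree: int = 1  # 1 means expert parallelism is disabled


def load_model_args(path: str) -> DeepSeekModelArgs:
    """Read the [model] table of a job config .toml and build the model args."""
    with open(path, "rb") as f:
        cfg = tomllib.load(f)
    return DeepSeekModelArgs(**cfg.get("model", {}))
```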