Refactor turbomind engine #4223
base: main
Conversation
Pull request overview
This PR performs a major refactoring of the turbomind engine architecture with the following key changes:
- Replaces `LlamaTritonModel` with a new `TurboMind` class providing a cleaner API
- Removes the old batch processing implementation (`LlamaBatch`, `LlamaV2`)
- Introduces new model abstractions: `LanguageModel`, `InputProcessor`, and `OutputProcessor` to better separate concerns
- Updates `RequestMetrics` fields to use atomic operations for thread-safe access (a sketch follows below)
- Consolidates model-related code into a unified `models` CMake target
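
The metrics item above is straightforward to illustrate. Below is a minimal sketch, assuming the counters are monotonic and are read concurrently by a reporting thread; the field names here are hypothetical, and the real definitions live in `src/turbomind/utils/metrics.h`:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch of the RequestMetrics change; the actual field names
// in src/turbomind/utils/metrics.h may differ.
struct RequestMetrics {
    // Plain integer fields are racy when a worker thread bumps a counter
    // while a metrics reader polls it; std::atomic makes each field safe
    // to access from multiple threads without a mutex.
    std::atomic<uint64_t> scheduled_tokens{0};
    std::atomic<uint64_t> generated_tokens{0};

    void add_generated(uint64_t n) {
        // Relaxed ordering suffices for independent monotonic counters.
        generated_tokens.fetch_add(n, std::memory_order_relaxed);
    }
};
```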
Reviewed changes
Copilot reviewed 102 out of 102 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `src/turbomind/utils/metrics.h` | Changed metric fields to atomic types and fixed a typo in a field name |
| `src/turbomind/turbomind.{h,cc}` | New `TurboMind` class interface replacing `LlamaTritonModel` |
| `src/turbomind/triton_backend/llama/*` | Removed old Triton backend files |
| `src/turbomind/python/bind.cpp` | Updated Python bindings to use the new `TurboMind` class |
| `src/turbomind/models/language_model.*` | New `LanguageModel` abstraction for inference |
| `src/turbomind/models/input_processor.*` | New component for handling input processing |
| `src/turbomind/models/output_processor.*` | New component for handling output processing |
| `src/turbomind/models/llama/unified_decoder.*` | Updated to work with the new architecture |
| `src/turbomind/models/llama/unified_attention_layer.*` | Refactored attention layer implementation |
| `src/turbomind/models/llama/llama_utils.cu` | Changed `isTuning()` from `thread_local` to `static` (sketched below) |
| `src/turbomind/layers/sampling_layers/*` | Removed old sampling layer files |
| `src/turbomind/kernels/sampling_kernels.h` | Changed `sampled_indexes`/`nums` types from `uint32_t` to `int` |
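
The `llama_utils.cu` entry is also easy to picture. A minimal sketch, assuming the flag is written once before worker threads begin reading it; the actual code may differ:

```cpp
// Before: thread_local gave every thread its own copy of the flag, so
// enabling tuning on one thread was invisible to the others.
// thread_local bool is_tuning = false;

// After (per the table above): a single process-wide flag shared by all
// threads. This sketch assumes the flag is set before concurrent reads
// begin; otherwise an atomic would be needed.
static bool is_tuning = false;

bool isTuning() { return is_tuning; }
void setTuning(bool value) { is_tuning = value; }
```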