Refactor turbomind engine #4223
base: main
Conversation
Pull request overview
This PR performs a major refactoring of the turbomind engine architecture with the following key changes:
- Replaces `LlamaTritonModel` with a new `TurboMind` class providing a cleaner API
- Removes the old batch processing implementation (`LlamaBatch`, `LlamaV2`)
- Introduces new model abstractions: `LanguageModel`, `InputProcessor`, and `OutputProcessor` to better separate concerns
- Updates `RequestMetrics` fields to use atomic operations for thread-safe access (a sketch follows below)
- Consolidates model-related code into a unified `models` CMake target
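
The metrics item above is straightforward to illustrate. Below is a minimal sketch, assuming the counters are monotonic and are read concurrently by a reporting thread; the field names here are hypothetical, and the real definitions live in `src/turbomind/utils/metrics.h`:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch of the RequestMetrics change; the actual field names
// in src/turbomind/utils/metrics.h may differ.
struct RequestMetrics {
    // Plain integer fields are racy when a worker thread bumps a counter
    // while a metrics reader polls it; std::atomic makes each field safe
    // to access from multiple threads without a mutex.
    std::atomic<uint64_t> scheduled_tokens{0};
    std::atomic<uint64_t> generated_tokens{0};

    void add_generated(uint64_t n) {
        // Relaxed ordering suffices for independent monotonic counters.
        generated_tokens.fetch_add(n, std::memory_order_relaxed);
    }
};
```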
Reviewed changes
Copilot reviewed 102 out of 102 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `src/turbomind/utils/metrics.h` | Changed metric fields to atomic types and fixed a typo in a field name |
| `src/turbomind/turbomind.{h,cc}` | New `TurboMind` class interface replacing `LlamaTritonModel` |
| `src/turbomind/triton_backend/llama/*` | Removed old Triton backend files |
| `src/turbomind/python/bind.cpp` | Updated Python bindings to use the new `TurboMind` class |
| `src/turbomind/models/language_model.*` | New `LanguageModel` abstraction for inference |
| `src/turbomind/models/input_processor.*` | New component for handling input processing |
| `src/turbomind/models/output_processor.*` | New component for handling output processing |
| `src/turbomind/models/llama/unified_decoder.*` | Updated to work with the new architecture |
| `src/turbomind/models/llama/unified_attention_layer.*` | Refactored attention layer implementation |
| `src/turbomind/models/llama/llama_utils.cu` | Changed `isTuning()` from `thread_local` to `static` (sketched below) |
| `src/turbomind/layers/sampling_layers/*` | Removed old sampling layer files |
| `src/turbomind/kernels/sampling_kernels.h` | Changed `sampled_indexes`/`nums` types from `uint32_t` to `int` |
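
The `llama_utils.cu` entry is also easy to picture. A minimal sketch, assuming the flag is written once before worker threads begin reading it; the actual code may differ:

```cpp
// Before: thread_local gave every thread its own copy of the flag, so
// enabling tuning on one thread was invisible to the others.
// thread_local bool is_tuning = false;

// After (per the table above): a single process-wide flag shared by all
// threads. This sketch assumes the flag is set before concurrent reads
// begin; otherwise an atomic would be needed.
static bool is_tuning = false;

bool isTuning() { return is_tuning; }
void setTuning(bool value) { is_tuning = value; }
```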