UPSTREAM PR #17859: Server: router per model config #486
base: main
Conversation
Explore the complete analysis inside Version Insights.

Performance Analysis Summary - PR #486

Overview
This PR introduces INI-based per-model configuration for llama-server router mode with 575 additions across 8 files. The changes are isolated to server initialization and configuration management paths, with no modifications to core inference functions.

Key Findings

Inference Performance Impact / Observed Performance Variations:
These functions are in chat template parsing and hashtable operations, not in the inference path. The absolute changes (36-108 ns) represent negligible overhead occurring only during initialization.

Power Consumption:
The absolute power consumption changes are within measurement noise, indicating no meaningful energy efficiency impact.

Code Changes:
All new code executes during server startup and model loading, not during inference. The implementation uses proper thread synchronization with mutexes and handles configuration hot-reload via file modification time checks.

Conclusion:
b29e20d to 0e7b989
Replace the flat directory scan with a recursive traversal using std::filesystem::recursive_directory_iterator. Nested vendor/model layouts (e.g. vendor/model/*.gguf) are now supported, and the model name reflects the relative path within --models-dir instead of just the filename. Files are aggregated by parent directory via std::map before constructing local_model.
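A rough sketch of the traversal described in this commit, assuming hypothetical names (scan_models_dir, local_model fields); the PR's actual code may differ:

```cpp
// Sketch: recursively scan --models-dir and group GGUF files by parent
// directory, so nested vendor/model/*.gguf layouts become one model entry.
#include <filesystem>
#include <map>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical stand-in for the server's per-model record.
struct local_model {
    std::string           name;   // relative path within --models-dir
    std::vector<fs::path> files;  // all .gguf files found in that directory
};

static std::vector<local_model> scan_models_dir(const fs::path & models_dir) {
    std::map<fs::path, std::vector<fs::path>> by_parent;
    for (const auto & entry : fs::recursive_directory_iterator(models_dir)) {
        if (entry.is_regular_file() && entry.path().extension() == ".gguf") {
            by_parent[entry.path().parent_path()].push_back(entry.path());
        }
    }
    std::vector<local_model> models;
    for (auto & [parent, files] : by_parent) {
        local_model m;
        // Model name reflects the relative path, e.g. "vendor/model".
        m.name  = fs::relative(parent, models_dir).generic_string();
        m.files = std::move(files);
        models.push_back(std::move(m));
    }
    return models;
}
```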
PEG parser usage improvements:
- Simplify parser instantiation (remove arena indirection)
- Optimize grammar usage (ws instead of zero_or_more, remove optional wrapping)
- Fix last line without newline bug (+ operator instead of <<)
- Remove redundant end position check

Feature scope:
- Remove auto-reload feature (will be a separate PR per @ngxson)
- Keep config.ini auto-creation and template generation
- Preserve per-model customization logic

Co-authored-by: aldehir <[email protected]>
Co-authored-by: ngxson <[email protected]>
Complete rewrite of INI parser grammar and visitor:
- Use p.chars(), p.negate(), p.any() instead of p.until()
- Support end-of-line comments (key=value # comment)
- Handle EOF without trailing newline correctly
- Strict identifier validation ([a-zA-Z_][a-zA-Z0-9_.-]*)
- Simplified visitor (no pending state, no trim needed)
- Grammar handles whitespace natively via eol rule

Business validation preserved:
- Reject section names starting with LLAMA_ARG_*
- Accept only keys starting with LLAMA_ARG_*
- Require explicit section before key-value pairs

Co-authored-by: aldehir <[email protected]>
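The business validation rules above could be summarized as in this sketch; the helper names are illustrative, not the PR's actual code:

```cpp
// Sketch of the preserved business validation for config.ini parsing.
#include <string>

static bool starts_with(const std::string & s, const std::string & prefix) {
    return s.rfind(prefix, 0) == 0;
}

// Section names identify models and must not look like arguments.
static bool is_valid_section(const std::string & name) {
    return !starts_with(name, "LLAMA_ARG_");
}

// Only LLAMA_ARG_* keys are accepted, and only inside an explicit section.
static bool is_valid_key(const std::string & key, bool have_section) {
    return have_section && starts_with(key, "LLAMA_ARG_");
}
```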
Children now receive minimal CLI args (executable, model, port, alias) instead of inheriting all router args. Global settings pass through LLAMA_ARG_* environment variables only, eliminating duplicate config warnings.

Fixes: router args like -ngl and -fa were previously passed both via CLI and env, causing 'will be overwritten' warnings on every child spawn.
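A rough sketch of the spawn-time split this commit describes; the function, struct, and the specific env-var mappings shown are assumptions, not the router's actual code:

```cpp
// Sketch: children get only minimal CLI args; global router settings travel
// as LLAMA_ARG_* environment variables, so nothing is passed twice.
#include <string>
#include <utility>
#include <vector>

struct child_spawn_plan {
    std::vector<std::string>                         argv;  // minimal CLI
    std::vector<std::pair<std::string, std::string>> env;   // LLAMA_ARG_* only
};

static child_spawn_plan make_child_plan(const std::string & exe,
                                         const std::string & model_path,
                                         int port,
                                         const std::string & alias) {
    child_spawn_plan plan;
    // Only executable, model, port, and alias go on the command line.
    plan.argv = { exe, "-m", model_path, "--port", std::to_string(port), "--alias", alias };
    // Global defaults (what -ngl / -fa set on the router) are exported as env
    // vars instead of CLI flags, avoiding "will be overwritten" warnings.
    plan.env.push_back({ "LLAMA_ARG_N_GPU_LAYERS", "999" });  // assumed mapping for -ngl
    plan.env.push_back({ "LLAMA_ARG_FLASH_ATTN",   "1"   });  // assumed mapping for -fa
    return plan;
}
```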
4664cb4 to 799183f
Mirrored from ggml-org/llama.cpp#17859
Make sure to read the contributing guidelines before submitting a PR
Summary
This PR implements INI-based per-model configuration for llama-server router mode, as discussed in #17850.
Motivation
This POC targets multi-model inference servers run by small and medium teams, providing declarative, user-friendly configuration with zero operational friction.
Implementation
Core Features
Technical Details
Example config.ini
Use Case Example
A small dev team runs an inference server with 10+ models. The sysadmin sets global defaults via the router CLI:
llama-server --models-dir ./models -ngl 999 -fa -ctk q8_0 -ctv q8_0
Then they fine-tune per-model settings in config.ini:
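For instance, a config.ini along these lines; the model names and values here are illustrative rather than taken from the upstream PR, with each section naming a model and keys restricted to LLAMA_ARG_* as the parser requires:

```ini
[qwen2.5-coder-32b]
LLAMA_ARG_N_GPU_LAYERS=40   # too big to fully offload on this GPU
LLAMA_ARG_CTX_SIZE=16384

[llama-3.2-3b]
LLAMA_ARG_CTX_SIZE=32768    # small model, allow a larger context
```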
With this system, you can override any parameter per model to optimize each configuration for your GPU, and reset everything by simply deleting the ini file. It also lets beginners discover llama.cpp arguments as they go.
Testing
Tested with personal GGUF collection:
Future Work
This is a POC. Potential improvements:
Related
#17850
#17470
#10932