
Conversation

@loci-dev commented Dec 8, 2025

Mirrored from ggml-org/llama.cpp#17859

Summary

This PR implements INI-based per-model configuration for llama-server router mode, as discussed in #17850.

Motivation

This POC targets multi-model inference servers run by small and medium teams, aiming for declarative, user-friendly configuration with minimal operational friction.

Implementation

Core Features

  • Auto-generated config.ini: Created at /config.ini on first run, one [vendor/model] section per discovered model (HF-style layout)
  • CLI to INI templating: All router flags (except a blacklist: --port, -m, etc.) are converted to LLAMA_ARG_* env var names and injected as the initial config template for each model (see the sketch after this list)
  • Standard INI format: Booleans are stored as =true and regular values as =value; users can override with =false for an explicit opt-out
  • Config priority over CLI: Existing user modifications in INI are preserved (never overwritten by new CLI args)
  • Hot-sync new args: When operators add CLI flags to the router, they're automatically synced to every model section in the INI (if not already present), making it easy to discover and apply llama.cpp arguments without manual editing
  • Env var passthrough: Child processes receive config as LLAMA_ARG_*= environment variables (empty for bools), respecting llama.cpp's native conventions
  • Per-model customization: Operators edit INI to override any parameter per model (e.g., desired quantization, --n-cpu-moe, --ctx-size)
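
A minimal sketch of the flag-name mapping the templating relies on. The actual conversion is exposed through the new common_arg_get_env_name() API; the rule shown here (strip the leading dashes, uppercase, replace '-' with '_', prefix LLAMA_ARG_) is an illustrative assumption, not the PR's exact implementation:

#include <cctype>
#include <string>

// Illustrative only: map a long-form flag like "--ctx-size" to its
// LLAMA_ARG_* env var name ("LLAMA_ARG_CTX_SIZE").
static std::string flag_to_env_name(const std::string & flag) {
    std::string name = "LLAMA_ARG_";
    for (size_t i = 2; i < flag.size(); i++) { // skip the leading "--"
        const char c = flag[i];
        name += (c == '-') ? '_' : (char) std::toupper((unsigned char) c);
    }
    return name;
}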

Technical Details

  • Uses existing PEG parser from common/peg-parser.h (thanks @aldehir for the grammar suggestion in #17850)
  • LLAMA_ARG_* env var naming simplifies CLI->INI conversion by avoiding ambiguity between short/long form flags (thanks @ngxson for pointing this out)
  • New API: common_arg_get_env_name() to map CLI flags to env var names
  • Improved model discovery (see the sketch after this list):
    • Recursive scan supporting vendor/model/*.gguf layouts
    • Picks smallest GGUF per directory (for quantization variants)
    • Auto-detects mmproj with priority: BF16 > F16 > F32
  • Thread-safe hot-reload on file modification time change
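
The discovery pass can be pictured as follows. This is a self-contained sketch under the layout assumption above (vendor/model/*.gguf); the function and variable names are illustrative, and mmproj detection is omitted for brevity:

#include <cstdint>
#include <filesystem>
#include <map>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Group *.gguf files by parent directory, then keep the smallest file per
// directory so quantization variants default to the cheapest one.
static std::map<std::string, fs::path> discover_models(const fs::path & root) {
    std::map<std::string, std::vector<fs::path>> by_dir;
    for (const auto & entry : fs::recursive_directory_iterator(root)) {
        if (entry.is_regular_file() && entry.path().extension() == ".gguf") {
            by_dir[entry.path().parent_path().string()].push_back(entry.path());
        }
    }
    std::map<std::string, fs::path> models; // "vendor/model" -> chosen gguf
    for (const auto & [dir, files] : by_dir) {
        const fs::path * best = nullptr;
        std::uintmax_t best_size = 0;
        for (const auto & f : files) {
            const std::uintmax_t size = fs::file_size(f);
            if (best == nullptr || size < best_size) {
                best = &f;
                best_size = size;
            }
        }
        // The model name is the path relative to --models-dir.
        models[fs::relative(fs::path(dir), root).string()] = *best;
    }
    return models;
}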

Example config.ini

LLAMA_CONFIG_VERSION=1

[ggml-org/gemma-3-4b-it-qat-GGUF]
LLAMA_ARG_MODEL=ggml-org/gemma-3-4b-it-qat-GGUF/gemma-3-4b-it.Q6_K.gguf
LLAMA_ARG_N_GPU_LAYERS=999
LLAMA_ARG_CTX_SIZE=32768
LLAMA_ARG_FLASH_ATTN=true

[ggml-org/gemma-3-12b-it-qat-GGUF]
LLAMA_ARG_MODEL=ggml-org/gemma-3-12b-it-qat-GGUF/gemma-3-12b-it.Q4_K_M.gguf
LLAMA_ARG_N_GPU_LAYERS=50
LLAMA_ARG_CTX_SIZE=16384
LLAMA_ARG_FLASH_ATTN=false  # Override: disable for this model

Use Case Example

A small dev team runs an inference server with 10+ models. The sysadmin sets global defaults via the router CLI:

llama-server --models-dir ./models -ngl 999 -fa -ctk q8_0 -ctv q8_0

Then fine-tunes per-model settings in config.ini:

  • Adjust quantization and context sizes for different models
  • Disable flash-attn for models with compatibility issues
  • Zero downtime config reload (edit INI, router syncs on next model load)

With this system you can override any parameter per model to tune each configuration for your GPU, and reset everything by simply deleting the INI file. It also lets beginners discover llama.cpp arguments as they go.

Testing

Tested with personal GGUF collection:

  • Multiple vendors/models with various quantizations
  • mmproj auto-detection working correctly
  • Config hot-reload verified
  • Embedding pipeline use case testing in progress

Future Work

This is a POC. Potential improvements:

  • Support for --config mycfg.ini CLI arg (alternative to --models-dir root)
  • GUI administration interface for editing the config (raises security concerns)
  • Validation of LLAMA_ARG_* keys against actual arg definitions
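
The last item could be little more than a membership test against the env names derived from the arg definitions. A minimal sketch, where known_env_names is assumed to be pre-collected via common_arg_get_env_name():

#include <set>
#include <string>

// Illustrative: accept a key only if it uses the LLAMA_ARG_ prefix and
// matches an env name actually defined by the argument parser.
static bool is_known_llama_arg(const std::string & key,
                               const std::set<std::string> & known_env_names) {
    return key.rfind("LLAMA_ARG_", 0) == 0 && known_env_names.count(key) > 0;
}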

Related

#17850
#17470
#10932

@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #486

Overview

This PR introduces INI-based per-model configuration for llama-server router mode with 575 additions across 8 files. The changes are isolated to server initialization and configuration management paths, with no modifications to core inference functions.

Key Findings

Inference Performance Impact:
No impact on tokens per second. Core inference functions (llama_decode, llama_encode, llama_tokenize) show zero changes in response time or throughput. The PR modifies only server configuration and model discovery code, which executes during initialization, not during token generation.

Observed Performance Variations:
The performance analysis identified changes in STL container operations unrelated to this PR:

  • _M_bucket_index: +36 ns response time (compiler optimization artifact)
  • end (hashtable): +78 ns response time (compiler optimization artifact)
  • consume_regex: +108 ns throughput (compiler optimization artifact)

These functions are in chat template parsing and hashtable operations, not in the inference path. The absolute changes (36-108 ns) represent negligible overhead occurring only during initialization.

Power Consumption:
Minimal changes across all binaries:

  • build.bin.llama-tts: -121 nJ (-0.048%)
  • build.bin.llama-cvector-generator: +86 nJ (+0.035%)
  • All other binaries: 0 nJ change

The absolute power consumption changes are within measurement noise, indicating no meaningful energy efficiency impact.

Code Changes:
The PR adds configuration management infrastructure:

  • New server_config_manager class for INI parsing and hot-reload
  • common_arg_get_env_name() API for CLI-to-environment variable mapping
  • Enhanced model discovery with recursive directory scanning
  • Per-model environment variable injection for child processes

All new code executes during server startup and model loading, not during inference. The implementation uses proper thread synchronization with mutexes and handles configuration hot-reload via file modification time checks.
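
A hedged sketch of that mtime-check pattern (the class and member names here are illustrative, not the PR's actual server_config_manager API):

#include <filesystem>
#include <mutex>
#include <system_error>
#include <utility>

namespace fs = std::filesystem;

// Illustrative watcher: re-parse the INI only when the file's
// modification time has changed, guarded by a mutex.
class config_watcher {
public:
    explicit config_watcher(fs::path path) : path_(std::move(path)) {}

    bool reload_if_modified() {
        std::lock_guard<std::mutex> lock(mutex_);
        std::error_code ec;
        const auto mtime = fs::last_write_time(path_, ec);
        if (ec || mtime == last_mtime_) {
            return false; // unreadable or unchanged: keep current config
        }
        last_mtime_ = mtime;
        // ... re-parse the INI under the same lock ...
        return true;
    }

private:
    fs::path path_;
    fs::file_time_type last_mtime_{};
    std::mutex mutex_;
};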

Conclusion:
This PR has zero impact on inference performance and tokens per second. The observed STL container performance variations are compiler optimization differences between versions, not caused by the code changes in this PR.

@loci-dev force-pushed the main branch 22 times, most recently from b29e20d to 0e7b989 on December 10, 2025 at 16:11
ServeurpersoCom and others added 5 commits December 10, 2025 18:16
Replace flat directory scan with recursive traversal using
std::filesystem::recursive_directory_iterator. Support for
nested vendor/model layouts (e.g. vendor/model/*.gguf).
Model name now reflects the relative path within --models-dir
instead of just the filename. Aggregate files by parent
directory via std::map before constructing local_model.
PEG parser usage improvements:
- Simplify parser instantiation (remove arena indirection)
- Optimize grammar usage (ws instead of zero_or_more, remove optional wrapping)
- Fix last line without newline bug (+ operator instead of <<)
- Remove redundant end position check

Feature scope:
- Remove auto-reload feature (will be separate PR per @ngxson)
- Keep config.ini auto-creation and template generation
- Preserve per-model customization logic

Co-authored-by: aldehir <[email protected]>
Co-authored-by: ngxson <[email protected]>
Complete rewrite of INI parser grammar and visitor:
- Use p.chars(), p.negate(), p.any() instead of p.until()
- Support end-of-line comments (key=value # comment)
- Handle EOF without trailing newline correctly
- Strict identifier validation ([a-zA-Z_][a-zA-Z0-9_.-]*)
- Simplified visitor (no pending state, no trim needed)
- Grammar handles whitespace natively via eol rule

Business validation preserved:
- Reject section names starting with LLAMA_ARG_*
- Accept only keys starting with LLAMA_ARG_*
- Require explicit section before key-value pairs

Co-authored-by: aldehir <[email protected]>
Children now receive minimal CLI args (executable, model, port, alias)
instead of inheriting all router args. Global settings pass through
LLAMA_ARG_* environment variables only, eliminating duplicate config
warnings.

Fixes: Router args like -ngl, -fa were passed both via CLI and env,
causing 'will be overwritten' warnings on every child spawn.
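
A small sketch of the spawn contract this commit describes; the helper name and parameters are illustrative, and the =true/=false handling follows the conventions stated in the PR description:

#include <string>
#include <utility>
#include <vector>

// Illustrative: build the minimal child command line plus the per-model
// environment. Global settings travel only via LLAMA_ARG_* variables, so
// nothing is duplicated on the CLI.
static void build_child_launch(const std::string & exe,
                               const std::string & model_path,
                               int port,
                               const std::string & alias,
                               const std::vector<std::pair<std::string, std::string>> & ini_kv,
                               std::vector<std::string> & argv,
                               std::vector<std::string> & envp) {
    argv = { exe, "-m", model_path, "--port", std::to_string(port), "--alias", alias };
    for (const auto & [key, value] : ini_kv) {
        if (value == "false") {
            continue;                  // explicit opt-out: not exported at all
        }
        if (value == "true") {
            envp.push_back(key + "="); // booleans are exported with empty values
        } else {
            envp.push_back(key + "=" + value);
        }
    }
}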
@loci-dev force-pushed the main branch 30 times, most recently from 4664cb4 to 799183f on December 15, 2025 at 21:09