
Conversation

@loci-dev commented Dec 8, 2025

Mirrored from ggml-org/llama.cpp#17859

Summary

This PR implements INI-based per-model configuration for llama-server router mode, as discussed in #17850.

Motivation

This POC targets multi-model inference servers run by small and medium teams, aiming for declarative, user-friendly configuration with minimal operational friction.

Implementation

Core Features

  • Auto-generated config.ini: Created at /config.ini on first run, one [vendor/model] section per discovered model (HF-style layout)
  • CLI to INI templating: All router flags (except a blacklist: --port, -m, etc.) are converted to LLAMA_ARG_* env var names and injected as the initial config template for each model (see the sketch after this list)
  • Standard INI format: Booleans are stored as =true and regular values as =value; users can override with =false for an explicit opt-out
  • Config priority over CLI: Existing user modifications in INI are preserved (never overwritten by new CLI args)
  • Hot-sync new args: When operators add CLI flags to the router, they're automatically synced to every model section in the INI (if not already present), making it easy to discover and apply llama.cpp arguments without manual editing
  • Env var passthrough: Child processes receive config as LLAMA_ARG_*= environment variables (empty for bools), respecting llama.cpp's native conventions
  • Per-model customization: Operators edit INI to override any parameter per model (e.g., desired quantization, --n-cpu-moe, --ctx-size)
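
A minimal sketch of the flag-name mapping the templating relies on. The actual conversion is exposed through the new common_arg_get_env_name() API; the rule shown here (strip the leading dashes, uppercase, replace '-' with '_', prefix LLAMA_ARG_) is an illustrative assumption, not the PR's exact implementation:

#include <cctype>
#include <string>

// Illustrative only: map a long-form flag like "--ctx-size" to its
// LLAMA_ARG_* env var name ("LLAMA_ARG_CTX_SIZE").
static std::string flag_to_env_name(const std::string & flag) {
    std::string name = "LLAMA_ARG_";
    for (size_t i = 2; i < flag.size(); i++) { // skip the leading "--"
        const char c = flag[i];
        name += (c == '-') ? '_' : (char) std::toupper((unsigned char) c);
    }
    return name;
}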

Technical Details

  • Uses existing PEG parser from common/peg-parser.h (thanks @aldehir for the grammar suggestion in #17850)
  • LLAMA_ARG_* env var naming simplifies CLI->INI conversion by avoiding ambiguity between short/long form flags (thanks @ngxson for pointing this out)
  • New API: common_arg_get_env_name() to map CLI flags to env var names
  • Improved model discovery (see the sketch after this list):
    • Recursive scan supporting vendor/model/*.gguf layouts
    • Picks smallest GGUF per directory (for quantization variants)
    • Auto-detects mmproj with priority: BF16 > F16 > F32
  • Thread-safe hot-reload on file modification time change
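
The discovery pass can be pictured as follows. This is a self-contained sketch under the layout assumption above (vendor/model/*.gguf); the function and variable names are illustrative, and mmproj detection is omitted for brevity:

#include <cstdint>
#include <filesystem>
#include <map>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Group *.gguf files by parent directory, then keep the smallest file per
// directory so quantization variants default to the cheapest one.
static std::map<std::string, fs::path> discover_models(const fs::path & root) {
    std::map<std::string, std::vector<fs::path>> by_dir;
    for (const auto & entry : fs::recursive_directory_iterator(root)) {
        if (entry.is_regular_file() && entry.path().extension() == ".gguf") {
            by_dir[entry.path().parent_path().string()].push_back(entry.path());
        }
    }
    std::map<std::string, fs::path> models; // "vendor/model" -> chosen gguf
    for (const auto & [dir, files] : by_dir) {
        const fs::path * best = nullptr;
        std::uintmax_t best_size = 0;
        for (const auto & f : files) {
            const std::uintmax_t size = fs::file_size(f);
            if (best == nullptr || size < best_size) {
                best = &f;
                best_size = size;
            }
        }
        // The model name is the path relative to --models-dir.
        models[fs::relative(fs::path(dir), root).string()] = *best;
    }
    return models;
}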

Example config.ini

LLAMA_CONFIG_VERSION=1

[ggml-org/gemma-3-4b-it-qat-GGUF]
LLAMA_ARG_MODEL=ggml-org/gemma-3-4b-it-qat-GGUF/gemma-3-4b-it.Q6_K.gguf
LLAMA_ARG_N_GPU_LAYERS=999
LLAMA_ARG_CTX_SIZE=32768
LLAMA_ARG_FLASH_ATTN=true

[ggml-org/gemma-3-12b-it-qat-GGUF]
LLAMA_ARG_MODEL=ggml-org/gemma-3-12b-it-qat-GGUF/gemma-3-12b-it.Q4_K_M.gguf
LLAMA_ARG_N_GPU_LAYERS=50
LLAMA_ARG_CTX_SIZE=16384
LLAMA_ARG_FLASH_ATTN=false  # Override: disable for this model

Use Case Example

A small dev team runs an inference server with 10+ models. The sysadmin sets global defaults via the router CLI:

llama-server --models-dir ./models -ngl 999 -fa -ctk q8_0 -ctv q8_0

Then fine-tunes per-model settings in config.ini:

  • Adjust quantization and context sizes for different models
  • Disable flash-attn for models with compatibility issues
  • Zero downtime config reload (edit INI, router syncs on next model load)

With this system you can override any parameter per model to tune each configuration for your GPU, and reset everything by simply deleting the INI file. It also lets beginners discover llama.cpp arguments as they go.

Testing

Tested with personal GGUF collection:

  • Multiple vendors/models with various quantizations
  • mmproj auto-detection working correctly
  • Config hot-reload verified
  • Embedding pipeline use case testing in progress

Future Work

This is a POC. Potential improvements:

  • Support for --config mycfg.ini CLI arg (alternative to --models-dir root)
  • GUI administration interface for editing the config (raises security concerns)
  • Validation of LLAMA_ARG_* keys against actual arg definitions
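
The last item could be little more than a membership test against the env names derived from the arg definitions. A minimal sketch, where known_env_names is assumed to be pre-collected via common_arg_get_env_name():

#include <set>
#include <string>

// Illustrative: accept a key only if it uses the LLAMA_ARG_ prefix and
// matches an env name actually defined by the argument parser.
static bool is_known_llama_arg(const std::string & key,
                               const std::set<std::string> & known_env_names) {
    return key.rfind("LLAMA_ARG_", 0) == 0 && known_env_names.count(key) > 0;
}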

Related

#17850
#17470
#10932

@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #486

Overview

This PR introduces INI-based per-model configuration for llama-server router mode with 575 additions across 8 files. The changes are isolated to server initialization and configuration management paths, with no modifications to core inference functions.

Key Findings

Inference Performance Impact:
No impact on tokens per second. Core inference functions (llama_decode, llama_encode, llama_tokenize) show zero changes in response time or throughput. The PR modifies only server configuration and model discovery code, which executes during initialization, not during token generation.

Observed Performance Variations:
The performance analysis identified changes in STL container operations unrelated to this PR:

  • _M_bucket_index: +36 ns response time (compiler optimization artifact)
  • end (hashtable): +78 ns response time (compiler optimization artifact)
  • consume_regex: +108 ns throughput (compiler optimization artifact)

These functions are in chat template parsing and hashtable operations, not in the inference path. The absolute changes (36-108 ns) represent negligible overhead occurring only during initialization.

Power Consumption:
Minimal changes across all binaries:

  • build.bin.llama-tts: -121 nJ (-0.048%)
  • build.bin.llama-cvector-generator: +86 nJ (+0.035%)
  • All other binaries: 0 nJ change

The absolute power consumption changes are within measurement noise, indicating no meaningful energy efficiency impact.

Code Changes:
The PR adds configuration management infrastructure:

  • New server_config_manager class for INI parsing and hot-reload
  • common_arg_get_env_name() API for CLI-to-environment variable mapping
  • Enhanced model discovery with recursive directory scanning
  • Per-model environment variable injection for child processes

All new code executes during server startup and model loading, not during inference. The implementation uses proper thread synchronization with mutexes and handles configuration hot-reload via file modification time checks.
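
A hedged sketch of that mtime-check pattern (the class and member names here are illustrative, not the PR's actual server_config_manager API):

#include <filesystem>
#include <mutex>
#include <system_error>
#include <utility>

namespace fs = std::filesystem;

// Illustrative watcher: re-parse the INI only when the file's
// modification time has changed, guarded by a mutex.
class config_watcher {
public:
    explicit config_watcher(fs::path path) : path_(std::move(path)) {}

    bool reload_if_modified() {
        std::lock_guard<std::mutex> lock(mutex_);
        std::error_code ec;
        const auto mtime = fs::last_write_time(path_, ec);
        if (ec || mtime == last_mtime_) {
            return false; // unreadable or unchanged: keep current config
        }
        last_mtime_ = mtime;
        // ... re-parse the INI under the same lock ...
        return true;
    }

private:
    fs::path path_;
    fs::file_time_type last_mtime_{};
    std::mutex mutex_;
};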

Conclusion:
This PR has zero impact on inference performance and tokens per second. The observed STL container performance variations are compiler optimization differences between versions, not caused by the code changes in this PR.

@loci-dev force-pushed the main branch 22 times, most recently from b29e20d to 0e7b989 on December 10, 2025 at 16:11
ServeurpersoCom and others added 5 commits December 10, 2025 18:16
Replace flat directory scan with recursive traversal using
std::filesystem::recursive_directory_iterator. Support for
nested vendor/model layouts (e.g. vendor/model/*.gguf).
Model name now reflects the relative path within --models-dir
instead of just the filename. Aggregate files by parent
directory via std::map before constructing local_model.
PEG parser usage improvements:
- Simplify parser instantiation (remove arena indirection)
- Optimize grammar usage (ws instead of zero_or_more, remove optional wrapping)
- Fix last line without newline bug (+ operator instead of <<)
- Remove redundant end position check

Feature scope:
- Remove auto-reload feature (will be separate PR per @ngxson)
- Keep config.ini auto-creation and template generation
- Preserve per-model customization logic

Co-authored-by: aldehir <[email protected]>
Co-authored-by: ngxson <[email protected]>
Complete rewrite of INI parser grammar and visitor:
- Use p.chars(), p.negate(), p.any() instead of p.until()
- Support end-of-line comments (key=value # comment)
- Handle EOF without trailing newline correctly
- Strict identifier validation ([a-zA-Z_][a-zA-Z0-9_.-]*)
- Simplified visitor (no pending state, no trim needed)
- Grammar handles whitespace natively via eol rule

Business validation preserved:
- Reject section names starting with LLAMA_ARG_*
- Accept only keys starting with LLAMA_ARG_*
- Require explicit section before key-value pairs

Co-authored-by: aldehir <[email protected]>
Children now receive minimal CLI args (executable, model, port, alias)
instead of inheriting all router args. Global settings pass through
LLAMA_ARG_* environment variables only, eliminating duplicate config
warnings.

Fixes: Router args like -ngl, -fa were passed both via CLI and env,
causing 'will be overwritten' warnings on every child spawn.
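
A small sketch of the spawn contract this commit describes; the helper name and parameters are illustrative, and the =true/=false handling follows the conventions stated in the PR description:

#include <string>
#include <utility>
#include <vector>

// Illustrative: build the minimal child command line plus the per-model
// environment. Global settings travel only via LLAMA_ARG_* variables, so
// nothing is duplicated on the CLI.
static void build_child_launch(const std::string & exe,
                               const std::string & model_path,
                               int port,
                               const std::string & alias,
                               const std::vector<std::pair<std::string, std::string>> & ini_kv,
                               std::vector<std::string> & argv,
                               std::vector<std::string> & envp) {
    argv = { exe, "-m", model_path, "--port", std::to_string(port), "--alias", alias };
    for (const auto & [key, value] : ini_kv) {
        if (value == "false") {
            continue;                  // explicit opt-out: not exported at all
        }
        if (value == "true") {
            envp.push_back(key + "="); // booleans are exported with empty values
        } else {
            envp.push_back(key + "=" + value);
        }
    }
}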
@loci-dev force-pushed the main branch 30 times, most recently from 4664cb4 to 799183f on December 15, 2025 at 21:09