Upstream repo: https://github.com/ggml-org/llama.cpp
Upstream
llama-serverdocs: https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.mdEverything not listed here behaves the same as upstream
llama-server. If a flag, endpoint, or behavior is not mentioned in this document, upstream documentation is accurate and fully applicable.
llama.cpp is a high-performance C/C++ runtime for large language models in GGUF format. This package wraps its built-in HTTP server (llama-server), which exposes an OpenAI-compatible API and a small in-browser chat UI on the same port.
- Image and Container Runtime
- Volume and Data Layout
- Installation and First-Run Flow
- Configuration Management
- Network Access and Interfaces
- Actions (StartOS UI)
- Dependencies
- Backups and Restore
- Health Checks
- Limitations and Differences
- What Is Unchanged from Upstream
- Contributing
- Quick Reference for AI Consumers
The package ships four variants, selected at build time via the VARIANT env var (driven by the Makefile):
| Variant | Image | Arches | Accelerator | Offered to GPU driver |
|---|---|---|---|---|
generic |
ghcr.io/ggml-org/llama.cpp:server |
x86_64, aarch64 | CPU only | — (universal fallback) |
nvidia |
ghcr.io/ggml-org/llama.cpp:server-cuda |
x86_64, aarch64 | CUDA (NVIDIA) | nvidia |
rocm |
ghcr.io/ggml-org/llama.cpp:server-rocm |
x86_64 | ROCm (AMD) | amdgpu |
vulkan |
ghcr.io/ggml-org/llama.cpp:server-vulkan |
x86_64, aarch64 | Vulkan | i915 (Intel) |
All four variants publish under a single package version. Each declares a distinct hardwareRequirements.device (the host GPU's kernel driver), so StartOS serves each host the most specific variant its detected hardware satisfies — nvidia/rocm/vulkan for matching GPUs, and generic as the universal CPU fallback for everything else. Note that vulkan matches only Intel GPUs on the i915 driver; newer Intel GPUs on the xe driver (and non-Intel Vulkan-only setups) fall back to generic.
| Property | Value |
|---|---|
| Entrypoint | /app/llama-server |
| Working dir | /app |
| Default port | 8080 |
| Volume | Mount Point | Purpose |
|---|---|---|
main |
/data |
store.json (serve args) and models/ (GGUF cache) |
The container runs with LLAMA_CACHE=/data/models and HF_HOME=/data/huggingface, so all -hf <repo> downloads land on the persistent volume.
| Step | StartOS |
|---|---|
| Install | Marketplace install or sideload .s9pk |
| First-run tasks | Two critical tasks: Set UI Password (created whenever no password is set) and Set Model (created whenever no model is selected). Both are created on install and re-surface if the underlying value is later cleared. |
| Start service | After Set Model has been run; until then the daemon idles |
| Pull the model | Automatic on first start (cached on the main volume) |
Until Set Model has been run, the daemon stays in an idle (sleep infinity) state and the API port is closed — the health check reports "No model selected." Once a model is selected, llama-server is restarted with the chosen serve arguments.
Serve configuration is stored at /data/store.json and managed via the Set Model action:
{
"serveArgs": [
"-hf",
"unsloth/Qwen2.5-7B-Instruct-GGUF:Q4_K_M",
"-c",
"8192",
"-ngl",
"999"
]
}serveArgs is the exact list of arguments appended after /app/llama-server. The daemon adds --host 0.0.0.0 and --port 8080 at runtime.
llama-server itself runs keyless — no --api-key. Access is instead gated by HTTP basic auth enforced at the StartOS reverse proxy (addSsl.auth): the OS validates credentials before any request reaches the container. The username is hard-coded to admin; the password is generated by the Set UI Password action and stored as uiPassword in store.json. setupInterfaces reads it reactively, so rotating it via the action takes effect without a manual restart. Set UI Password is a critical task, which blocks the service from starting until a password is set — so the service never runs (and the gate never serves) without one.
Dependent StartOS services reach llama.cpp over the internal service mesh (http://llama-cpp.startos:8080), which is not behind the proxy gate, so they connect keyless.
Curated presets: the Set Model action surfaces a hardware-tier-aware list of GGUF presets and disables ones too large for the detected memory:
| Preset | Repo (-hf) |
Min memory |
|---|---|---|
| Llama 3.2 1B Instruct | unsloth/Llama-3.2-1B-Instruct-GGUF:Q4_K_M |
2 GB |
| Llama 3.2 3B Instruct | unsloth/Llama-3.2-3B-Instruct-GGUF:Q4_K_M |
4 GB |
| Qwen2.5 7B Instruct | unsloth/Qwen2.5-7B-Instruct-GGUF:Q4_K_M |
6 GB |
| Llama 3.1 8B Instruct | unsloth/Meta-Llama-3.1-8B-Instruct-GGUF:Q4_K_M |
8 GB |
| Qwen2.5 14B Instruct | unsloth/Qwen2.5-14B-Instruct-GGUF:Q4_K_M |
12 GB |
| Mistral Small 3.2 24B Instruct | unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q4_K_M |
18 GB |
| Qwen3 30B-A3B Instruct | unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF:Q4_K_M |
22 GB |
| Qwen2.5 32B Instruct | unsloth/Qwen2.5-32B-Instruct-GGUF:Q4_K_M |
24 GB |
| Llama 3.3 70B Instruct | unsloth/Llama-3.3-70B-Instruct-GGUF:Q4_K_M |
48 GB |
The Custom variant accepts a HuggingFace repo, optional filename, context size, GPU layer count, and extra llama-server flags. For settings that can't be expressed cleanly via the form (quoted JSON, multi-word strings), edit store.json directly.
| Interface | Port | Protocol | Type | Purpose |
|---|---|---|---|---|
| llama.cpp Server | 8080 | HTTP | ui |
Built-in chat UI + OpenAI-compatible API |
The chat UI and the API share a single port, gated by basic auth (admin + the generated password) at the proxy. Access methods (StartOS 0.4.x): LAN IP, <hostname>.local, Tor .onion, and custom domains if configured. Browsers get a native login prompt. OpenAI-compatible clients hitting the public interface use base URL <interface-url>/v1 and must supply the basic-auth credentials (e.g. curl -u admin:<password>); other StartOS services use the keyless internal http://llama-cpp.startos:8080/v1.
Selected upstream endpoints:
| Endpoint | Method | Purpose |
|---|---|---|
/v1/chat/completions |
POST | OpenAI-compatible chat |
/v1/completions |
POST | OpenAI-compatible text completion |
/v1/embeddings |
POST | Embeddings (when the loaded model supports them) |
/health |
GET | Health probe |
/props |
GET | Loaded model info |
The full surface area is documented in upstream tools/server/README.md.
| Action | Purpose |
|---|---|
| Set Model | Choose a curated preset (with hardware-tier-aware availability) or a custom HuggingFace GGUF. Writes serveArgs to store.json and restarts the daemon. |
| Set UI Password | Generate (or rotate) the web UI login password. Username is always admin. Returns the new credentials; the proxy gate picks them up automatically. |
| Delete Model Cache | Remove a specific filename from /data/models to reclaim disk space. |
None.
Included in backup:
mainvolume —store.jsonand all cached GGUF weights undermodels/.
Restore behavior:
- Serve args and any locally cached models are restored verbatim. No reconfiguration needed.
Backups can be very large depending on how many models you've cached — a single 70B Q4 file is ~40 GB.
| Check | Method | Grace period | Messages |
|---|---|---|---|
| llama.cpp API | Port listening on 8080 | 60 minutes (cold-cache model downloads) | "The llama.cpp API is ready" / "The llama.cpp API is not ready" or "No model selected. Run the "Set Model" action." |
- One model per process. llama-server holds a single GGUF in memory. To switch models, run Set Model again — the service restarts with the new weights.
- Custom-action arg splitting. The Custom variant's
Extra argumentsfield is split on whitespace, so JSON values with quoted spaces will not survive — editstore.jsondirectly for those. - Hardware-tier detection is best-effort. GPU memory is read from
nvidia-smi/rocm-smi; on Vulkan and unsupported topologies, the preset filter falls back to total system RAM as a memory budget. - Variants are independent installs. Switching from e.g.
generictonvidiais an uninstall + reinstall, not an in-place change; cached models on themainvolume can be restored from backup.
- The full
llama-serverHTTP API and built-in chat UI. - All
llama-serverCLI flags — anything not consumed by the package wrapper passes straight through (via the Custom variant's extra args). - HuggingFace
-hfmodel downloads and theLLAMA_CACHElayout. - GGUF model support, embedding endpoints, OpenAI-compatible response shapes, and tool-call formats.
See CONTRIBUTING.md for build instructions and development workflow, and UPDATING.md for the upstream-bump procedure.
package_id: llama-cpp
hardware_acceleration: true
variants: # all publish under one version; StartOS matches by detected GPU driver
generic:
image: ghcr.io/ggml-org/llama.cpp:server
arch: [x86_64, aarch64]
accel: cpu
gpu_driver: null # universal CPU fallback
nvidia:
image: ghcr.io/ggml-org/llama.cpp:server-cuda
arch: [x86_64, aarch64]
accel: cuda
gpu_driver: nvidia
rocm:
image: ghcr.io/ggml-org/llama.cpp:server-rocm
arch: [x86_64]
accel: rocm
gpu_driver: amdgpu
vulkan:
image: ghcr.io/ggml-org/llama.cpp:server-vulkan
arch: [x86_64, aarch64]
accel: vulkan
gpu_driver: i915 # Intel GPUs only
volumes:
main: /data
ports:
api_and_ui: 8080
env:
LLAMA_CACHE: /data/models
HF_HOME: /data/huggingface
dependencies: none
auth: # llama-server runs keyless; basic auth enforced at the OS reverse proxy
type: basic
username: admin # hard-coded
password: generated by set-ui-password, stored as store.json uiPassword
internal_mesh: keyless # http://llama-cpp.startos:8080 bypasses the proxy gate
startos_managed_args: ['--host 0.0.0.0', '--port 8080']
actions:
- set-model
- set-ui-password
- delete-model-cache