llama.cpp on StartOS

Upstream repo: https://github.com/ggml-org/llama.cpp

Upstream llama-server docs: https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md

Everything not listed here behaves the same as upstream llama-server. If a flag, endpoint, or behavior is not mentioned in this document, upstream documentation is accurate and fully applicable.

llama.cpp is a high-performance C/C++ runtime for large language models in GGUF format. This package wraps its built-in HTTP server (llama-server), which exposes an OpenAI-compatible API and a small in-browser chat UI on the same port.

Image and Container Runtime

The package ships four variants, selected at build time via the VARIANT env var (driven by the Makefile):

Variant	Image	Arches	Accelerator	Offered to GPU driver
`generic`	`ghcr.io/ggml-org/llama.cpp:server`	x86_64, aarch64	CPU only	— (universal fallback)
`nvidia`	`ghcr.io/ggml-org/llama.cpp:server-cuda`	x86_64, aarch64	CUDA (NVIDIA)	`nvidia`
`rocm`	`ghcr.io/ggml-org/llama.cpp:server-rocm`	x86_64	ROCm (AMD)	`amdgpu`
`vulkan`	`ghcr.io/ggml-org/llama.cpp:server-vulkan`	x86_64, aarch64	Vulkan	`i915` (Intel)

All four variants publish under a single package version. Each declares a distinct hardwareRequirements.device (the host GPU's kernel driver), so StartOS serves each host the most specific variant its detected hardware satisfies — nvidia/rocm/vulkan for matching GPUs, and generic as the universal CPU fallback for everything else. Note that vulkan matches only Intel GPUs on the i915 driver; newer Intel GPUs on the xe driver (and non-Intel Vulkan-only setups) fall back to generic.

Property	Value
Entrypoint	`/app/llama-server`
Working dir	`/app`
Default port	8080

Volume and Data Layout

Volume	Mount Point	Purpose
`main`	`/data`	`store.json` (serve args) and `models/` (GGUF cache)

The container runs with LLAMA_CACHE=/data/models and HF_HOME=/data/huggingface, so all -hf <repo> downloads land on the persistent volume.

Installation and First-Run Flow

Step	StartOS
Install	Marketplace install or sideload `.s9pk`
First-run tasks	Two `critical` tasks: Set UI Password (created whenever no password is set) and Set Model (created whenever no model is selected). Both are created on install and re-surface if the underlying value is later cleared.
Start service	After Set Model has been run; until then the daemon idles
Pull the model	Automatic on first start (cached on the `main` volume)

Until Set Model has been run, the daemon stays in an idle (sleep infinity) state and the API port is closed — the health check reports "No model selected." Once a model is selected, llama-server is restarted with the chosen serve arguments.

Configuration Management

Serve configuration is stored at /data/store.json and managed via the Set Model action:

{
  "serveArgs": [
    "-hf",
    "unsloth/Qwen2.5-7B-Instruct-GGUF:Q4_K_M",
    "-c",
    "8192",
    "-ngl",
    "999"
  ]
}

serveArgs is the exact list of arguments appended after /app/llama-server. The daemon adds --host 0.0.0.0 and --port 8080 at runtime.

llama-server itself runs keyless — no --api-key. Access is instead gated by HTTP basic auth enforced at the StartOS reverse proxy (addSsl.auth): the OS validates credentials before any request reaches the container. The username is hard-coded to admin; the password is generated by the Set UI Password action and stored as uiPassword in store.json. setupInterfaces reads it reactively, so rotating it via the action takes effect without a manual restart. Set UI Password is a critical task, which blocks the service from starting until a password is set — so the service never runs (and the gate never serves) without one.

Dependent StartOS services reach llama.cpp over the internal service mesh (http://llama-cpp.startos:8080), which is not behind the proxy gate, so they connect keyless.

Curated presets: the Set Model action surfaces a hardware-tier-aware list of GGUF presets and disables ones too large for the detected memory:

Preset	Repo (`-hf`)	Min memory
Llama 3.2 1B Instruct	`unsloth/Llama-3.2-1B-Instruct-GGUF:Q4_K_M`	2 GB
Llama 3.2 3B Instruct	`unsloth/Llama-3.2-3B-Instruct-GGUF:Q4_K_M`	4 GB
Qwen2.5 7B Instruct	`unsloth/Qwen2.5-7B-Instruct-GGUF:Q4_K_M`	6 GB
Llama 3.1 8B Instruct	`unsloth/Meta-Llama-3.1-8B-Instruct-GGUF:Q4_K_M`	8 GB
Qwen2.5 14B Instruct	`unsloth/Qwen2.5-14B-Instruct-GGUF:Q4_K_M`	12 GB
Mistral Small 3.2 24B Instruct	`unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q4_K_M`	18 GB
Qwen3 30B-A3B Instruct	`unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF:Q4_K_M`	22 GB
Qwen2.5 32B Instruct	`unsloth/Qwen2.5-32B-Instruct-GGUF:Q4_K_M`	24 GB
Llama 3.3 70B Instruct	`unsloth/Llama-3.3-70B-Instruct-GGUF:Q4_K_M`	48 GB

The Custom variant accepts a HuggingFace repo, optional filename, context size, GPU layer count, and extra llama-server flags. For settings that can't be expressed cleanly via the form (quoted JSON, multi-word strings), edit store.json directly.

Network Access and Interfaces

Interface	Port	Protocol	Type	Purpose
llama.cpp Server	8080	HTTP	`ui`	Built-in chat UI + OpenAI-compatible API

The chat UI and the API share a single port, gated by basic auth (admin + the generated password) at the proxy. Access methods (StartOS 0.4.x): LAN IP, <hostname>.local, Tor .onion, and custom domains if configured. Browsers get a native login prompt. OpenAI-compatible clients hitting the public interface use base URL <interface-url>/v1 and must supply the basic-auth credentials (e.g. curl -u admin:<password>); other StartOS services use the keyless internal http://llama-cpp.startos:8080/v1.

Selected upstream endpoints:

Endpoint	Method	Purpose
`/v1/chat/completions`	POST	OpenAI-compatible chat
`/v1/completions`	POST	OpenAI-compatible text completion
`/v1/embeddings`	POST	Embeddings (when the loaded model supports them)
`/health`	GET	Health probe
`/props`	GET	Loaded model info

The full surface area is documented in upstream tools/server/README.md.

Actions (StartOS UI)

Action	Purpose
Set Model	Choose a curated preset (with hardware-tier-aware availability) or a custom HuggingFace GGUF. Writes `serveArgs` to `store.json` and restarts the daemon.
Set UI Password	Generate (or rotate) the web UI login password. Username is always `admin`. Returns the new credentials; the proxy gate picks them up automatically.
Delete Model Cache	Remove a specific filename from `/data/models` to reclaim disk space.

Dependencies

None.

Backups and Restore

Included in backup:

main volume — store.json and all cached GGUF weights under models/.

Restore behavior:

Serve args and any locally cached models are restored verbatim. No reconfiguration needed.

Backups can be very large depending on how many models you've cached — a single 70B Q4 file is ~40 GB.

Health Checks

Check	Method	Grace period	Messages
llama.cpp API	Port listening on 8080	60 minutes (cold-cache model downloads)	"The llama.cpp API is ready" / "The llama.cpp API is not ready" or "No model selected. Run the "Set Model" action."

Limitations and Differences

One model per process. llama-server holds a single GGUF in memory. To switch models, run Set Model again — the service restarts with the new weights.
Custom-action arg splitting. The Custom variant's Extra arguments field is split on whitespace, so JSON values with quoted spaces will not survive — edit store.json directly for those.
Hardware-tier detection is best-effort. GPU memory is read from nvidia-smi / rocm-smi; on Vulkan and unsupported topologies, the preset filter falls back to total system RAM as a memory budget.
Variants are independent installs. Switching from e.g. generic to nvidia is an uninstall + reinstall, not an in-place change; cached models on the main volume can be restored from backup.

What Is Unchanged from Upstream

The full llama-server HTTP API and built-in chat UI.
All llama-server CLI flags — anything not consumed by the package wrapper passes straight through (via the Custom variant's extra args).
HuggingFace -hf model downloads and the LLAMA_CACHE layout.
GGUF model support, embedding endpoints, OpenAI-compatible response shapes, and tool-call formats.

Contributing

See CONTRIBUTING.md for build instructions and development workflow, and UPDATING.md for the upstream-bump procedure.

Quick Reference for AI Consumers

package_id: llama-cpp
hardware_acceleration: true
variants: # all publish under one version; StartOS matches by detected GPU driver
  generic:
    image: ghcr.io/ggml-org/llama.cpp:server
    arch: [x86_64, aarch64]
    accel: cpu
    gpu_driver: null # universal CPU fallback
  nvidia:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    arch: [x86_64, aarch64]
    accel: cuda
    gpu_driver: nvidia
  rocm:
    image: ghcr.io/ggml-org/llama.cpp:server-rocm
    arch: [x86_64]
    accel: rocm
    gpu_driver: amdgpu
  vulkan:
    image: ghcr.io/ggml-org/llama.cpp:server-vulkan
    arch: [x86_64, aarch64]
    accel: vulkan
    gpu_driver: i915 # Intel GPUs only
volumes:
  main: /data
ports:
  api_and_ui: 8080
env:
  LLAMA_CACHE: /data/models
  HF_HOME: /data/huggingface
dependencies: none
auth: # llama-server runs keyless; basic auth enforced at the OS reverse proxy
  type: basic
  username: admin # hard-coded
  password: generated by set-ui-password, stored as store.json uiPassword
  internal_mesh: keyless # http://llama-cpp.startos:8080 bypasses the proxy gate
startos_managed_args: ['--host 0.0.0.0', '--port 8080']
actions:
  - set-model
  - set-ui-password
  - delete-model-cache

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
.github/workflows		.github/workflows
assets		assets
startos		startos
.dockerignore		.dockerignore
.gitignore		.gitignore
AGENTS.md		AGENTS.md
CLAUDE.md		CLAUDE.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
TODO.md		TODO.md
UPDATING.md		UPDATING.md
icon.png		icon.png
instructions.md		instructions.md
package-lock.json		package-lock.json
package.json		package.json
s9pk.mk		s9pk.mk
tsconfig.json		tsconfig.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

llama.cpp on StartOS

Table of Contents

Image and Container Runtime

Volume and Data Layout

Installation and First-Run Flow

Configuration Management

Network Access and Interfaces

Actions (StartOS UI)

Dependencies

Backups and Restore

Health Checks

Limitations and Differences

What Is Unchanged from Upstream

Contributing

Quick Reference for AI Consumers

About

Uh oh!

Releases 5

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

llama.cpp on StartOS

Table of Contents

Image and Container Runtime

Volume and Data Layout

Installation and First-Run Flow

Configuration Management

Network Access and Interfaces

Actions (StartOS UI)

Dependencies

Backups and Restore

Health Checks

Limitations and Differences

What Is Unchanged from Upstream

Contributing

Quick Reference for AI Consumers

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 5

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages