The story in one sentence: bring your evaluation script → EvalHub gives you reproducible runs, CI gates, and MLflow tracking for free.
A four-act live-coding demo for the lightning talk "Evaluation-driven development (EDD): AI applications that don't lie".
| Act | What you run | What the audience sees |
|---|---|---|
| 1 act1_green_ci.py | uv run pytest tests/ | CI is green. The AI returns wrong answers. Not one test asked if it was right. |
| 2 act2_reveal.py | Local scorers, no server | Consistency 0.3, Domain F1 0.4, Confabulation failing. Pipeline is not production-ready. |
| 3 act3_edd_collection.py | EvalHub pod + BYOP | Same scorers, now a hard CI gate. Deploy BLOCKED. Collection hr-rag-eval-v1 enforces all three. |
| 4 act4_production.py | Same command, K8s scale | Same evalhub.yaml on OpenShift. 1,000 samples. Still BLOCKED. MLflow logged. |
The arc: laptop → local pod → OpenShift. The evalhub.yaml doesn't change between any of them.
uv sync

Acts 1 and 2 are fully functional after this step. The EvalHub SDK
(eval-hub-sdk[adapter,cli,client]) is installed automatically.
podman pull quay.io/evalhub/evalhub:latest
podman pull quay.io/wcabanba/hr-rag-adapter:latest

cp .env.example .env
# Edit .env to point at a real LLM endpoint, or leave DEMO_MOCK=true (default)

./scripts/reset-demo.sh # start fresh pod, verify health
./scripts/run-demo.sh # run all four acts with pauses between them

./scripts/reset-demo.sh
./scripts/run-demo.sh --auto # continuous with narration
./scripts/run-demo.sh --auto --fast # same but shorter pauses

Record with asciinema:
asciinema rec demo.cast --title "EDD Demo" -- ./scripts/run-demo.sh --auto
asciinema play demo.cast

uv run python act1_green_ci.py # no pod needed
uv run python act2_reveal.py # no pod needed
uv run python act3_edd_collection.py # pod must be running
uv run python act4_production.py # pod must be running

scripts/run-demo.sh runs all four acts with simulated typing and narration.
| Mode | Command | Use case |
|---|---|---|
| Interactive | ./scripts/run-demo.sh | Live presentation: press ENTER to advance between acts |
| Auto | ./scripts/run-demo.sh --auto | Continuous with narration, no input needed |
| Auto fast | ./scripts/run-demo.sh --auto --fast | Shorter pauses for recording |
Interactive mode pauses between each act so you can:
- explain what just happened
- take audience questions
- wait for attendees to catch up
- then press ENTER to continue
Auto mode prints narration lines between acts, runs continuously, and works cleanly as an asciinema recording target. Use it as a backup if the live demo fails — or as the primary demo so you never have to type during a presentation.
The evalhub CLI (installed via uv sync) lets you interact with the running server
directly. It requires a one-time configuration to point at the local server.
uv run evalhub config set base_url http://localhost:8080
uv run evalhub config set tenant "" # must be empty: local server has no tenant isolation

Why tenant ""? The CLI config defaults to a tenant scope. The local EvalHub server runs with disable_auth: true and no tenant isolation. If tenant is set to any value, the CLI filters results to that tenant and returns "(no data)" even when providers exist. Setting it to an empty string disables tenant filtering.
The token field may be ignored by the local server (disable_auth: true) but must be
present for the CLI to function. If you see auth errors, set a dummy value:
uv run evalhub config set token "local-dev"

Verify the setup:

uv run evalhub health

# List registered providers
uv run evalhub providers list
# Show a specific provider and its benchmarks
uv run evalhub providers describe hr-rag-byop
# List evaluation jobs
uv run evalhub eval status
# Get results for a specific job
uv run evalhub eval results <job-id>
# Run an evaluation using the collection
uv run evalhub eval run \
--name hr-rag-test \
--model-url http://localhost:11434/v1 \
--model-name hr-rag-app \
--provider <provider-uuid> \
-b hr-consistency -b hr-domain-f1 -b hr-confabulation \
--wait

The provider UUID is shown by evalhub providers list. The demo scripts use it
automatically; the CLI commands above are for manual exploration.
./scripts/reset-demo.sh # restart pod (fresh DB), keep images — ~10 sec
./scripts/reset-demo.sh --no-pod # skip pod restart
./scripts/reset-demo.sh --full # remove + re-pull images (use if image is stale)

The reset restarts the pod with a fresh in-memory SQLite database, so provider registrations and job records are cleared. Container images are preserved, so no re-pulling is needed between demo runs.
Acts 3 & 4 run inside a two-container pod defined in pod.yaml:
┌──────────────────── evalhub-demo pod ──────────────────────┐
│ │
│ evalhub container (:8080) hr-rag-adapter container │
│ EvalHub server file-drop watcher sidecar │
│ │
│ LocalRuntime command: adapter_watcher.py: │
│ cp $SPEC /shared/jobs/ ──► polls /shared/jobs/*.json │
│ processes each spec │
│ ◄────────────────────────── POST localhost:8080/callbacks │
│ (shared pod localhost — no extra networking) │
│ │
│ emptyDir /shared/jobs (shared between both containers) │
└─────────────────────────────────────────────────────────────┘
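
The file-drop handoff in the diagram can be sketched in a few lines of Python. This is an illustration only, not the contents of adapter_watcher.py: the names `process_pending` and `watch`, the `handle` callback, and the `.done` rename used to mark handled specs are assumptions made for this sketch. Only the pattern of polling /shared/jobs/*.json and processing each spec comes from the diagram.

```python
import json
import time
from pathlib import Path
from typing import Callable


def process_pending(jobs_dir: Path, handle: Callable[[dict], None]) -> list[str]:
    """Process every *.json spec in jobs_dir once; return the handled file names.

    Hypothetical convention: a handled spec is renamed to <name>.json.done so
    the next poll skips it. The real watcher may track state differently.
    """
    handled = []
    for spec_path in sorted(jobs_dir.glob("*.json")):
        spec = json.loads(spec_path.read_text())
        handle(spec)  # e.g. run the adapter, then POST results to the callback
        spec_path.rename(spec_path.with_name(spec_path.name + ".done"))
        handled.append(spec_path.name)
    return handled


def watch(jobs_dir: Path, handle: Callable[[dict], None], interval: float = 1.0) -> None:
    """Poll jobs_dir forever; a sidecar could use this as its main loop."""
    while True:
        process_pending(jobs_dir, handle)
        time.sleep(interval)
```

In the pod, `jobs_dir` would be the shared emptyDir (/shared/jobs) and `handle` would run the adapter and post results back to the server on localhost:8080.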
Start / stop:
podman play kube pod.yaml # start (podman < 4.x)
podman kube play pod.yaml # start (podman >= 4.x)
podman play kube --down pod.yaml # stop
podman kube down pod.yaml # stop (podman >= 4.x)

Verify:

curl http://localhost:8080/api/v1/health

The same pod.yaml applies to OpenShift with oc apply -f pod.yaml:
swap the hostPort for a Route and the image refs for your internal registry.
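
If you want to script the health check rather than run curl by hand, a retry loop along these lines may help. It is a minimal sketch: the helper names `http_ok` and `wait_until_healthy`, the retry count, and the delay are assumptions for illustration. Only the health URL comes from the command above, and the sketch treats any HTTP 200 as healthy without inspecting the response body.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on url answers with HTTP 200, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def wait_until_healthy(
    url: str = "http://localhost:8080/api/v1/health",
    attempts: int = 30,
    delay: float = 1.0,
    probe: Callable[[str], bool] = http_ok,
) -> bool:
    """Call probe(url) up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

The `probe` parameter exists so the loop can be exercised without a running pod.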
providers/hr_rag_adapter.py ← your evaluation script (same scorers as Act 2)
extends FrameworkAdapter
implements run_benchmark_job()
runs consistency_score() + domain_f1() + confabulation_detected()
│
│ EvalHub drops job spec → /shared/jobs/
│ adapter_watcher.py picks it up, runs the adapter
│ results POSTed back to EvalHub at localhost:8080
▼
EvalHub server (http://localhost:8080)
provider: hr-rag-byop
collection: hr-rag-eval-v1 (3 benchmarks, defined once)
enforces gate: all_must_pass
│
│ same evalhub.yaml, zero code changes
▼
OpenShift / Kubernetes (Act 4)
adapter runs as a Kubernetes Job (not a sidecar)
same gate, same MLflow tracking, 1,000-sample scale
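
The all_must_pass gate named in the diagram amounts to a threshold check over every benchmark in the collection. The sketch below is an assumption about the semantics, not EvalHub's implementation: the function name and return shape are made up for illustration, and the thresholds are copied from the benchmark table in this README.

```python
# Minimum scores per benchmark, taken from the benchmark table in this README.
THRESHOLDS = {
    "hr-consistency": 0.50,
    "hr-domain-f1": 0.80,
    "hr-confabulation": 0.90,
}


def all_must_pass(
    scores: dict[str, float],
    thresholds: dict[str, float] = THRESHOLDS,
) -> tuple[bool, list[str]]:
    """Return (passed, failed_benchmarks).

    A benchmark fails if its score is missing or below its threshold;
    the gate passes only when nothing fails.
    """
    failed = [
        name
        for name, minimum in thresholds.items()
        if name not in scores or scores[name] < minimum
    ]
    return (not failed, failed)


if __name__ == "__main__":
    # Consistency and Domain F1 are the Act 2 values; the confabulation
    # score is a hypothetical number chosen only to show a failing gate.
    passed, failed = all_must_pass(
        {"hr-consistency": 0.3, "hr-domain-f1": 0.4, "hr-confabulation": 0.5}
    )
    print("PASS" if passed else f"BLOCKED: {', '.join(failed)}")
```

Treating a missing score as a failure is a deliberate choice in this sketch: a gate that silently skips an unreported benchmark would defeat its purpose.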
EvalHub built-in providers test the base model (LM Eval Harness, Garak, LightEval, GuideLLM). BYOP is for evaluating your application — RAG pipelines, agents, summarisers — with your own scorers.
| ID | Category | Metric | Threshold | Catches |
|---|---|---|---|---|
| hr-consistency | reliability | Jaccard (0–1) | ≥ 0.50 | Non-determinism |
| hr-domain-f1 | capability | Token F1 (0–1) | ≥ 0.80 | Wrong dates, missing facts |
| hr-confabulation | safety | Pass rate (0–1) | ≥ 0.90 | Hallucinated dates/numbers |
All three are registered in the hr-rag-eval-v1 collection. Act 2 runs them locally
(instant, no server). Act 3 enforces them as a hard gate via EvalHub.
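
For readers unfamiliar with the two similarity metrics, here is one common way to compute them. This is a sketch under assumptions, not the repository's scorers: whitespace tokenisation, lowercasing, and averaging Jaccard over all answer pairs are choices made for illustration, and the function names are deliberately different from the ones in providers/hr_rag_adapter.py.

```python
from collections import Counter
from itertools import combinations


def tokens(text: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately simple tokeniser)."""
    return text.lower().split()


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two token sets: |A ∩ B| / |A ∪ B|."""
    set_a, set_b = set(tokens(a)), set(tokens(b))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def consistency(answers: list[str]) -> float:
    """Mean pairwise Jaccard over repeated answers to the same question."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against a reference answer."""
    pred, ref = tokens(prediction), tokens(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

With this tokeniser, a four-word answer that gets a single token wrong scores a token F1 of 0.75, already below the 0.80 threshold. Note that whitespace splitting keeps punctuation attached to words, so a real scorer would likely normalise text more carefully.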
| Setting | Behaviour |
|---|---|
| DEMO_MOCK=true (default) | Pre-written answers. Fully offline. Reproducible failures. |
| DEMO_MOCK=false | Real LLM via OPENAI_BASE_URL + OPENAI_API_KEY. |
The mock pool includes deliberately wrong dates (October 31, November 30) so
the scorers always fail — reproducible regardless of model behaviour.
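
To see why a wrong date is enough to fail the confabulation benchmark, consider a check that flags any date in an answer that does not appear in the source policy. The sketch below is an assumption about how such a scorer could work, not the repository's confabulation_detected(): the regex, the function names, and the example policy sentence are all hypothetical.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
# Matches dates written as "<Month> <day>", e.g. "October 31".
DATE_RE = re.compile(rf"\b(?:{MONTHS})\s+\d{{1,2}}\b")


def unsupported_dates(answer: str, source: str) -> list[str]:
    """Return dates mentioned in the answer that never occur in the source text."""
    source_dates = set(DATE_RE.findall(source))
    return [date for date in DATE_RE.findall(answer) if date not in source_dates]


def pass_rate(answers: list[str], source: str) -> float:
    """Fraction of answers that contain no unsupported dates."""
    if not answers:
        return 1.0
    clean = sum(1 for answer in answers if not unsupported_dates(answer, source))
    return clean / len(answers)


if __name__ == "__main__":
    # Hypothetical policy sentence; the real data/policy.md may differ.
    policy = "Open enrollment closes on October 15."
    print(unsupported_dates("Enrollment closes October 31.", policy))
```

A check like this only catches one date format. Numbers, relative dates ("end of next month"), and paraphrased facts would need additional rules.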
| Script | Purpose |
|---|---|
| scripts/run-demo.sh | Run all four acts (interactive or auto/recording mode) |
| scripts/reset-demo.sh | Reset demo to clean state (preserve images) |
| scripts/build-push-adapter.sh | Build & push multi-arch adapter image to quay.io |
edd-demo/
├── act1_green_ci.py ← TDD illusion: CI passes, AI is wrong
├── act2_reveal.py ← Local scorers: consistency + domain F1 + confabulation
├── act3_edd_collection.py ← BYOP registration + collection + EvalHub gate
├── act4_production.py ← Same gate, OpenShift scale, MLflow audit trail
│
├── rag_app/pipeline.py ← The RAG pipeline being evaluated
├── data/policy.md ← HR benefits policy (the knowledge base)
├── tests/test_pipeline.py ← The "green" TDD tests (Act 1 subject)
│
├── providers/
│ ├── hr_rag_adapter.py ← BYOP FrameworkAdapter — the script you bring
│ └── adapter_watcher.py ← File-drop watcher (runs in the sidecar container)
│
├── config/config.yaml ← EvalHub server config (mounted into pod)
├── pod.yaml ← Two-container pod (EvalHub + adapter sidecar)
├── evalhub.yaml ← Provider + collection + job definition
├── Containerfile.adapter ← Builds the adapter sidecar image
├── PRESENTER_GUIDE.md ← Presenter guide: act-by-act script + recovery playbook
├── pyproject.toml ← Python dependencies (uv sync)
└── .env.example ← Environment variable template
See PRESENTER_GUIDE.md for the full live-presentation script including:
- Slide-by-slide speaker notes
- Exact words for each act transition
- Audience interaction moments
- Recovery playbook (what to do when things break)