EDD Demo — Evaluation-Driven Development with EvalHub

The story in one sentence: bring your evaluation script → EvalHub gives you reproducible runs, CI gates, and MLflow tracking for free.

A four-act live-coding demo for the lightning talk "Evaluation-driven development (EDD): AI applications that don't lie".

What each act shows

Act	What you run	What the audience sees
1 `act1_green_ci.py`	`uv run pytest tests/`	CI is green. The AI returns wrong answers. Not one test asked if it was right.
2 `act2_reveal.py`	Local scorers, no server	Consistency 0.3, Domain F1 0.4, Confabulation failing. Pipeline is not production-ready.
3 `act3_edd_collection.py`	EvalHub pod + BYOP	Same scorers, now a hard CI gate. Deploy BLOCKED. Collection `hr-rag-eval-v1` enforces all three.
4 `act4_production.py`	Same command, K8s scale	Same `evalhub.yaml` on OpenShift. 1,000 samples. Still BLOCKED. MLflow logged.

The arc: laptop → local pod → OpenShift. The evalhub.yaml doesn't change between any of them.
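
To make Act 1's point concrete, here is a minimal sketch of a test that keeps CI green while the answer is wrong. Everything in it is a hypothetical stand-in: the `answer` function, the question text, and the "correct" date of November 15 are assumptions for illustration, not the contents of `rag_app/pipeline.py` or `tests/test_pipeline.py`.

```python
# Hypothetical stand-in for a RAG pipeline; not the demo's rag_app/pipeline.py.
def answer(question: str) -> str:
    # Fluent, well-formed, and wrong (this sketch assumes the policy says November 15).
    return "Open enrollment closes on October 31."


def test_answer_is_well_formed() -> None:
    """A typical 'green' test: it checks shape and never checks correctness."""
    result = answer("When does open enrollment close?")
    assert isinstance(result, str)
    assert len(result) > 0
    assert result.endswith(".")


def is_correct(result: str, expected_fact: str) -> bool:
    """The evaluation-style question that the shape test never asks."""
    return expected_fact in result
```

The shape test passes, yet `is_correct(answer("When does open enrollment close?"), "November 15")` is `False`. That gap is what the scorers in Act 2 are meant to expose.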

Requirements

Python 3.11+
uv — brew install uv or pip install uv
podman — Acts 3 & 4 only

Setup

1 — Install Python dependencies

uv sync

Acts 1 and 2 are fully functional after this step. The EvalHub SDK (eval-hub-sdk[adapter,cli,client]) is installed automatically.

2 — Pull container images (Acts 3 & 4)

podman pull quay.io/evalhub/evalhub:latest
podman pull quay.io/wcabanba/hr-rag-adapter:latest

3 — Configure (optional)

cp .env.example .env
# Edit .env to point at a real LLM endpoint, or leave DEMO_MOCK=true (default)

Running the demo

Quick start — interactive mode

./scripts/reset-demo.sh    # start fresh pod, verify health
./scripts/run-demo.sh      # run all four acts with pauses between them

Quick start — auto mode (for recording)

./scripts/reset-demo.sh
./scripts/run-demo.sh --auto        # continuous with narration
./scripts/run-demo.sh --auto --fast # same but shorter pauses

Record with asciinema:

asciinema rec demo.cast --title "EDD Demo" -- ./scripts/run-demo.sh --auto
asciinema play demo.cast

Run acts individually

uv run python act1_green_ci.py    # no pod needed
uv run python act2_reveal.py      # no pod needed
uv run python act3_edd_collection.py   # pod must be running
uv run python act4_production.py       # pod must be running

Demo runner script

scripts/run-demo.sh runs all four acts with simulated typing and narration.

Mode	Command	Use case
Interactive	`./scripts/run-demo.sh`	Live presentation — ENTER to advance between acts
Auto	`./scripts/run-demo.sh --auto`	Continuous with narration — no input needed
Auto fast	`./scripts/run-demo.sh --auto --fast`	Shorter pauses for recording

Interactive mode pauses between acts so you can:

explain what just happened
take audience questions
wait for attendees to catch up

Press ENTER to continue to the next act.

Auto mode prints narration lines between acts, runs continuously, and works cleanly as an asciinema recording target. Use it as a backup if the live demo fails — or as the primary demo so you never have to type during a presentation.

Using the EvalHub CLI

The evalhub CLI (installed via uv sync) lets you interact with the running server directly. It requires a one-time configuration to point at the local server.

One-time CLI setup

uv run evalhub config set base_url http://localhost:8080
uv run evalhub config set tenant ""    # must be empty — local server has no tenant isolation

Why tenant ""? The CLI config defaults to a tenant scope. The local EvalHub server runs with disable_auth: true and no tenant isolation. If tenant is set to any value, the CLI filters results to that tenant and returns "(no data)" even when providers exist. Setting it to an empty string disables tenant filtering.

The local server runs with disable_auth: true and may ignore the token field, but the CLI still needs the field to be present. If you see auth errors, set a dummy value:

uv run evalhub config set token "local-dev"

Verify the setup:

uv run evalhub health

Useful CLI commands

# List registered providers
uv run evalhub providers list

# Show a specific provider and its benchmarks
uv run evalhub providers describe hr-rag-byop

# List evaluation jobs
uv run evalhub eval status

# Get results for a specific job
uv run evalhub eval results <job-id>

# Run the collection's three benchmarks against a model endpoint
uv run evalhub eval run \
  --name hr-rag-test \
  --model-url http://localhost:11434/v1 \
  --model-name hr-rag-app \
  --provider <provider-uuid> \
  -b hr-consistency -b hr-domain-f1 -b hr-confabulation \
  --wait

The provider UUID is shown by evalhub providers list. The demo scripts use it automatically — the CLI commands above are for manual exploration.

Resetting between runs

./scripts/reset-demo.sh           # restart pod (fresh DB), keep images — ~10 sec
./scripts/reset-demo.sh --no-pod  # skip pod restart
./scripts/reset-demo.sh --full    # remove + re-pull images (use if image is stale)

The reset restarts the pod with a fresh in-memory SQLite database — provider registrations and job records are cleared. Container images are preserved so no re-pulling is needed between demo runs.
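
Why a restart clears everything: an in-memory SQLite database lives only as long as its connection. The sketch below shows that behaviour with Python's standard `sqlite3` module. The `providers` table and its single column are illustrative assumptions, not EvalHub's actual schema.

```python
import sqlite3


def count_providers(conn: sqlite3.Connection) -> int:
    """Create the (illustrative) table if needed and count its rows."""
    conn.execute("CREATE TABLE IF NOT EXISTS providers (name TEXT)")
    return conn.execute("SELECT COUNT(*) FROM providers").fetchone()[0]


first = sqlite3.connect(":memory:")
count_providers(first)
first.execute("INSERT INTO providers VALUES ('hr-rag-byop')")
rows_before = count_providers(first)   # one registration is visible
first.close()                          # stands in for the pod restart

second = sqlite3.connect(":memory:")   # a new in-memory database starts empty
rows_after = count_providers(second)
second.close()
```

After the first connection closes, the registration is gone, which is why the demo scripts register the provider again after each reset.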

The EvalHub pod

Acts 3 & 4 run inside a two-container pod defined in pod.yaml:

┌──────────────────── evalhub-demo pod ──────────────────────┐
│                                                             │
│  evalhub container (:8080)    hr-rag-adapter container      │
│  EvalHub server               file-drop watcher sidecar     │
│                                                             │
│  LocalRuntime command:        adapter_watcher.py:           │
│  cp $SPEC /shared/jobs/  ──►  polls /shared/jobs/*.json     │
│                               processes each spec           │
│  ◄──────────────────────────  POST localhost:8080/callbacks │
│         (shared pod localhost — no extra networking)        │
│                                                             │
│  emptyDir /shared/jobs  (shared between both containers)    │
└─────────────────────────────────────────────────────────────┘
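
The file-drop handshake in the diagram can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the demo's `adapter_watcher.py`: the function name, the rename to `.done`, and the result payload are hypothetical, and the callback POST is injected as a callable so the sketch runs without a server.

```python
import json
from pathlib import Path
from typing import Callable


def process_pending(jobs_dir: Path,
                    run_job: Callable[[dict], dict],
                    post_callback: Callable[[dict], None]) -> int:
    """Process every *.json spec in jobs_dir once; return how many were handled."""
    handled = 0
    for spec_path in sorted(jobs_dir.glob("*.json")):
        spec = json.loads(spec_path.read_text())
        result = run_job(spec)          # run the evaluation described by the spec
        post_callback(result)           # report the result back to the server
        # Rename the spec so the next poll does not pick it up again (assumed convention).
        spec_path.rename(spec_path.with_suffix(".done"))
        handled += 1
    return handled
```

A sidecar built this way would call `process_pending` in a loop with a short sleep, passing a `post_callback` that sends the result to the server over the pod's shared localhost.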

Start / stop:

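podman play kube pod.yaml          # start (legacy syntax, older podman releases)
podman kube play pod.yaml          # start (current syntax, newer podman releases)

podman play kube --down pod.yaml   # stop (legacy syntax, older podman releases)
podman kube down pod.yaml          # stop (current syntax, newer podman releases)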

Verify:

curl http://localhost:8080/api/v1/health

The same pod.yaml applies to OpenShift with oc apply -f pod.yaml — swap the hostPort for a Route and image refs for your internal registry.

How Bring Your Own Provider (BYOP) works

providers/hr_rag_adapter.py    ← your evaluation script (same scorers as Act 2)
    extends FrameworkAdapter
    implements run_benchmark_job()
    runs consistency_score() + domain_f1() + confabulation_detected()
         │
         │  EvalHub drops job spec → /shared/jobs/
         │  adapter_watcher.py picks it up, runs the adapter
         │  results POSTed back to EvalHub at localhost:8080
         ▼
EvalHub server (http://localhost:8080)
    provider: hr-rag-byop
    collection: hr-rag-eval-v1  (3 benchmarks, defined once)
    enforces gate: all_must_pass
         │
         │  same evalhub.yaml, zero code changes
         ▼
OpenShift / Kubernetes (Act 4)
    adapter runs as a Kubernetes Job (not a sidecar)
    same gate, same MLflow tracking, 1,000-sample scale

EvalHub's built-in providers (LM Eval Harness, Garak, LightEval, GuideLLM) test the base model. BYOP is for evaluating your application (RAG pipelines, agents, summarisers) with your own scorers.
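
The shape of a BYOP adapter can be sketched without the SDK. The class below is a stand-in only: it does not subclass the SDK's `FrameworkAdapter`, and the `spec` keys (`benchmark_id`, `answers`) and the return payload are assumptions made for this example. The real `run_benchmark_job()` signature comes from the EvalHub SDK and is not reproduced here.

```python
from typing import Callable

# Hypothetical scorer type: takes the pipeline's answers, returns a score in [0, 1].
Scorer = Callable[[list[str]], float]


class SketchAdapter:
    """Stand-in for a BYOP adapter; does NOT subclass the SDK's FrameworkAdapter."""

    def __init__(self, scorers: dict[str, Scorer]) -> None:
        # Map each benchmark ID to the scorer that implements it.
        self.scorers = scorers

    def run_benchmark_job(self, spec: dict) -> dict:
        """Score the answers in a job spec with the scorer for its benchmark ID."""
        benchmark_id = spec["benchmark_id"]
        score = self.scorers[benchmark_id](spec["answers"])
        return {"benchmark_id": benchmark_id, "score": score}
```

The design point is that the scorers stay plain functions, so the code that runs locally in Act 2 can be reused unchanged behind the adapter in Acts 3 and 4.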

The three BYOP benchmarks

ID	Category	Metric	Threshold	Catches
`hr-consistency`	reliability	Jaccard (0–1)	≥ 0.50	Non-determinism
`hr-domain-f1`	capability	Token F1 (0–1)	≥ 0.80	Wrong dates, missing facts
`hr-confabulation`	safety	Pass rate (0–1)	≥ 0.90	Hallucinated dates/numbers

All three are registered in the hr-rag-eval-v1 collection. Act 2 runs them locally (instant, no server). Act 3 enforces them as a hard gate via EvalHub.
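
For readers unfamiliar with the metrics, here is a minimal sketch of Jaccard similarity, token F1, and an all-must-pass gate. The function names, the lowercase whitespace tokenisation, and the empty-input handling are assumptions for this example; the demo's own `consistency_score()` and `domain_f1()` may differ in detail.

```python
from collections import Counter

# Thresholds copied from the benchmark table above.
THRESHOLDS = {
    "hr-consistency": 0.50,
    "hr-domain-f1": 0.80,
    "hr-confabulation": 0.90,
}


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two answers: |A & B| / |A | B|."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty answers are treated as identical (assumption)
    return len(set_a & set_b) / len(set_a | set_b)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against a reference answer."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def all_must_pass(scores: dict[str, float]) -> bool:
    """Gate: every benchmark must meet or exceed its threshold."""
    return all(scores[name] >= minimum for name, minimum in THRESHOLDS.items())
```

With the Act 2 numbers (consistency 0.3, domain F1 0.4), `all_must_pass` returns `False` whatever the confabulation pass rate is, which is the BLOCKED result shown in Act 3.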

Mock mode vs real LLM

Setting	Behaviour
`DEMO_MOCK=true` (default)	Pre-written answers. Fully offline. Reproducible failures.
`DEMO_MOCK=false`	Real LLM via `OPENAI_BASE_URL` + `OPENAI_API_KEY`.

The mock pool includes deliberately wrong dates (October 31, November 30) so the scorers always fail — reproducible regardless of model behaviour.
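
A minimal sketch of how such a toggle can be read from the environment. The function names are hypothetical, and treating `0` and `no` as false is an assumption of this example; only `DEMO_MOCK`, `OPENAI_BASE_URL`, and `OPENAI_API_KEY` come from the settings above.

```python
import os
from collections.abc import Mapping


def use_mock(env: Mapping[str, str] = os.environ) -> bool:
    """DEMO_MOCK defaults to true; only an explicit false-like value disables it."""
    return env.get("DEMO_MOCK", "true").strip().lower() not in {"false", "0", "no"}


def llm_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Collect the real-LLM settings, failing early if one is missing."""
    missing = [key for key in ("OPENAI_BASE_URL", "OPENAI_API_KEY") if not env.get(key)]
    if missing:
        raise RuntimeError(f"DEMO_MOCK=false requires: {', '.join(missing)}")
    return {"base_url": env["OPENAI_BASE_URL"], "api_key": env["OPENAI_API_KEY"]}
```

Failing early on a missing variable keeps a misconfigured `.env` from surfacing as a confusing error in the middle of a live act.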

Scripts

Script	Purpose
`scripts/run-demo.sh`	Run all four acts (interactive or auto/recording mode)
`scripts/reset-demo.sh`	Reset demo to clean state (preserve images)
`scripts/build-push-adapter.sh`	Build & push multi-arch adapter image to quay.io

Files at a glance

edd-demo/
├── act1_green_ci.py           ← TDD illusion: CI passes, AI is wrong
├── act2_reveal.py             ← Local scorers: consistency + domain F1 + confabulation
├── act3_edd_collection.py     ← BYOP registration + collection + EvalHub gate
├── act4_production.py         ← Same gate, OpenShift scale, MLflow audit trail
│
├── rag_app/pipeline.py        ← The RAG pipeline being evaluated
├── data/policy.md             ← HR benefits policy (the knowledge base)
├── tests/test_pipeline.py     ← The "green" TDD tests (Act 1 subject)
│
├── providers/
│   ├── hr_rag_adapter.py      ← BYOP FrameworkAdapter — the script you bring
│   └── adapter_watcher.py     ← File-drop watcher (runs in the sidecar container)
│
├── config/config.yaml         ← EvalHub server config (mounted into pod)
├── pod.yaml                   ← Two-container pod (EvalHub + adapter sidecar)
├── evalhub.yaml               ← Provider + collection + job definition
├── Containerfile.adapter      ← Builds the adapter sidecar image
├── PRESENTER_GUIDE.md         ← Presenter guide: act-by-act script + recovery playbook
├── pyproject.toml             ← Python dependencies (uv sync)
└── .env.example               ← Environment variable template

Presenter guide

See PRESENTER_GUIDE.md for the full live-presentation script including:

Slide-by-slide speaker notes
Exact words for each act transition
Audience interaction moments
Recovery playbook (what to do when things break)
