williamcaban/edd-demo

EDD Demo — Evaluation-Driven Development with EvalHub

The story in one sentence: bring your evaluation script → EvalHub gives you reproducible runs, CI gates, and MLflow tracking for free.

A four-act live-coding demo for the lightning talk "Evaluation-driven development (EDD): AI applications that don't lie".


What each act shows

| Act | What you run | What the audience sees |
|-----|--------------|------------------------|
| 1 | act1_green_ci.py (uv run pytest tests/) | CI is green. The AI returns wrong answers. Not one test asked if it was right. |
| 2 | act2_reveal.py (local scorers, no server) | Consistency 0.3, Domain F1 0.4, Confabulation failing. Pipeline is not production-ready. |
| 3 | act3_edd_collection.py (EvalHub pod + BYOP) | Same scorers, now a hard CI gate. Deploy BLOCKED. Collection hr-rag-eval-v1 enforces all three. |
| 4 | act4_production.py (same command, K8s scale) | Same evalhub.yaml on OpenShift. 1,000 samples. Still BLOCKED. MLflow logged. |

The arc: laptop → local pod → OpenShift. The evalhub.yaml doesn't change between any of them.
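
The Act 1 point (green CI, wrong answers) can be illustrated with a minimal sketch. Everything here is hypothetical: answer() stands in for the RAG pipeline, the expected deadline is an invented ground truth, and the two checks are not the repository's actual tests.

```python
# Hypothetical stand-in for the RAG pipeline: it always replies, but with a wrong date.
def answer(question: str) -> str:
    return "Open enrollment closes on October 31."


# Invented ground truth, for illustration only; the real policy lives in data/policy.md.
EXPECTED_DEADLINE = "November 15"


def shape_check(text: str) -> bool:
    """The kind of assertion a 'green' test makes: a non-empty string ending in a period."""
    return isinstance(text, str) and len(text) > 0 and text.endswith(".")


def correctness_check(text: str, expected: str) -> bool:
    """The question the green test never asked: does the reply contain the right fact?"""
    return expected.lower() in text.lower()


reply = answer("When does open enrollment close?")
print("shape check passes:      ", shape_check(reply))
print("correctness check passes:", correctness_check(reply, EXPECTED_DEADLINE))
```

The shape check succeeds while the correctness check fails on the same reply, which is the gap the later acts close with scorers.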


Requirements

  • Python 3.11+
  • uv (install with brew install uv or pip install uv)
  • podman — Acts 3 & 4 only

Setup

1 — Install Python dependencies

uv sync

Acts 1 and 2 are fully functional after this step. The EvalHub SDK (eval-hub-sdk[adapter,cli,client]) is installed automatically.

2 — Pull container images (Acts 3 & 4)

podman pull quay.io/evalhub/evalhub:latest
podman pull quay.io/wcabanba/hr-rag-adapter:latest

3 — Configure (optional)

cp .env.example .env
# Edit .env to point at a real LLM endpoint, or leave DEMO_MOCK=true (default)

Running the demo

Quick start — interactive mode

./scripts/reset-demo.sh    # start fresh pod, verify health
./scripts/run-demo.sh      # run all four acts with pauses between them

Quick start — auto mode (for recording)

./scripts/reset-demo.sh
./scripts/run-demo.sh --auto        # continuous with narration
./scripts/run-demo.sh --auto --fast # same but shorter pauses

Record with asciinema:

asciinema rec demo.cast --title "EDD Demo" -- ./scripts/run-demo.sh --auto
asciinema play demo.cast

Run acts individually

uv run python act1_green_ci.py    # no pod needed
uv run python act2_reveal.py      # no pod needed
uv run python act3_edd_collection.py   # pod must be running
uv run python act4_production.py       # pod must be running

Demo runner script

scripts/run-demo.sh runs all four acts with simulated typing and narration.

| Mode | Command | Use case |
|------|---------|----------|
| Interactive | ./scripts/run-demo.sh | Live presentation: press ENTER to advance between acts |
| Auto | ./scripts/run-demo.sh --auto | Continuous with narration, no input needed |
| Auto fast | ./scripts/run-demo.sh --auto --fast | Shorter pauses for recording |

Interactive mode pauses between each act so you can:

  • explain what just happened
  • take audience questions
  • wait for attendees to catch up
  • then press ENTER to continue

Auto mode prints narration lines between acts, runs continuously, and works cleanly as an asciinema recording target. Use it as a backup if the live demo fails — or as the primary demo so you never have to type during a presentation.


Using the EvalHub CLI

The evalhub CLI (installed via uv sync) lets you interact with the running server directly. It requires a one-time configuration to point at the local server.

One-time CLI setup

uv run evalhub config set base_url http://localhost:8080
uv run evalhub config set tenant ""    # must be empty — local server has no tenant isolation

Why tenant ""? The CLI config defaults to a tenant scope. The local EvalHub server runs with disable_auth: true and no tenant isolation. If tenant is set to any value, the CLI filters results to that tenant and returns "(no data)" even when providers exist. Setting it to an empty string disables tenant filtering.

The token field may be ignored by the local server (disable_auth: true) but must be present for the CLI to function. If you see auth errors, set a dummy value:

uv run evalhub config set token "local-dev"

Verify the setup:

uv run evalhub health

Useful CLI commands

# List registered providers
uv run evalhub providers list

# Show a specific provider and its benchmarks
uv run evalhub providers describe hr-rag-byop

# List evaluation jobs
uv run evalhub eval status

# Get results for a specific job
uv run evalhub eval results <job-id>

# Run an evaluation against the three BYOP benchmarks (the members of hr-rag-eval-v1)
uv run evalhub eval run \
  --name hr-rag-test \
  --model-url http://localhost:11434/v1 \
  --model-name hr-rag-app \
  --provider <provider-uuid> \
  -b hr-consistency -b hr-domain-f1 -b hr-confabulation \
  --wait

The provider UUID is shown by evalhub providers list. The demo scripts use it automatically — the CLI commands above are for manual exploration.


Resetting between runs

./scripts/reset-demo.sh           # restart pod (fresh DB), keep images — ~10 sec
./scripts/reset-demo.sh --no-pod  # skip pod restart
./scripts/reset-demo.sh --full    # remove + re-pull images (use if image is stale)

The reset restarts the pod with a fresh in-memory SQLite database — provider registrations and job records are cleared. Container images are preserved so no re-pulling is needed between demo runs.


The EvalHub pod

Acts 3 & 4 run inside a two-container pod defined in pod.yaml:

┌──────────────────── evalhub-demo pod ──────────────────────┐
│                                                             │
│  evalhub container (:8080)    hr-rag-adapter container      │
│  EvalHub server               file-drop watcher sidecar     │
│                                                             │
│  LocalRuntime command:        adapter_watcher.py:           │
│  cp $SPEC /shared/jobs/  ──►  polls /shared/jobs/*.json     │
│                               processes each spec           │
│  ◄──────────────────────────  POST localhost:8080/callbacks │
│         (shared pod localhost — no extra networking)        │
│                                                             │
│  emptyDir /shared/jobs  (shared between both containers)    │
└─────────────────────────────────────────────────────────────┘
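
The file-drop handoff in the diagram can be sketched roughly as follows. This is not the code in adapter_watcher.py: the function names, the .done rename, and the injected run_job and post_result callables are assumptions chosen to keep the example self-contained and free of network calls.

```python
import json
import time
from pathlib import Path
from typing import Callable


def process_pending(
    jobs_dir: Path,
    run_job: Callable[[dict], dict],
    post_result: Callable[[dict], None],
) -> int:
    """Handle every *.json spec currently in jobs_dir; return how many were processed."""
    handled = 0
    for spec_path in sorted(jobs_dir.glob("*.json")):
        spec = json.loads(spec_path.read_text())
        result = run_job(spec)      # run the scorers for this job spec
        post_result(result)         # in the pod this would be the POST to /callbacks
        # Rename the spec so the next poll does not pick it up again (assumed convention).
        spec_path.rename(spec_path.with_suffix(".done"))
        handled += 1
    return handled


def watch(
    jobs_dir: Path,
    run_job: Callable[[dict], dict],
    post_result: Callable[[dict], None],
    interval: float = 1.0,
) -> None:
    """Poll forever, sleeping between passes over the shared directory."""
    while True:
        process_pending(jobs_dir, run_job, post_result)
        time.sleep(interval)
```

Passing the callback in as a function keeps the polling logic testable without a running EvalHub server; a real sidecar would supply a function that posts the result to localhost:8080.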

Start / stop:

podman play kube pod.yaml          # start (podman < 4.x)
podman kube play pod.yaml          # start (podman >= 4.x)

podman play kube --down pod.yaml   # stop (podman < 4.x)
podman kube down pod.yaml          # stop (podman >= 4.x)

Verify:

curl http://localhost:8080/api/v1/health
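
If a script needs to wait for the pod rather than check it once, the same health check can be polled from Python. This is a sketch using only the standard library; the helper names, retry count, and delay are assumptions, and the only detail taken from this README is the /api/v1/health URL.

```python
import time
import urllib.request
from typing import Callable


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses
        # (urllib's URLError and HTTPError are OSError subclasses).
        return False


def wait_for_health(
    url: str,
    attempts: int = 10,
    delay: float = 1.0,
    probe: Callable[[str], bool] = is_healthy,
) -> bool:
    """Probe the URL up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Example call (requires the pod to be running):
# wait_for_health("http://localhost:8080/api/v1/health")
```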

The same pod.yaml applies to OpenShift with oc apply -f pod.yaml — swap the hostPort for a Route and image refs for your internal registry.


How Bring Your Own Provider (BYOP) works

providers/hr_rag_adapter.py    ← your evaluation script (same scorers as Act 2)
    extends FrameworkAdapter
    implements run_benchmark_job()
    runs consistency_score() + domain_f1() + confabulation_detected()
         │
         │  EvalHub drops job spec → /shared/jobs/
         │  adapter_watcher.py picks it up, runs the adapter
         │  results POSTed back to EvalHub at localhost:8080
         ▼
EvalHub server (http://localhost:8080)
    provider: hr-rag-byop
    collection: hr-rag-eval-v1  (3 benchmarks, defined once)
    enforces gate: all_must_pass
         │
         │  same evalhub.yaml, zero code changes
         ▼
OpenShift / Kubernetes (Act 4)
    adapter runs as a Kubernetes Job (not a sidecar)
    same gate, same MLflow tracking, 1,000-sample scale

EvalHub built-in providers test the base model (LM Eval Harness, Garak, LightEval, GuideLLM). BYOP is for evaluating your application — RAG pipelines, agents, summarisers — with your own scorers.


The three BYOP benchmarks

| ID | Category | Metric | Threshold | Catches |
|----|----------|--------|-----------|---------|
| hr-consistency | reliability | Jaccard (0–1) | ≥ 0.50 | Non-determinism |
| hr-domain-f1 | capability | Token F1 (0–1) | ≥ 0.80 | Wrong dates, missing facts |
| hr-confabulation | safety | Pass rate (0–1) | ≥ 0.90 | Hallucinated dates/numbers |

All three are registered in the hr-rag-eval-v1 collection. Act 2 runs them locally (instant, no server). Act 3 enforces them as a hard gate via EvalHub.
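
To make the three metrics and the all_must_pass gate concrete, here is a minimal sketch. The function names match those mentioned for providers/hr_rag_adapter.py, but the signatures and bodies below are assumptions, not the repository's code: consistency is taken as mean pairwise Jaccard over token sets, domain F1 as bag-of-tokens F1, and confabulation as "the answer contains a number that is absent from the context" (a simplification that ignores month names).

```python
import re
from collections import Counter
from itertools import combinations


def tokens(text: str) -> list[str]:
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def consistency_score(answers: list[str]) -> float:
    """Mean pairwise Jaccard similarity between repeated answers to the same question."""
    pairs = list(combinations([set(tokens(a)) for a in answers], 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def domain_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a reference answer."""
    pred, ref = Counter(tokens(prediction)), Counter(tokens(reference))
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def confabulation_detected(answer: str, context: str) -> bool:
    """True if the answer contains a number that never appears in the context."""
    return bool(set(re.findall(r"\d+", answer)) - set(re.findall(r"\d+", context)))


# Thresholds copied from the benchmark table above.
THRESHOLDS = {"hr-consistency": 0.50, "hr-domain-f1": 0.80, "hr-confabulation": 0.90}


def gate(scores: dict[str, float]) -> bool:
    """all_must_pass: every benchmark has to meet its threshold."""
    return all(scores[name] >= minimum for name, minimum in THRESHOLDS.items())


# Consistency and F1 values are the Act 2 figures; the confabulation rate is invented.
scores = {"hr-consistency": 0.3, "hr-domain-f1": 0.4, "hr-confabulation": 0.5}
print("PASSED" if gate(scores) else "BLOCKED")
```

With the Act 2 scores the gate evaluates to BLOCKED, since a single benchmark below its threshold is enough to fail the collection.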


Mock mode vs real LLM

| Setting | Behaviour |
|---------|-----------|
| DEMO_MOCK=true (default) | Pre-written answers. Fully offline. Reproducible failures. |
| DEMO_MOCK=false | Real LLM via OPENAI_BASE_URL + OPENAI_API_KEY. |

The mock pool includes deliberately wrong dates (October 31, November 30) so the scorers always fail — reproducible regardless of model behaviour.
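
One way such a toggle could work is sketched below. Only the DEMO_MOCK variable, its default of true, and the two wrong dates come from this README; the answer wording, the parsing rule, and the seeded selection are assumptions, not the repository's implementation.

```python
import os
import random
from typing import Mapping

# Hypothetical canned answers; only the two wrong dates are taken from the README.
MOCK_POOL = [
    "Open enrollment closes on October 31.",
    "Open enrollment closes on November 30.",
]


def mock_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Treat DEMO_MOCK as true unless it is explicitly set to 'false' (assumed parsing rule)."""
    return env.get("DEMO_MOCK", "true").strip().lower() != "false"


def mock_answer(question: str, seed: int = 0) -> str:
    """Pick a canned answer deterministically so reruns fail the same way."""
    rng = random.Random(f"{seed}:{question}")  # string seeds are stable across runs
    return rng.choice(MOCK_POOL)


if mock_enabled():
    print(mock_answer("When does open enrollment close?"))
```

Seeding the generator from the question keeps each run offline and repeatable, which is what makes the failures reproducible on stage.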


Scripts

| Script | Purpose |
|--------|---------|
| scripts/run-demo.sh | Run all four acts (interactive or auto/recording mode) |
| scripts/reset-demo.sh | Reset demo to clean state (preserve images) |
| scripts/build-push-adapter.sh | Build & push multi-arch adapter image to quay.io |

Files at a glance

edd-demo/
├── act1_green_ci.py           ← TDD illusion: CI passes, AI is wrong
├── act2_reveal.py             ← Local scorers: consistency + domain F1 + confabulation
├── act3_edd_collection.py     ← BYOP registration + collection + EvalHub gate
├── act4_production.py         ← Same gate, OpenShift scale, MLflow audit trail
│
├── rag_app/pipeline.py        ← The RAG pipeline being evaluated
├── data/policy.md             ← HR benefits policy (the knowledge base)
├── tests/test_pipeline.py     ← The "green" TDD tests (Act 1 subject)
│
├── providers/
│   ├── hr_rag_adapter.py      ← BYOP FrameworkAdapter — the script you bring
│   └── adapter_watcher.py     ← File-drop watcher (runs in the sidecar container)
│
├── config/config.yaml         ← EvalHub server config (mounted into pod)
├── pod.yaml                   ← Two-container pod (EvalHub + adapter sidecar)
├── evalhub.yaml               ← Provider + collection + job definition
├── Containerfile.adapter      ← Builds the adapter sidecar image
├── PRESENTER_GUIDE.md         ← Presenter guide: act-by-act script + recovery playbook
├── pyproject.toml             ← Python dependencies (uv sync)
└── .env.example               ← Environment variable template

Presenter guide

See PRESENTER_GUIDE.md for the full live-presentation script including:

  • Slide-by-slide speaker notes
  • Exact words for each act transition
  • Audience interaction moments
  • Recovery playbook (what to do when things break)
