The one-line story: Three commands take you from a broken RAG pipeline on your laptop to the same evaluation running as a hard CI gate on OpenShift.

| LAPTOP | LOCAL POD | OPENSHIFT |
|---|---|---|
| `uv run act1_green_ci.py` | `podman play kube pod.yaml` | `oc apply -f pod.yaml` |
| `uv run act2_reveal.py` | | |
| `uv run act3_edd_...` | | (same evalhub.yaml) |
| ↓ | ↓ | ↓ |
| No infra | EvalHub + adapter | EvalHub Deployment |
| Python + uv only | CI gate enforced | Kubernetes Jobs |
| Local scorers | Collection hr-rag-eval-v1 | 1,000-sample scale |
| (immediate feedback) | (reproducible + tracked) | (MLflow audit trail) |

The audience insight: the evalhub.yaml file doesn't change between environments.
The BYOP adapter image runs as a sidecar locally and as a K8s Job on OpenShift.
Same code. Same thresholds. Same gate. Different scale.
Run this 15 minutes before going on stage:
cd edd-demo/
# 1. Verify Python deps
uv run python -c "from rag_app.pipeline import RAGPipeline; print('OK')"
# 2. Start a clean pod (pulls images if not cached, starts EvalHub + adapter)
./scripts/reset-demo.sh
# 3. Configure the evalhub CLI — ONE-TIME ONLY, survives reboots
# Skip if already configured. tenant must be empty or CLI returns "(no data)".
uv run evalhub config set base_url http://localhost:8080
uv run evalhub config set tenant ""
uv run evalhub health # should show "healthy"
# 4. Warm up mock randomness (confirms pipeline + scorers run cleanly)
uv run python act1_green_ci.py
uv run python act2_reveal.py
# 5. Keep terminal font at 18pt+, white-on-dark theme

Have open in separate tabs/panes:
- Tab 1: terminal in `edd-demo/` (demo runs here)
- Tab 2: `cat evalhub.yaml` ready to show
- Tab 3: `podman ps --filter name=evalhub` to show the running pod
Show the subtitle: "CI is green. Your AI is wrong. EvalHub catches both — live."
"Quick show of hands — who's had CI go green while their AI shipped the wrong answer? Yeah. That's what we fix today. Laptops out — you can follow along."
Walk the checklist fast. Land on:
"Three weeks in. Wrong eligibility date. No error. No alert. CI: still green. Nobody tested whether the answer was right."
"Three assumptions you made when you wrote your first AI test — all wrong. Same input, different outputs. Pass/fail doesn't apply to a 0–1 score. Your fixtures don't cover what the model doesn't know. Your tests passed. Your AI was still wrong."
"The fix is a methodology shift. EDD — same discipline as TDD, different primitives. Set your threshold before you write production code. Measure on every PR. Iterate on scores, not gut feel."
"EDD needs a tool. Meet EvalHub. The key feature for this demo: Bring Your Own Provider. You write the scorer. EvalHub wraps it with CI gates, MLflow tracking, and Kubernetes scale. We'll see that live."
"Three layers. Same code, three environments. Laptop: Python and uv, instant feedback. Local pod: EvalHub server plus your adapter sidecar, CI gate enforced. OpenShift: same pod.yaml, the adapter runs as a Kubernetes Job. The evalhub.yaml doesn't change between any of them."
"Three scorers — each catches a different failure mode, all scored 0 to 1. Consistency: same question, five runs, Jaccard similarity — threshold 0.50. Catches non-determinism — users getting different answers. Domain F1: token overlap with ground truth — threshold 0.80. Catches wrong dates, wrong limits, missing facts. Confabulation: fraction of answers with NO hallucinated facts — threshold 0.90. The strictest threshold, because invented facts cause direct harm. Act 2 runs these locally. Act 3 registers them as a named EvalHub collection."
Show QR. Pause.
"If you want to follow along: scan the QR, clone the repo. Two commands: `uv sync` and `podman play kube pod.yaml`. No Kubernetes cluster. No cloud account. If you're watching, you'll see every output."
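
For reference, the three scorers described on the scorer slide (consistency, domain F1, confabulation) can be sketched in plain Python. This is a minimal illustration, not the repo's implementation: the function names, the tokenizer, and the month-day regex used to spot invented dates are all assumptions made for this example.

```python
import re
from collections import Counter
from itertools import combinations

MONTHS = "january|february|march|april|may|june|july|august|september|october|november|december"
# Assumption: "hallucinated facts" are approximated by month-day dates only.
DATE = re.compile(rf"\b(?:{MONTHS})\s+\d{{1,2}}\b", re.IGNORECASE)


def tokens(text: str) -> list[str]:
    """Lowercase alphanumeric tokens; punctuation is dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two answers' token sets."""
    set_a, set_b = set(tokens(a)), set(tokens(b))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def consistency(answers: list[str]) -> float:
    """Mean pairwise Jaccard similarity across repeated runs of one question."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def domain_f1(answer: str, truth: str) -> float:
    """Token-overlap F1 between one answer and its ground truth."""
    predicted, gold = Counter(tokens(answer)), Counter(tokens(truth))
    overlap = sum((predicted & gold).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(predicted.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


def confabulation(answers: list[str], document: str) -> float:
    """Fraction of answers whose dates all appear in the source document."""
    if not answers:
        return 1.0
    known = {" ".join(d.lower().split()) for d in DATE.findall(document)}
    clean = 0
    for answer in answers:
        dates = [" ".join(d.lower().split()) for d in DATE.findall(answer)]
        if all(d in known for d in dates):
            clean += 1
    return clean / len(answers)
```

All three return a value between 0 and 1, so each can be compared directly with the thresholds quoted in the script (0.50, 0.80, 0.90).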
Switch to terminal. Keep slides hidden.
`uv run python act1_green_ci.py`
Narrate while it runs:
"We have an HR policy RAG pipeline. We have tests. Let's run them."
When tests go green:
"All three pass. CI would mark this build green and deploy."
When answer table appears:
"Same question, three calls. Look at the answers — some have 'October 31', some have 'November 30'. The policy says November 1–15. Not one test asked if the date was right."
Pause. Let it land.
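
If someone asks why the tests stayed green, a sketch like the one below makes the gap concrete. It is hypothetical: the three structural checks and the mock answers are invented for illustration (modeled on the dates quoted in the narration) and are not the contents of `act1_green_ci.py`.

```python
import re

EXPECTED_DATES = {"November 1", "November 15"}  # the window quoted in the narration

# Hypothetical mock answers, modeled on the variants the narration mentions.
MOCK_ANSWERS = [
    "Open enrollment runs November 1 to November 15.",
    "Open enrollment closes on October 31.",
    "You can enroll until November 30.",
]


def structural_checks(answer: str) -> dict[str, bool]:
    """The kind of assertions a conventional test suite makes."""
    return {
        "returns_text": isinstance(answer, str) and bool(answer.strip()),
        "mentions_topic": "enroll" in answer.lower(),
        "is_concise": len(answer) < 300,
    }


def dates_are_correct(answer: str) -> bool:
    """The missing check: does the answer state exactly the policy dates?"""
    found = set(re.findall(r"[A-Z][a-z]+ \d{1,2}\b", answer))
    return found == EXPECTED_DATES


for answer in MOCK_ANSWERS:
    green = all(structural_checks(answer).values())
    print(f"CI green: {green!s:5}  dates correct: {dates_are_correct(answer)!s:5}  {answer}")
```

Every answer passes the structural checks; only the first one states the right dates.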
`uv run python act2_reveal.py`
Narrate:
"Same pipeline. Three local scorers. No EvalHub server — just Python. This runs on any laptop in under five seconds."
When table appears — point to red rows:
"Red rows: hallucinated dates. The confabulation scorer caught them. Domain F1 around 0.42 average — answers are missing or wrong facts. Consistency Jaccard around 0.31 — same question, wildly different answers each run."
When FAILING panel appears:
"Three scorers. All failing. Same codebase that passed CI five seconds ago."
Interactive moment:
"If you ran it on your machine — what scores are you seeing? [pause] The numbers will vary — that's the point. Non-determinism is real."
"Act 2 is your fast local feedback loop. No infra. Write the scorer, run it, know immediately. Now watch what happens when we add EvalHub."
Show the running pod briefly:
`podman ps --filter name=evalhub`
"That pod has two containers: EvalHub server on 8080, and our adapter sidecar watching a shared directory for job specs. Both share localhost: that's the pod's network namespace. One command started both: `podman play kube pod.yaml`."
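
A sidecar that watches a shared directory for job specs could look roughly like the sketch below. Everything here is an assumption for illustration: the `*.json` spec naming, the `.done` rename, and the `run_job` / `report` callables are hypothetical, and the real adapter may work differently.

```python
import json
import time
from pathlib import Path
from typing import Callable

Spec = dict
Result = dict


def process_pending_specs(
    shared_dir: Path,
    run_job: Callable[[Spec], Result],
    report: Callable[[Result], None],
) -> int:
    """Handle every unprocessed *.json spec once; return how many were handled."""
    handled = 0
    for spec_path in sorted(shared_dir.glob("*.json")):
        spec = json.loads(spec_path.read_text())
        report(run_job(spec))
        # Rename so the next poll skips this spec.
        spec_path.rename(spec_path.with_suffix(".done"))
        handled += 1
    return handled


def watch(
    shared_dir: Path,
    run_job: Callable[[Spec], Result],
    report: Callable[[Result], None],
    interval_seconds: float = 2.0,
) -> None:
    """Poll the shared directory forever, as a sidecar's main loop might."""
    while True:
        process_pending_specs(shared_dir, run_job, report)
        time.sleep(interval_seconds)
```

In this sketch `report` stands in for whatever sends results back to the EvalHub server over localhost; passing it in as a callable keeps the polling logic testable without a running pod.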
`uv run python act3_edd_collection.py`
Narrate Step 1 (provider registration):
"Step 1: register our adapter as a Bring Your Own Provider. providers/hr_rag_adapter.py — the same scorer code from Act 2, now registered as a first-class EvalHub provider."
Narrate Step 2 (collection creation):
"Step 2: create collection hr-rag-eval-v1. Three benchmarks defined once — Consistency at weight 0.5, Domain F1 and Confabulation at full weight. Weight 0.5 on consistency says: some phrasing variation is acceptable. Full weight on the other two says: wrong facts and hallucinations are not. Every future CI job references this collection by ID — no benchmark listing, no threshold duplication."
When YAML displays:
"This is evalhub.yaml. Three sections: provider — your adapter. collection — your benchmarks with weights and thresholds, defined once. job — just a model URL and a collection ID. That's the whole thing."
When job accepted:
"Job ID issued and logged. The adapter sidecar will pick up the spec, run the same scorers you just saw, and report results back via localhost. Representative scores are shown — the gate decision is deterministic: below threshold means BLOCKED."
When BLOCKED panel appears:
"All three below threshold. BLOCKED. Not by a human reviewer — by the CI gate reading EvalHub's verdict. The collection defines the bar. Your pipeline enforces it. Every run."
`uv run python act4_production.py`
Narrate:
"Act 4. Same evalhub.yaml. No changes."
Point to the K8s Job YAML:
"On OpenShift, EvalHub dispatches the adapter as a Kubernetes Job. The image is pre-built for linux/amd64 and linux/arm64: quay.io/wcabanba/hr-rag-adapter:latest — anyone can pull it."
Point to the pod ps output:
"That's our local pod. On OpenShift it's a Deployment. Same image. Same collection. Same thresholds."
When 1,000-sample results appear:
"1,000 samples instead of 5. All three still failing. Consistency 0.32: non-determinism at scale. Domain F1 0.62, against a 0.80 threshold: answers are still missing or misstating facts. Confabulation 0.71: 29% of answers invent dates that are not in the document."
When MLflow panel appears:
"Every run logged. Run ID, model version, scores, hardware context. Debug any result from any commit. Audit trail for compliance. This is what 'reproducible' means in EDD."
Closing line:
"Three commands. `uv sync`. `podman play kube pod.yaml`. `uv run python act3_edd_collection.py`. Laptop. Local pod. OpenShift. Same code. Same gate. Different scale. That is EDD."
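
For anyone who asks what the MLflow audit trail from Act 4 involves, here is a minimal sketch of logging one evaluation run. The record layout, tag names, and the `build_run_record` / `log_to_mlflow` helpers are assumptions for illustration; how EvalHub itself writes to MLflow is not shown in this script. The `mlflow` calls used (`start_run`, `log_params`, `log_metrics`, `set_tags`) are part of MLflow's tracking API.

```python
import platform


def build_run_record(model_version: str, commit: str, scores: dict[str, float]) -> dict:
    """Collect what the narration lists: model version, scores, hardware context."""
    return {
        "params": {"model_version": model_version, "commit": commit},
        "metrics": dict(scores),
        "tags": {
            "hardware.machine": platform.machine(),
            "hardware.system": platform.system(),
        },
    }


def log_to_mlflow(record: dict, run_name: str = "hr-rag-eval") -> str:
    """Log one record to the active MLflow tracking server and return the run ID."""
    import mlflow  # imported lazily so the record builder works without MLflow installed

    with mlflow.start_run(run_name=run_name) as run:
        mlflow.log_params(record["params"])
        mlflow.log_metrics(record["metrics"])
        mlflow.set_tags(record["tags"])
        return run.info.run_id
```

Keying each run by commit and model version is what lets a later reader trace a score back to the code that produced it.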
"Three things: One — traditional QA is necessary but not sufficient. The gap is structural. Two — EDD is the methodology: Define, Measure, Iterate on scores. Three — EvalHub makes it operational: Bring Your Own Provider, named collections, same code from laptop to OpenShift. Scan the QR."
"EvalHub. Because 'looks good to me' is not a benchmark. Questions?"
| What broke | Fix |
|---|---|
| Pod won't start | `./scripts/reset-demo.sh` |
| EvalHub not healthy | Wait 10 sec, retry `curl localhost:8080/api/v1/health` |
| Act 1 shows no wrong dates | Re-run — mock answers are random; Act 2 always surfaces confabulation |
| Act 3 collection creation fails | Check pod is running: `podman ps --filter name=evalhub` |
| `evalhub providers list` returns nothing | Run `uv run evalhub config set tenant ""` then retry |
| Demo computer crashes | Flip to hidden backup slides 9–12 — they contain all expected output |
| No network | Acts 1–2 run fully offline. Acts 3–4 need the local pod (pre-pulled images work offline) |
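
The "wait 10 sec, retry" fix for an unhealthy EvalHub can be scripted so nobody has to count seconds on stage. A minimal sketch, assuming the health endpoint from the table answers with HTTP 200 when the server is up; the helper names and the six-attempt limit are choices made for this example.

```python
import http.client
import time
import urllib.request
from typing import Callable

HEALTH_URL = "http://localhost:8080/api/v1/health"


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """True if the endpoint answers with HTTP 200; any failure counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        return False


def wait_until_healthy(
    check: Callable[[str], bool] = http_ok,
    url: str = HEALTH_URL,
    attempts: int = 6,
    delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry the health check up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_until_healthy()` right after `./scripts/reset-demo.sh` gives a single True/False answer before the demo starts.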
| Moment | What to say |
|---|---|
| Act 1, after green CI | "Who would have shipped this?" [pause] |
| Act 2, table appears | "What scores are you seeing on your laptop?" |
| Act 2, FAILING panel | "Before EvalHub existed, what would your team have done here?" |
| Act 3, collection created | "Every future CI run is now one line: --collection hr-rag-eval-v1" |
| Act 3, weights displayed | "Half-weight on consistency — some phrasing variation is OK. Full weight on facts and hallucinations — those are never OK." |
| Act 4, MLflow panel | "Who here has to do compliance audits? This is your paper trail." |