Live Demo Presenter Guide

EDD: From Your Laptop to Production

The one-line story: Three commands take you from a broken RAG pipeline on your laptop to the same evaluation running as a hard CI gate on OpenShift.


The Arc

LAPTOP                    LOCAL POD                 OPENSHIFT
──────────────────────    ──────────────────────    ──────────────────────
uv run act1_green_ci.py   podman play kube          oc apply -f pod.yaml
uv run act2_reveal.py     pod.yaml
                          uv run act3_edd_...        (same evalhub.yaml)
↓                         ↓                         ↓
No infra                  EvalHub + adapter          EvalHub Deployment
Python + uv only          CI gate enforced           Kubernetes Jobs
Local scorers             Collection hr-rag-eval-v1  1,000-sample scale
(immediate feedback)      (reproducible + tracked)   (MLflow audit trail)

The audience insight: the evalhub.yaml file doesn't change between environments. The BYOP (Bring Your Own Provider) adapter image runs as a sidecar locally and as a K8s Job on OpenShift. Same code. Same thresholds. Same gate. Different scale.


Before the Demo (Presenter Checklist)

Run this 15 minutes before going on stage:

cd edd-demo/

# 1. Verify Python deps
uv run python -c "from rag_app.pipeline import RAGPipeline; print('OK')"

# 2. Start a clean pod (pulls images if not cached, starts EvalHub + adapter)
./scripts/reset-demo.sh

# 3. Configure the evalhub CLI — ONE-TIME ONLY, survives reboots
#    Skip if already configured. tenant must be empty or CLI returns "(no data)".
uv run evalhub config set base_url http://localhost:8080
uv run evalhub config set tenant ""
uv run evalhub health   # should show "healthy"

# 4. Warm up mock randomness (confirms pipeline + scorers run cleanly)
uv run python act1_green_ci.py
uv run python act2_reveal.py

# 5. Keep terminal font at 18pt+, white-on-dark theme

Have open in separate tabs/panes:

  • Tab 1: terminal in edd-demo/ (demo runs here)
  • Tab 2: cat evalhub.yaml ready to show
  • Tab 3: podman ps --filter name=evalhub to show the running pod

Slide-by-Slide + Demo Script

SLIDE 1 — Title (30 sec)

Show the subtitle: "CI is green. Your AI is wrong. EvalHub catches both — live."

"Quick show of hands — who's had CI go green while their AI shipped the wrong answer? Yeah. That's what we fix today. Laptops out — you can follow along."


SLIDE 2 — A Familiar Scenario (1 min)

Walk the checklist fast. Land on:

"Three weeks in. Wrong eligibility date. No error. No alert. CI: still green. Nobody tested whether the answer was right."


SLIDE 3 — Three Broken Assumptions (1 min)

"Three assumptions you made when you wrote your first AI test — all wrong. Same input, different outputs. Pass/fail doesn't apply to a 0–1 score. Your fixtures don't cover what the model doesn't know. Your tests passed. Your AI was still wrong."


SLIDE 4 — EDD: Define → Measure → Iterate (1 min)

"The fix is a methodology shift. EDD — same discipline as TDD, different primitives. Set your threshold before you write production code. Measure on every PR. Iterate on scores, not gut feel."


SLIDE 5 — EvalHub Introduction (45 sec)

"EDD needs a tool. Meet EvalHub. The key feature for this demo: Bring Your Own Provider. You write the scorer. EvalHub wraps it with CI gates, MLflow tracking, and Kubernetes scale. We'll see that live."


SLIDE 6 — How EvalHub Fits Together (30 sec)

"Three layers. Same code, three environments. Laptop: Python and uv, instant feedback. Local pod: EvalHub server plus your adapter sidecar, CI gate enforced. OpenShift: same pod.yaml, the adapter runs as a Kubernetes Job. The evalhub.yaml doesn't change between any of them."


SLIDE 7 — Three Scorers (45 sec)

"Three scorers — each catches a different failure mode, all scored 0 to 1. Consistency: same question, five runs, Jaccard similarity — threshold 0.50. Catches non-determinism — users getting different answers. Domain F1: token overlap with ground truth — threshold 0.80. Catches wrong dates, wrong limits, missing facts. Confabulation: fraction of answers with NO hallucinated facts — threshold 0.90. The strictest threshold, because invented facts cause direct harm. Act 2 runs these locally. Act 3 registers them as a named EvalHub collection."

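If someone asks how the three scores are computed, the sketch below approximates them in plain Python. It is a minimal illustration, not the code in providers/hr_rag_adapter.py: the tokenizer, the pairwise averaging for consistency, and the rule that a "hallucinated fact" is a date absent from the source document are all assumptions made for this example.

```python
import re
from collections import Counter
from itertools import combinations

MONTHS = "january|february|march|april|may|june|july|august|september|october|november|december"
DATE_RE = re.compile(rf"\b({MONTHS})\s+(\d{{1,2}})\b")


def tokens(text: str) -> list[str]:
    """Lowercase alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two answers' token sets."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def consistency(answers: list[str]) -> float:
    """Mean pairwise Jaccard similarity across repeated runs of one question."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def token_f1(answer: str, truth: str) -> float:
    """Token-overlap F1 between an answer and the ground truth."""
    pred, gold = Counter(tokens(answer)), Counter(tokens(truth))
    overlap = sum((pred & gold).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


def confabulation(answers: list[str], document: str) -> float:
    """Fraction of answers whose dates all appear in the source document."""
    allowed = set(DATE_RE.findall(document.lower()))
    if not answers:
        return 1.0
    clean = sum(1 for a in answers if set(DATE_RE.findall(a.lower())) <= allowed)
    return clean / len(answers)
```

With a document that says "Open enrollment runs from November 1 to November 15." and three answers closing on November 15, November 30, and October 31, only the first answer is free of invented dates, so this sketch scores confabulation at one third.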

SLIDE 8 — Prerequisites (30 sec)

Show QR. Pause.

"If you want to follow along: scan the QR, clone the repo. Two commands: uv sync and podman play kube pod.yaml. No Kubernetes cluster. No cloud account. If you're watching — you'll see every output."


LIVE DEMO (~8 min, no slides)

Switch to terminal. Keep slides hidden.


ACT 1 — The TDD Illusion (1.5 min)

uv run python act1_green_ci.py

Narrate while it runs:

"We have an HR policy RAG pipeline. We have tests. Let's run them."

When tests go green:

"All three pass. CI would mark this build green and deploy."

When answer table appears:

"Same question, three calls. Look at the answers — some have 'October 31', some have 'November 30'. The policy says November 1–15. Not one test asked if the date was right."

Pause. Let it land.
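
If the audience asks why the tests stayed green, a short contrast helps. The sketch below is hypothetical and is not the test suite in act1_green_ci.py: it assumes the green tests only check the shape of the answer, and it assumes the expected closing date is November 15, taken from the November 1–15 window quoted above.

```python
import re


def structural_checks(answer: str) -> bool:
    """Shape-only checks of the kind a conventional test suite makes."""
    return (
        isinstance(answer, str)
        and 0 < len(answer) < 500
        and "enrollment" in answer.lower()
    )


def factual_check(answer: str, expected_date: str = "November 15") -> bool:
    """True only if the answer states the expected date as a whole token."""
    pattern = rf"\b{re.escape(expected_date)}\b"
    return re.search(pattern, answer) is not None


# A wrong answer of the kind Act 1 surfaces.
wrong_answer = "Open enrollment closes on November 30."

passes_ci = structural_checks(wrong_answer)   # shape is fine
is_correct = factual_check(wrong_answer)      # the date is not
```

The wrong answer satisfies every structural check and fails the only check that looks at the date, which is the gap Act 2 goes on to measure.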


ACT 2 — The Reveal (2 min)

uv run python act2_reveal.py

Narrate:

"Same pipeline. Three local scorers. No EvalHub server — just Python. This runs on any laptop in under five seconds."

When table appears — point to red rows:

"Red rows: hallucinated dates. The confabulation scorer caught them. Domain F1 around 0.42 average — answers are missing or wrong facts. Consistency Jaccard around 0.31 — same question, wildly different answers each run."

When FAILING panel appears:

"Three scorers. All failing. Same codebase that passed CI five seconds ago."

Interactive moment:

"If you ran it on your machine — what scores are you seeing? [pause] The numbers will vary — that's the point. Non-determinism is real."

"Act 2 is your fast local feedback loop. No infra. Write the scorer, run it, know immediately. Now watch what happens when we add EvalHub."


ACT 3 — EDD in Action (3 min)

Show the running pod briefly:

podman ps --filter name=evalhub

"That pod has two containers: EvalHub server on 8080, and our adapter sidecar watching a shared directory for job specs. Both share localhost — that's the pod's network namespace. One command started both: podman play kube pod.yaml."

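For a technical audience, the "sidecar watching a shared directory" idea can be sketched as a simple polling function. Everything specific here is an assumption for illustration: the JSON file format, the .json and .claimed suffixes, and the function name are hypothetical and are not taken from the adapter image.

```python
import json
from pathlib import Path


def claim_pending_jobs(spec_dir: Path) -> list[dict]:
    """Load every unclaimed job spec in spec_dir, marking each as claimed.

    A spec is claimed by renaming it from *.json to *.claimed, so a second
    call does not pick the same job up again.
    """
    jobs = []
    for path in sorted(spec_dir.glob("*.json")):
        claimed = path.with_suffix(".claimed")
        path.rename(claimed)
        jobs.append(json.loads(claimed.read_text()))
    return jobs
```

A real sidecar would call something like this in a loop with a short sleep, run the scorers for each spec, and post results back to the server over localhost.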
uv run python act3_edd_collection.py

Narrate Step 1 (provider registration):

"Step 1: register our adapter as a Bring Your Own Provider. providers/hr_rag_adapter.py — the same scorer code from Act 2, now registered as a first-class EvalHub provider."

Narrate Step 2 (collection creation):

"Step 2: create collection hr-rag-eval-v1. Three benchmarks defined once — Consistency at weight 0.5, Domain F1 and Confabulation at full weight. Weight 0.5 on consistency says: some phrasing variation is acceptable. Full weight on the other two says: wrong facts and hallucinations are not. Every future CI job references this collection by ID — no benchmark listing, no threshold duplication."

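To make the weights concrete, the collection can be pictured as three small records. The weights and thresholds below come from this guide; the field names and the weighted-mean roll-up are assumptions for illustration, since the guide does not say how EvalHub combines weights into a single number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    name: str
    weight: float      # relative importance within the collection
    threshold: float   # minimum acceptable score, 0 to 1


# hr-rag-eval-v1 as described in Slide 7 and Act 3 of this guide.
HR_RAG_EVAL_V1 = (
    Benchmark("consistency", weight=0.5, threshold=0.50),
    Benchmark("domain_f1", weight=1.0, threshold=0.80),
    Benchmark("confabulation", weight=1.0, threshold=0.90),
)


def weighted_score(scores: dict[str, float],
                   benchmarks: tuple[Benchmark, ...] = HR_RAG_EVAL_V1) -> float:
    """Weighted mean of benchmark scores (assumed aggregation rule)."""
    total_weight = sum(b.weight for b in benchmarks)
    return sum(b.weight * scores[b.name] for b in benchmarks) / total_weight
```

Under this assumed roll-up, a drop in consistency moves the overall number half as much as the same drop in Domain F1 or Confabulation, which matches the talking point about phrasing variation being more tolerable than wrong facts.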
When YAML displays:

"This is evalhub.yaml. Three sections: provider — your adapter. collection — your benchmarks with weights and thresholds, defined once. job — just a model URL and a collection ID. That's the whole thing."

When job accepted:

"Job ID issued and logged. The adapter sidecar will pick up the spec, run the same scorers you just saw, and report results back via localhost. The scores on screen are representative; the gate decision itself is deterministic: any score below its threshold means BLOCKED."

When BLOCKED panel appears:

"All three below threshold. BLOCKED. Not by a human reviewer — by the CI gate reading EvalHub's verdict. The collection defines the bar. Your pipeline enforces it. Every run."


ACT 4 — Production Scale (1.5 min)

uv run python act4_production.py

Narrate:

"Act 4. Same evalhub.yaml. No changes."

Point to the K8s Job YAML:

"On OpenShift, EvalHub dispatches the adapter as a Kubernetes Job. The image is pre-built for linux/amd64 and linux/arm64: quay.io/wcabanba/hr-rag-adapter:latest — anyone can pull it."

Point to the pod ps output:

"That's our local pod. On OpenShift it's a Deployment. Same image. Same collection. Same thresholds."

When 1,000-sample results appear:

"1,000 samples instead of 5. All three still failing. Consistency 0.32: non-determinism at scale. Domain F1 0.62, well short of the 0.80 bar: answers are still missing or misstating facts. Confabulation 0.71: 29% of answers invent dates not in the document."

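The gate logic behind these numbers is simple enough to show in a few lines. This is a sketch under stated assumptions, not EvalHub's implementation: the thresholds and scores are the ones quoted in this guide, while the function name and the "PASSED" label are made up for the example.

```python
# Thresholds from Slide 7; scores from the 1,000-sample run narrated above.
THRESHOLDS = {"consistency": 0.50, "domain_f1": 0.80, "confabulation": 0.90}
ACT4_SCORES = {"consistency": 0.32, "domain_f1": 0.62, "confabulation": 0.71}


def gate(scores: dict[str, float],
         thresholds: dict[str, float] = THRESHOLDS) -> tuple[str, list[str]]:
    """Return the verdict and the list of benchmarks scoring below threshold."""
    failing = [name for name, minimum in thresholds.items()
               if scores[name] < minimum]
    return ("BLOCKED" if failing else "PASSED", failing)


verdict, failing = gate(ACT4_SCORES)

# Confabulation is the fraction of answers with no hallucinated facts,
# so its complement is the share of answers that do contain one.
hallucinating_share = round(1 - ACT4_SCORES["confabulation"], 2)
```

A score exactly at its threshold passes in this sketch, following the "below threshold means BLOCKED" wording used in Act 3.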
When MLflow panel appears:

"Every run logged. Run ID, model version, scores, hardware context. Debug any result from any commit. Audit trail for compliance. This is what 'reproducible' means in EDD."

Closing line:

"Three commands. uv sync. podman play kube pod.yaml. uv run python act3_edd_collection.py. Laptop. Local pod. OpenShift. Same code. Same gate. Different scale. That is EDD."


Takeaways Slide (30 sec)

"Three things: One — traditional QA is necessary but not sufficient. The gap is structural. Two — EDD is the methodology: Define, Measure, Iterate on scores. Three — EvalHub makes it operational: Bring Your Own Provider, named collections, same code from laptop to OpenShift. Scan the QR."


Closing (15 sec)

"EvalHub. Because 'looks good to me' is not a benchmark. Questions?"


Recovery Playbook

What broke                              Fix
──────────────────────────────────────  ──────────────────────────────────────
Pod won't start                         ./scripts/reset-demo.sh
EvalHub not healthy                     Wait 10 sec, retry curl localhost:8080/api/v1/health
Act 1 shows no wrong dates              Re-run; mock answers are random, and Act 2 always surfaces confabulation
Act 3 collection creation fails         Check pod is running: podman ps --filter name=evalhub
evalhub providers list returns nothing  Run uv run evalhub config set tenant "" then retry
Demo computer crashes                   Flip to hidden backup slides 9–12; they contain all expected output
No network                              Acts 1–2 run fully offline. Acts 3–4 need the local pod (pre-pulled images work offline)
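
The "wait 10 sec, retry" fix for an unhealthy EvalHub can be scripted so you are not retyping curl on stage. This is a minimal sketch: the URL is the health endpoint from the table above, but treating an HTTP 200 response as "healthy" is an assumption, and the function names are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

HEALTH_URL = "http://localhost:8080/api/v1/health"


def http_ok(url: str = HEALTH_URL) -> bool:
    """True if the endpoint answers with HTTP 200 (assumed to mean healthy)."""
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(check: Callable[[], bool] = http_ok,
                       attempts: int = 6,
                       delay_s: float = 10.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call check() up to `attempts` times, sleeping delay_s between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

Calling wait_until_healthy() with the defaults polls for up to about a minute before giving up; the check and sleep parameters exist so the loop can be exercised without a running pod.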

Audience Interaction Points

Moment                      What to say
──────────────────────────  ──────────────────────────────────────
Act 1, after green CI       "Who would have shipped this?" [pause]
Act 2, table appears        "What scores are you seeing on your laptop?"
Act 2, FAILING panel        "Before EvalHub existed, what would your team have done here?"
Act 3, collection created   "Every future CI run is now one line: --collection hr-rag-eval-v1"
Act 3, weights displayed    "Half-weight on consistency: some phrasing variation is OK. Full weight on facts and hallucinations: those are never OK."
Act 4, MLflow panel         "Who here has to do compliance audits? This is your paper trail."