gorgias · diephil · Jun 2, 2026 · Jun 2, 2026 · Jun 4, 2026 · Jun 4, 2026
diff --git a/README.md b/README.md
@@ -32,7 +32,7 @@ Prompts currently live in PromptLayer with versioning. PromptLayer is being full
 
 ### Evaluation
 
-Online evaluation exists but is built in-house and poorly implemented. There is no reliable offline evaluation pipeline. Evaluation datasets are generated dynamically from production traces using heuristics, which results in variable quality. There is no golden dataset. CI-gating evaluation does not exist in a proper form — only some GitHub Actions making ad hoc calls.
+Online evaluation exists but is built in-house, and the implementation was rushed toward a single specific goal, so it is not flexible enough to meet the needs of any of our personas (engineers, PMs, MLAs). There is no reliable offline evaluation pipeline. Evaluation datasets are generated dynamically from production traces using heuristics, which results in variable quality. There is no golden dataset. CI-gating evaluation does not exist in a proper form; there are only some GitHub Actions making ad hoc calls.
 
 ### Tracing
 

diff --git a/docs/llm-observability-and-evaluation-platform-requirements-crosswalk.md b/docs/llm-observability-and-evaluation-platform-requirements-crosswalk.md
@@ -0,0 +1,51 @@
+# Crosswalk: LLM observability and evaluation platform requirements to prior inventory
+
+This document maps the criteria in [llm-observability-and-evaluation-platform-requirements.md](llm-observability-and-evaluation-platform-requirements.md) to the prior inventory.
+
+Rows are grouped by relationship: equivalent mappings first, then the remaining relationships, then newly added criteria.
+
+| ID | New criterion | Legacy item(s) | Relationship | Change note |
+|---|---|---|---|---|
+| R-502 | Conditional scorer application | Conditional judge selection | Equivalent | Keeps conditional activation as a direct requirement while broadening it beyond judges. |
+| R-603 | Parameterized sampling and slicing | Parameterized sampling | Equivalent | Preserves controlled sampling as a direct requirement. |
+| R-604 | Ground-truth annotation workflows | Ground-truth annotations | Equivalent | Keeps human annotation and review as a direct requirement. |
+| R-701 | Evaluation run browser | Run browser | Equivalent | Preserves the need for run-level browsing and inspection. |
+| R-702 | Dataset browser and item inspection | Dataset browser | Equivalent | Preserves dataset browsing while making item-level inspection explicit. |
+| R-901 | Role-based access control | RBAC | Equivalent | Preserves granular role control as a direct requirement. |
+| R-902 | Enterprise identity integration | SSO | Equivalent | Preserves enterprise identity integration as a direct requirement. |
+| R-802 | Shopper simulator compatibility | Conversation harness, Simulated shopper | Equivalent | Requires building an equivalent of today's shopper simulator and evaluating the resulting conversations without the vendor owning the simulation harness. |
+| R-101 | Full-fidelity LLM interaction capture | Per-call request logging (PromptLayer) | Broadens | Reframed away from one logging product and expanded to cover the full execution record (including prompt version and label) needed for debugging and evaluation. |
+| R-102 | Searchable call inspection | Per-call request logging (PromptLayer), Request log storage + search UI | Broadens | Makes inspection and searchability explicit instead of implying them through logging alone; also covers the legacy request-log search UI. |
+| R-303 | Prompt-linked execution configuration | Fetch prompts parameters | Broadens | Reframes a list of current parameters as the broader need to store the executable prompt contract alongside the prompt. |
+| R-401 | Prompt authoring and review UI | Web UI | Broadens | Reframes a generic web UI requirement as a concrete prompt-authoring and review surface. |
+| R-501 | Custom evaluation logic | LLM judges, Reports / scoring | Broadens | Reframes current judges and native reports/scoring as the broader requirement for expressive custom scoring and evaluation execution, including LLM-based and deterministic evaluators. |
+| R-601 | Trace- and log-derived dataset creation | Scenario heuristics, Datasets | Broadens | Reframes dbt-based scenario extraction and dataset creation from request logs as the broader need to build datasets from historical records and traces. |
+| R-605 | Dataset versioning and reproducibility | Dataset versioning | Broadens | Reframes UUIDs and GCS config storage as the broader need for reproducible versioned datasets and run inputs. |
+| R-703 | Record-level evaluation inspection | Per-call inspector | Broadens | Extends per-call inspection to the broader need for record-level evaluation debugging. |
+| R-104 | Trace and application-context correlation | Datadog traces | Generalizes | Reframed from one tracing product to the broader need to correlate LLM calls with application context. |
+| R-105 | Operational metrics surface | Datadog metrics, Analytics dashboards | Generalizes | Reframed from one metrics sink and dashboarding into the broader need for operational metrics visibility, aggregation, and trends. |
+| R-202 | Exportable metrics and aggregates | Datadog metrics | Generalizes | Reframed from a Datadog integration requirement to a broader interoperability requirement. |
+| R-304 | Prompt version targeting and release channels | Per-request prompt override, Release labels | Generalizes | Merges controlled runtime override with movable release labels without assuming override must use HTTP headers. |
+| R-305 | Expressive templating | Jinja2 templating | Generalizes | Preserves the need for dynamic templating while avoiding lock-in to one templating engine. |
+| R-306 | Structured-output contract association | Structured-output schema validation | Generalizes | Focuses on managing and retrieving the output contract with the prompt, regardless of where validation executes. |
+| R-307 | Workspace and folder management | Per domain workspace, Multi-workspace | Generalizes | Reframes prompt isolation as operational management of large prompt sets across workspaces and folders. |
+| R-403 | Prompt organization and discoverability | Hierarchical organization | Generalizes | Preserves organizational structure without prescribing one hierarchy model. |
+| R-404 | Prompt experimentation controls | A/B testing | Generalizes | Keeps the need for controlled prompt experimentation without prescribing one traffic-routing design. |
+| R-602 | Filterable and query-driven dataset selection | Scenario heuristics | Generalizes | Keeps the need for systematic scenario selection while removing dependence on dbt TVFs or SQL implementation details. |
+| R-704 | Cross-entity lineage navigation | Prompt-registry deep-links | Generalizes | Reframes a vendor-specific deep link into the broader need to navigate across related evaluation and prompt entities. |
+| R-903 | Workspace-scoped credentials and access boundaries | API keys per workspace | Generalizes | Preserves isolation of programmatic credentials while allowing vendors to implement it with different auth models. |
+| R-301 | Versioned prompt registry | Hosted prompt storage, Immutable versioning, Fetch prompts | Merges and broadens | Reframes server-side prompt storage, immutable publish semantics, and retrieval as one core versioned-registry requirement. |
+| R-302 | Runtime prompt resolution | Fetch prompts, Per-request prompt override | Merges and generalizes | Combines baseline retrieval and runtime targeting into a broader runtime-resolution requirement. |
+| R-402 | Reusable prompt components | Snippet versioning & labels, Snippet expansion (server-side), Snippet sharing across prompts, Snippet auto-propagation | Merges and generalizes | Reframes vendor-specific snippet behavior as the broader need for centrally managed reusable prompt components. |
+| R-405 | Programmatic prompt-management interfaces | REST API, MCP, Python SDK | Merges and generalizes | Reframes several specific access surfaces as the broader need for programmable interfaces across application and agent contexts. |
+| R-503 | Domain-metric expressiveness | Outcome metrics, Retrieval precision / recall, Multilabel macro precision / recall | Merges and generalizes | Reframes current business and task metrics as examples of broader custom metric expressiveness. |
+| R-801 | Programmatic and CI evaluation execution | GitHub Actions workflows, Local runner, Persistence | Merges and generalizes | Merges external orchestration with programmatic run and result access: run evaluations in CI, from the CLI, or programmatically, and retrieve runs and outputs. |
+| R-201 | Exportable raw observability data | Per-call request logging (BigQuery) | Replaces | Reframed from a specific warehouse sink to the broader requirement for raw-data export. |
+| None | No direct requirement carried forward | Sentry error capture | Dropped | Treated as application-level error tracking rather than a platform selection criterion for LLM observability. |
+| None | No direct requirement carried forward | OpenTelemetry tracing, Webhooks | Dropped | Treated as non-core integration surfaces for this evaluation unless a later procurement stage elevates them. |
+| R-103 | Conversation and workflow grouping | None explicit | New | Added because isolated call logs are not enough for agent and support workflows. |
+| R-106 | Retention and queryability | None explicit | New | Added because data usefulness depends on retention window and practical query access. |
+| R-107 | Access control and sensitive-data handling | None explicit | New | Added because observability data can contain sensitive content and must be governed safely. |
+| R-203 | ADK and Temporal tracing compatibility | None explicit | New | Added to require compatibility with Gorgias's custom ADK on Temporal, with instrumentation driven primarily from TypeScript rather than assuming a Python-first or vendor-owned runtime. |
+| R-705 | UI-configurable experiment parameters | None explicit | New | Added so engineers can tune model parameters and hyperparameters for an offline evaluation from the UI without committing code. |
+| R-706 | Domain-expert review usability | None explicit | New | Added to require that non-technical domain experts (mostly product managers) can directly participate in review, annotation, and evaluation workflows without relying on code-centric tooling. |