2 changes: 1 addition & 1 deletion README.md
@@ -32,7 +32,7 @@ Prompts currently live in PromptLayer with versioning. PromptLayer is being full

### Evaluation

- Online evaluation exists but is built in-house and poorly implemented. There is no reliable offline evaluation pipeline. Evaluation datasets are generated dynamically from production traces using heuristics, which results in variable quality. There is no golden dataset. CI-gating evaluation does not exist in a proper form — only some GitHub Actions making ad hoc calls.
+ Online evaluation exists but is built in-house, and the implementation was rushed with one specific goal in mind, which makes it too inflexible to meet the needs of any of our personas (engineers, PMs, MLAs). There is no reliable offline evaluation pipeline. Evaluation datasets are generated dynamically from production traces using heuristics, which results in variable quality. There is no golden dataset. CI-gating evaluation does not exist in a proper form — only some GitHub Actions making ad hoc calls.

### Tracing

@@ -0,0 +1,51 @@
# Crosswalk: LLM observability and evaluation platform requirements to prior inventory
Comment thread
r0ots marked this conversation as resolved.
Member

I see many criteria in this file that are coming from the LLM gateway part, not the obs/eval


This document maps the criteria in [llm-observability-and-evaluation-platform-requirements.md](llm-observability-and-evaluation-platform-requirements.md) to the prior inventory.

Rows are grouped by relationship: equivalent mappings first, then the remaining relationships, then newly added criteria.

| ID | New criterion | Legacy item(s) | Relationship | Change note |
|---|---|---|---|---|
| R-502 | Conditional scorer application | Conditional judge selection | Equivalent | Keeps conditional activation as a direct requirement while broadening it beyond judges. |
| R-603 | Parameterized sampling and slicing | Parameterized sampling | Equivalent | Preserves controlled sampling as a direct requirement. |
| R-604 | Ground-truth annotation workflows | Ground-truth annotations | Equivalent | Keeps human annotation and review as a direct requirement. |
| R-701 | Evaluation run browser | Run browser | Equivalent | Preserves the need for run-level browsing and inspection. |
| R-702 | Dataset browser and item inspection | Dataset browser | Equivalent | Preserves dataset browsing while making item-level inspection explicit. |
| R-901 | Role-based access control | RBAC | Equivalent | Preserves granular role control as a direct requirement. |
| R-902 | Enterprise identity integration | SSO | Equivalent | Preserves enterprise identity integration as a direct requirement. |
| R-802 | Shopper simulator compatibility | Conversation harness, Simulated shopper | Equivalent | Requires building an equivalent of today's shopper simulator and evaluating the resulting conversations without the vendor owning the simulation harness. |
| R-101 | Full-fidelity LLM interaction capture | Per-call request logging (PromptLayer) | Broadens | Reframed away from one logging product and expanded to cover the full execution record (including prompt version and label) needed for debugging and evaluation. |
| R-102 | Searchable call inspection | Per-call request logging (PromptLayer), Request log storage + search UI | Broadens | Makes inspection and searchability explicit instead of implying them through logging alone; also covers the legacy request-log search UI. |
| R-303 | Prompt-linked execution configuration | Fetch prompts parameters | Broadens | Reframes a list of current parameters as the broader need to store the executable prompt contract alongside the prompt. |
| R-401 | Prompt authoring and review UI | Web UI | Broadens | Reframes a generic web UI requirement as a concrete prompt-authoring and review surface. |
| R-501 | Custom evaluation logic | LLM judges, Reports / scoring | Broadens | Reframes current judges and native reports/scoring as the broader requirement for expressive custom scoring and evaluation execution, including LLM-based and deterministic evaluators. |
| R-601 | Trace- and log-derived dataset creation | Scenario heuristics, Datasets | Broadens | Reframes dbt-based scenario extraction and dataset creation from request logs as the broader need to build datasets from historical records and traces. |
| R-605 | Dataset versioning and reproducibility | Dataset versioning | Broadens | Reframes UUIDs and GCS config storage as the broader need for reproducible versioned datasets and run inputs. |
| R-703 | Record-level evaluation inspection | Per-call inspector | Broadens | Extends per-call inspection to the broader need for record-level evaluation debugging. |
| R-104 | Trace and application-context correlation | Datadog traces | Generalizes | Reframed from one tracing product to the broader need to correlate LLM calls with application context. |
| R-105 | Operational metrics surface | Datadog metrics, Analytics dashboards | Generalizes | Reframed from one metrics sink and dashboarding into the broader need for operational metrics visibility, aggregation, and trends. |
| R-202 | Exportable metrics and aggregates | Datadog metrics | Generalizes | Reframed from a Datadog integration requirement to a broader interoperability requirement. |
| R-304 | Prompt version targeting and release channels | Per-request prompt override, Release labels | Generalizes | Merges controlled runtime override with movable release labels without assuming override must use HTTP headers. |
| R-305 | Expressive templating | Jinja2 templating | Generalizes | Preserves the need for dynamic templating while avoiding lock-in to one templating engine. |
| R-306 | Structured-output contract association | Structured-output schema validation | Generalizes | Focuses on managing and retrieving the output contract with the prompt, regardless of where validation executes. |
| R-307 | Workspace and folder management | Per domain workspace, Multi-workspace | Generalizes | Reframes prompt isolation as operational management of large prompt sets across workspaces and folders. |
Comment thread
diephil marked this conversation as resolved.
| R-403 | Prompt organization and discoverability | Hierarchical organization | Generalizes | Preserves organizational structure without prescribing one hierarchy model. |
| R-404 | Prompt experimentation controls | A/B testing | Generalizes | Keeps the need for controlled prompt experimentation without prescribing one traffic-routing design. |
Member

I would argue that A/B testing is a more recognisable term.

Member Author

Noted, will take this into account.

| R-602 | Filterable and query-driven dataset selection | Scenario heuristics | Generalizes | Keeps the need for systematic scenario selection while removing dependence on dbt TVFs or SQL implementation details. |
| R-704 | Cross-entity lineage navigation | Prompt-registry deep-links | Generalizes | Reframes a vendor-specific deep link into the broader need to navigate across related evaluation and prompt entities. |
Member

Not sure I understand the rewriting on this one.

The goal is to ensure that we can provide a deep link to a specific object in the tool, as this is something we commonly do and would struggle without.

Member

OK, I saw your comments above about this. These are sibling topics.
Your point: we need to be able to navigate easily within the tool to any component without going through a million steps.
My point: we need to be able to share URLs that point directly to objects, since it is very common to say "hey, look at this specific execution."

Member Author

Maybe add a "points of attention" section that's separate from the requirement list? It could be a separate QA doc? @r0ots

| R-903 | Workspace-scoped credentials and access boundaries | API keys per workspace | Generalizes | Preserves isolation of programmatic credentials while allowing vendors to implement it with different auth models. |
| R-301 | Versioned prompt registry | Hosted prompt storage, Immutable versioning, Fetch prompts | Merges and broadens | Reframes server-side prompt storage, immutable publish semantics, and retrieval as one core versioned-registry requirement. |
| R-302 | Runtime prompt resolution | Fetch prompts, Per-request prompt override | Merges and generalizes | Combines baseline retrieval and runtime targeting into a broader runtime-resolution requirement. |
| R-402 | Reusable prompt components | Snippet versioning & labels, Snippet expansion (server-side), Snippet sharing across prompts, Snippet auto-propagation | Merges and generalizes | Reframes vendor-specific snippet behavior as the broader need for centrally managed reusable prompt components. |
Member

Again, I will cite my bad experience with Basalt on snippets, which is what led me to make this requirement explicit: their snippet management was bad.

  • They had a kind of organisation called "features" (kinda like a folder, but everything is flat in it), and snippets were only shareable within the same feature. So during the pilot, I was forced to put all prompts in the same feature to be able to share snippets across all of them.
  • Snippets had versions, but they were hidden. You could only edit the "current" version of the snippet, but prompts would not be updated automatically. Each prompt would keep the last version of the snippet it had been updated to, and you would have to go through the prompts one by one and manually update the snippet.
    Unusable...
    So it's not only about having reusable components, but also about the feature working well 😓

| R-405 | Programmatic prompt-management interfaces | REST API, MCP, Python SDK | Merges and generalizes | Reframes several specific access surfaces as the broader need for programmable interfaces across application and agent contexts. |
| R-503 | Domain-metric expressiveness | Outcome metrics, Retrieval precision / recall, Multilabel macro precision / recall | Merges and generalizes | Reframes current business and task metrics as examples of broader custom metric expressiveness. |
| R-801 | Programmatic and CI evaluation execution | GitHub Actions workflows, Local runner, Persistence | Merges and generalizes | Merges external orchestration with programmatic run and result access: run evaluations in CI, from the CLI, or programmatically, and retrieve runs and outputs. |
| R-201 | Exportable raw observability data | Per-call request logging (BigQuery) | Replaces | Reframed from a specific warehouse sink to the broader requirement for raw-data export. |
| None | No direct requirement carried forward | Sentry error capture | Dropped | Treated as application-level error tracking rather than a platform selection criterion for LLM observability. |
| None | No direct requirement carried forward | OpenTelemetry tracing, Webhooks | Dropped | Treated as non-core integration surfaces for this evaluation unless a later procurement stage elevates them. |
| R-103 | Conversation and workflow grouping | None explicit | New | Added because isolated call logs are not enough for agent and support workflows. |
| R-106 | Retention and queryability | None explicit | New | Added because data usefulness depends on retention window and practical query access. |
| R-107 | Access control and sensitive-data handling | None explicit | New | Added because observability data can contain sensitive content and must be governed safely. |
| R-203 | ADK and Temporal tracing compatibility | None explicit | New | Added to require compatibility with Gorgias's custom ADK on Temporal, with instrumentation driven primarily from TypeScript rather than assuming a Python-first or vendor-owned runtime. |
| R-705 | UI-configurable experiment parameters | None explicit | New | Added so engineers can tune model parameters and hyperparameters for an offline evaluation from the UI without committing code. |
| R-706 | Domain-expert review usability | None explicit | New | Added to require that non-technical domain experts (mostly product managers) can directly participate in review, annotation, and evaluation workflows without relying on code-centric tooling. |
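
To make the intent of R-304 (release labels) and R-402 (reusable components with auto-propagation) more concrete, here is a minimal sketch of label-based resolution. It is an illustration under stated assumptions, not any vendor's API: `Registry`, `publish`, `resolve`, `render`, the `prod` label, and the `{{snippet:NAME}}` reference syntax are all hypothetical names chosen for this example.

```python
import re
from dataclasses import dataclass, field

# Hypothetical reference syntax for a reusable component inside a prompt body.
SNIPPET_REF = re.compile(r"\{\{snippet:([\w-]+)\}\}")


@dataclass
class Registry:
    """Hypothetical in-memory registry: immutable versions plus movable labels."""

    versions: dict = field(default_factory=dict)  # name -> list of bodies
    labels: dict = field(default_factory=dict)    # (name, label) -> version number

    def publish(self, name: str, body: str, label: str = "prod") -> int:
        """Append a new immutable version and move the label onto it."""
        history = self.versions.setdefault(name, [])
        history.append(body)
        self.labels[(name, label)] = len(history)
        return len(history)

    def resolve(self, name: str, label: str = "prod") -> str:
        """Return the body of the version the label currently points to."""
        return self.versions[name][self.labels[(name, label)] - 1]


def render(prompt_body: str, snippets: Registry, label: str = "prod") -> str:
    """Expand snippet references at resolution time, so prompts follow the label."""
    return SNIPPET_REF.sub(lambda m: snippets.resolve(m.group(1), label), prompt_body)


snippets = Registry()
prompts = Registry()
snippets.publish("tone", "Be polite.")
prompts.publish("greeting", "{{snippet:tone}} Answer the shopper.")

before = render(prompts.resolve("greeting"), snippets)
snippets.publish("tone", "Be concise.")  # new snippet version, prompt untouched
after = render(prompts.resolve("greeting"), snippets)
```

Because the prompt stores a reference rather than a copy, publishing a new snippet version changes what `render` returns without editing the prompt, while earlier versions stay available in `versions`. A platform could equally satisfy the requirement with a different mechanism; this sketch only shows the observable behavior.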