docs: define LLM observability eval requirements #1
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,51 @@ | ||
| # Crosswalk: LLM observability and evaluation platform requirements to prior inventory | ||
|
Member
I see many criteria in this file that come from the LLM gateway part, not the obs/eval part. |
||
|
|
||
| This document maps the criteria in [llm-observability-and-evaluation-platform-requirements.md](llm-observability-and-evaluation-platform-requirements.md) to the prior inventory. | ||
|
|
||
| Rows are grouped by relationship: equivalent mappings first, then the remaining relationships, then newly added criteria. | ||
|
|
||
| | ID | New criterion | Legacy item(s) | Relationship | Change note | | ||
| |---|---|---|---|---| | ||
| | R-502 | Conditional scorer application | Conditional judge selection | Equivalent | Keeps conditional activation as a direct requirement while broadening it beyond judges. | | ||
| | R-603 | Parameterized sampling and slicing | Parameterized sampling | Equivalent | Preserves controlled sampling as a direct requirement. | | ||
| | R-604 | Ground-truth annotation workflows | Ground-truth annotations | Equivalent | Keeps human annotation and review as a direct requirement. | | ||
| | R-701 | Evaluation run browser | Run browser | Equivalent | Preserves the need for run-level browsing and inspection. | | ||
| | R-702 | Dataset browser and item inspection | Dataset browser | Equivalent | Preserves dataset browsing while making item-level inspection explicit. | | ||
| | R-901 | Role-based access control | RBAC | Equivalent | Preserves granular role control as a direct requirement. | | ||
| | R-902 | Enterprise identity integration | SSO | Equivalent | Preserves enterprise identity integration as a direct requirement. | | ||
| | R-802 | Shopper simulator compatibility | Conversation harness, Simulated shopper | Equivalent | Requires building an equivalent of today's shopper simulator and evaluating the resulting conversations without the vendor owning the simulation harness. | | ||
| | R-101 | Full-fidelity LLM interaction capture | Per-call request logging (PromptLayer) | Broadens | Reframed away from one logging product and expanded to cover the full execution record (including prompt version and label) needed for debugging and evaluation. | | ||
| | R-102 | Searchable call inspection | Per-call request logging (PromptLayer), Request log storage + search UI | Broadens | Makes inspection and searchability explicit instead of implying them through logging alone; also covers the legacy request-log search UI. | | ||
| | R-303 | Prompt-linked execution configuration | Fetch prompts parameters | Broadens | Reframes a list of current parameters as the broader need to store the executable prompt contract alongside the prompt. | | ||
| | R-401 | Prompt authoring and review UI | Web UI | Broadens | Reframes a generic web UI requirement as a concrete prompt-authoring and review surface. | | ||
| | R-501 | Custom evaluation logic | LLM judges, Reports / scoring | Broadens | Reframes current judges and native reports/scoring as the broader requirement for expressive custom scoring and evaluation execution, including LLM-based and deterministic evaluators. | | ||
| | R-601 | Trace- and log-derived dataset creation | Scenario heuristics, Datasets | Broadens | Reframes dbt-based scenario extraction and dataset creation from request logs as the broader need to build datasets from historical records and traces. | | ||
| | R-605 | Dataset versioning and reproducibility | Dataset versioning | Broadens | Reframes UUIDs and GCS config storage as the broader need for reproducible versioned datasets and run inputs. | | ||
| | R-703 | Record-level evaluation inspection | Per-call inspector | Broadens | Extends per-call inspection to the broader need for record-level evaluation debugging. | | ||
| | R-104 | Trace and application-context correlation | Datadog traces | Generalizes | Reframed from one tracing product to the broader need to correlate LLM calls with application context. | | ||
| | R-105 | Operational metrics surface | Datadog metrics, Analytics dashboards | Generalizes | Reframed from one metrics sink and dashboarding into the broader need for operational metrics visibility, aggregation, and trends. | | ||
| | R-202 | Exportable metrics and aggregates | Datadog metrics | Generalizes | Reframed from a Datadog integration requirement to a broader interoperability requirement. | | ||
| | R-304 | Prompt version targeting and release channels | Per-request prompt override, Release labels | Generalizes | Merges controlled runtime override with movable release labels without assuming override must use HTTP headers. | | ||
| | R-305 | Expressive templating | Jinja2 templating | Generalizes | Preserves the need for dynamic templating while avoiding lock-in to one templating engine. | | ||
| | R-306 | Structured-output contract association | Structured-output schema validation | Generalizes | Focuses on managing and retrieving the output contract with the prompt, regardless of where validation executes. | | ||
| | R-307 | Workspace and folder management | Per domain workspace, Multi-workspace | Generalizes | Reframes prompt isolation as operational management of large prompt sets across workspaces and folders. | | ||
|
diephil marked this conversation as resolved.
|
||
| | R-403 | Prompt organization and discoverability | Hierarchical organization | Generalizes | Preserves organizational structure without prescribing one hierarchy model. | | ||
| | R-404 | Prompt experimentation controls | A/B testing | Generalizes | Keeps the need for controlled prompt experimentation without prescribing one traffic-routing design. | | ||
|
Member
I would argue that A/B testing is a more recognisable term.
Member
Author
To be taken into account. |
||
| | R-602 | Filterable and query-driven dataset selection | Scenario heuristics | Generalizes | Keeps the need for systematic scenario selection while removing dependence on dbt TVFs or SQL implementation details. | | ||
| | R-704 | Cross-entity lineage navigation | Prompt-registry deep-links | Generalizes | Reframes a vendor-specific deep link into the broader need to navigate across related evaluation and prompt entities. | | ||
|
Member
Not sure I understand the rewriting on this one. The goal is to ensure that we can provide a deep link to a specific object in the tool, as this is something we commonly do and would struggle without.
Member
OK, I saw your comments above about this. Sibling topics.
Member
Author
Maybe add a "point of attention" section that's separate from the requirement list? Could be a separate QA doc? @r0ots |
||
| | R-903 | Workspace-scoped credentials and access boundaries | API keys per workspace | Generalizes | Preserves isolation of programmatic credentials while allowing vendors to implement it with different auth models. | | ||
| | R-301 | Versioned prompt registry | Hosted prompt storage, Immutable versioning, Fetch prompts | Merges and broadens | Reframes server-side prompt storage, immutable publish semantics, and retrieval as one core versioned-registry requirement. | | ||
| | R-302 | Runtime prompt resolution | Fetch prompts, Per-request prompt override | Merges and generalizes | Combines baseline retrieval and runtime targeting into a broader runtime-resolution requirement. | | ||
| | R-402 | Reusable prompt components | Snippet versioning & labels, Snippet expansion (server-side), Snippet sharing across prompts, Snippet auto-propagation | Merges and generalizes | Reframes vendor-specific snippet behavior as the broader need for centrally managed reusable prompt components. | ||
|
Member
Again, I will cite my bad experience with Basalt on snippets, which led me to make this requirement explicit: their snippet management was bad.
|
||
| | R-405 | Programmatic prompt-management interfaces | REST API, MCP, Python SDK | Merges and generalizes | Reframes several specific access surfaces as the broader need for programmable interfaces across application and agent contexts. | | ||
| | R-503 | Domain-metric expressiveness | Outcome metrics, Retrieval precision / recall, Multilabel macro precision / recall | Merges and generalizes | Reframes current business and task metrics as examples of broader custom metric expressiveness. | | ||
| | R-801 | Programmatic and CI evaluation execution | GitHub Actions workflows, Local runner, Persistence | Merges and generalizes | Merges external orchestration with programmatic run and result access: run evaluations in CI, from the CLI, or programmatically, and retrieve runs and outputs. | | ||
| | R-201 | Exportable raw observability data | Per-call request logging (BigQuery) | Replaces | Reframed from a specific warehouse sink to the broader requirement for raw-data export. | | ||
| | None | No direct requirement carried forward | Sentry error capture | Dropped | Treated as application-level error tracking rather than a platform selection criterion for LLM observability. | | ||
| | None | No direct requirement carried forward | OpenTelemetry tracing, Webhooks | Dropped | Treated as non-core integration surfaces for this evaluation unless a later procurement stage elevates them. | | ||
| | R-103 | Conversation and workflow grouping | None explicit | New | Added because isolated call logs are not enough for agent and support workflows. | | ||
| | R-106 | Retention and queryability | None explicit | New | Added because data usefulness depends on retention window and practical query access. | | ||
| | R-107 | Access control and sensitive-data handling | None explicit | New | Added because observability data can contain sensitive content and must be governed safely. | | ||
| | R-203 | ADK and Temporal tracing compatibility | None explicit | New | Added to require compatibility with Gorgias's custom ADK on Temporal, with instrumentation driven primarily from TypeScript rather than assuming a Python-first or vendor-owned runtime. | | ||
| | R-705 | UI-configurable experiment parameters | None explicit | New | Added so engineers can tune model parameters and hyperparameters for an offline evaluation from the UI without committing code. | | ||
| | R-706 | Domain-expert review usability | None explicit | New | Added to require that non-technical domain experts (mostly product managers) can directly participate in review, annotation, and evaluation workflows without relying on code-centric tooling. | | ||
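The grouping rule stated above the table (equivalent mappings first, then the remaining relationships, then newly added criteria) could be checked mechanically when the crosswalk is edited. The following is a minimal sketch, not part of the PR: the `ROWS` sample is a handful of (ID, relationship) pairs copied from the table, and the helper names `relationship_blocks` and `check_grouping` are hypothetical.

```python
from itertools import groupby

# A few (ID, relationship) pairs copied from the crosswalk table, in table order.
# Dropped rows have no ID in the table, so None is used here as a placeholder.
ROWS = [
    ("R-502", "Equivalent"),
    ("R-901", "Equivalent"),
    ("R-101", "Broadens"),
    ("R-104", "Generalizes"),
    ("R-301", "Merges and broadens"),
    ("R-201", "Replaces"),
    (None, "Dropped"),
    ("R-103", "New"),
]


def relationship_blocks(rows):
    """Return the relationship labels in order, collapsing consecutive repeats."""
    return [label for label, _ in groupby(rows, key=lambda row: row[1])]


def check_grouping(rows):
    """Return a list of problems; an empty list means the ordering rule holds."""
    blocks = relationship_blocks(rows)
    problems = []
    # Each relationship should form exactly one contiguous block.
    if len(blocks) != len(set(blocks)):
        problems.append("a relationship appears in more than one block")
    if blocks and blocks[0] != "Equivalent":
        problems.append("Equivalent rows are not first")
    if "New" in blocks and blocks[-1] != "New":
        problems.append("New rows are not last")
    return problems


if __name__ == "__main__":
    print(check_grouping(ROWS) or "grouping rule holds")
```

A check like this would only cover row ordering; it says nothing about whether each relationship label is the right one for a given criterion, which remains a review question.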