diff --git a/CLAUDE.md b/CLAUDE.md
index 262a949..78d83f4 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -36,6 +36,12 @@ Each `candidates//` directory is a self-contained backend:
 - Playbook phases are numbered 0–10 matching `docs/ai-agent-playbook/`
 - Do not add content to `scratchpads/` unless it is personal working notes
 
+## Writing style
+
+- Write plainly. Prefer short, direct sentences.
+- Prefer simpler, common words over technical-sounding or lyrical ones.
+- If something can be said in fewer words, use fewer words.
+
 ---
 
 ## What not to do
diff --git a/README.md b/README.md
index 19845ba..0a8dfaa 100644
--- a/README.md
+++ b/README.md
@@ -32,7 +32,7 @@ Prompts currently live in PromptLayer with versioning. PromptLayer is being full
 
 ### Evaluation
 
-Online evaluation exists but is built in-house and poorly implemented. There is no reliable offline evaluation pipeline. Evaluation datasets are generated dynamically from production traces using heuristics, which results in variable quality. There is no golden dataset. CI-gating evaluation does not exist in a proper form — only some GitHub Actions making ad hoc calls.
+Online evaluation exists but is built in-house. The implementation was rushed toward one specific goal, so it is not flexible enough to meet the needs of any of our personas (engineers, PMs, MLAs). There is no reliable offline evaluation pipeline. Evaluation datasets are generated dynamically from production traces using heuristics, which results in variable quality. There is no golden dataset. CI-gating evaluation does not exist in a proper form — only some GitHub Actions making ad hoc calls.
 
 ### Tracing
 
diff --git a/docs/llm-observability-and-evaluation-platform-requirements-crosswalk.md b/docs/llm-observability-and-evaluation-platform-requirements-crosswalk.md
new file mode 100644
index 0000000..4e158d6
--- /dev/null
+++ b/docs/llm-observability-and-evaluation-platform-requirements-crosswalk.md
@@ -0,0 +1,51 @@
+# Crosswalk: LLM observability and evaluation platform requirements to prior inventory
+
+This document maps the criteria in [llm-observability-and-evaluation-platform-requirements.md](llm-observability-and-evaluation-platform-requirements.md) to the prior inventory.
+
+Rows are grouped by relationship: equivalent mappings first, then the remaining relationships, then newly added criteria.
+
+| ID | New criterion | Legacy item(s) | Relationship | Change note |
+|---|---|---|---|---|
+| R-502 | Conditional scorer application | Conditional judge selection | Equivalent | Keeps conditional activation as a direct requirement while broadening it beyond judges. |
+| R-603 | Parameterized sampling and slicing | Parameterized sampling | Equivalent | Preserves controlled sampling as a direct requirement. |
+| R-604 | Ground-truth annotation workflows | Ground-truth annotations | Equivalent | Keeps human annotation and review as a direct requirement. |
+| R-701 | Evaluation run browser | Run browser | Equivalent | Preserves the need for run-level browsing and inspection. |
+| R-702 | Dataset browser and item inspection | Dataset browser | Equivalent | Preserves dataset browsing while making item-level inspection explicit. |
+| R-901 | Role-based access control | RBAC | Equivalent | Preserves granular role control as a direct requirement. |
+| R-902 | Enterprise identity integration | SSO | Equivalent | Preserves enterprise identity integration as a direct requirement. |
+| R-802 | Shopper simulator compatibility | Conversation harness, Simulated shopper | Equivalent | Requires support for shopper simulation work, including building an equivalent of today's shopper simulator and evaluating the resulting conversations without the vendor owning the simulation harness. |
+| R-101 | Full-fidelity LLM interaction capture | Per-call request logging (PromptLayer) | Broadens | Reframed away from one logging product and expanded to cover the full execution record (including prompt version and label) needed for debugging and evaluation. |
+| R-102 | Searchable call inspection | Per-call request logging (PromptLayer), Request log storage + search UI | Broadens | Makes inspection and searchability explicit instead of implying them through logging alone; also covers the legacy request-log search UI. |
+| R-303 | Prompt-linked execution configuration | Fetch prompts parameters | Broadens | Reframes a list of current parameters as the broader need to store the executable prompt contract alongside the prompt. |
+| R-401 | Prompt authoring and review UI | Web UI | Broadens | Reframes a generic web UI requirement as a concrete prompt-authoring and review surface. |
+| R-501 | Custom evaluation logic | LLM judges, Reports / scoring | Broadens | Reframes current judges and native reports/scoring as the broader requirement for expressive custom scoring and evaluation execution, including LLM-based and deterministic evaluators. |
+| R-601 | Trace- and log-derived dataset creation | Scenario heuristics, Datasets | Broadens | Reframes dbt-based scenario extraction and dataset creation from request logs as the broader need to build datasets from historical records and traces. |
+| R-605 | Dataset versioning and reproducibility | Dataset versioning | Broadens | Reframes UUIDs and GCS config storage as the broader need for reproducible versioned datasets and run inputs. |
+| R-703 | Record-level evaluation inspection | Per-call inspector | Broadens | Extends per-call inspection to the broader need for record-level evaluation debugging. |
+| R-104 | Trace and application-context correlation | Datadog traces | Generalizes | Reframed from one tracing product to the broader need to correlate LLM calls with application context. |
+| R-105 | Operational metrics surface | Datadog metrics, Analytics dashboards | Generalizes | Reframed from one metrics sink and dashboarding into the broader need for operational metrics visibility, aggregation, and trends. |
+| R-202 | Exportable metrics and aggregates | Datadog metrics | Generalizes | Reframed from Datadog-specific metrics into the broader need to verify whether platform-defined metrics can be made available outside the platform, such as in analytics databases or external monitoring tools. |
+| R-304 | Prompt version targeting and release channels | Per-request prompt override, Release labels | Generalizes | Merges controlled runtime override with movable release labels without assuming override must use HTTP headers. |
+| R-305 | Expressive templating | Jinja2 templating | Generalizes | Preserves the need for dynamic templating while avoiding lock-in to one templating engine. |
+| R-306 | Structured-output contract association | Structured-output schema validation | Generalizes | Focuses on managing and retrieving the output contract with the prompt, regardless of where validation executes. |
+| R-307 | Workspace and folder management | Per domain workspace, Multi-workspace | Generalizes | Reframes prompt isolation as operational management of large prompt sets across workspaces and folders. |
+| R-403 | Prompt organization and discoverability | Hierarchical organization | Generalizes | Preserves organizational structure without prescribing one hierarchy model. |
+| R-404 | A/B testing | A/B testing | Generalizes | Keeps the need for A/B testing without prescribing one traffic-routing design. |
+| R-602 | Filterable and query-driven dataset selection | Scenario heuristics | Generalizes | Keeps the need for systematic scenario selection while removing dependence on dbt TVFs or SQL implementation details. |
+| R-704 | Cross-entity lineage navigation | Prompt-registry deep-links | Generalizes | Reframes a vendor-specific deep link into the broader need to navigate across related evaluation and prompt entities. |
+| R-903 | Workspace-scoped credentials and access boundaries | API keys per workspace | Generalizes | Preserves isolation of programmatic credentials while allowing vendors to implement it with different auth models. |
+| R-301 | Versioned prompt registry | Hosted prompt storage, Immutable versioning, Fetch prompts | Merges and broadens | Reframes server-side prompt storage, immutable publish semantics, and retrieval as one core versioned-registry requirement. |
+| R-302 | Runtime prompt resolution | Fetch prompts, Per-request prompt override | Merges and generalizes | Combines baseline retrieval and runtime targeting into a broader runtime-resolution requirement. |
+| R-402 | Reusable prompt components | Snippet versioning & labels, Snippet expansion (server-side), Snippet sharing across prompts, Snippet auto-propagation | Merges and generalizes | Reframes vendor-specific snippet behavior as the broader need for centrally managed reusable prompt components. |
+| R-405 | Programmatic prompt-management interfaces | REST API, MCP, Python SDK | Merges and generalizes | Reframes several specific access surfaces as the broader need for programmable interfaces across application contexts, our own scripting and automation, and coding-agent-friendly tooling. |
+| R-503 | Domain-metric expressiveness | Outcome metrics, Retrieval precision / recall, Multilabel macro precision / recall | Merges and generalizes | Reframes current business and task metrics as examples of broader custom metric expressiveness. |
+| R-801 | Programmatic and CI evaluation execution | GitHub Actions workflows, Local runner, Persistence | Merges and generalizes | Merges external orchestration with programmatic run and result access: run evaluations in CI, from the CLI, or programmatically, and retrieve runs and outputs. |
+| R-201 | Exportable raw observability data | Per-call request logging (BigQuery) | Replaces | Reframed from a specific warehouse sink to the broader requirement for raw-data export. |
+| None | No direct requirement carried forward | Sentry error capture | Dropped | Treated as application-level error tracking rather than a platform selection criterion for LLM observability. |
+| None | No direct requirement carried forward | OpenTelemetry tracing, Webhooks | Dropped | Treated as non-core integration surfaces for this evaluation unless a later procurement stage elevates them. |
+| R-103 | Conversation and workflow grouping | None explicit | New | Added because isolated call logs are not enough for agent and support workflows. |
+| R-106 | Retention and queryability | None explicit | New | Added because data usefulness depends on retention window and practical query access. |
+| R-107 | Access control and sensitive-data handling | None explicit | New | Added because observability data can contain sensitive content and must be governed safely. |
+| R-203 | ADK and Temporal tracing compatibility | None explicit | New | Added to require compatibility with Gorgias's custom ADK on Temporal, with instrumentation driven primarily from TypeScript rather than assuming a Python-first or vendor-owned runtime. |
+| R-705 | UI-configurable experiment parameters | None explicit | New | Added so engineers can tune model parameters and hyperparameters for an offline evaluation from the UI without committing code. |
+| R-706 | Domain-expert review usability | None explicit | New | Added to require that non-technical domain experts (mostly product managers) can directly participate in review, annotation, and evaluation workflows without relying on code-centric tooling. |
diff --git a/docs/llm-observability-and-evaluation-platform-requirements.md b/docs/llm-observability-and-evaluation-platform-requirements.md
new file mode 100644
index 0000000..38a892f
--- /dev/null
+++ b/docs/llm-observability-and-evaluation-platform-requirements.md
@@ -0,0 +1,120 @@
+# LLM observability and evaluation platform requirements
+
+Canonical requirements for evaluating LLM observability and evaluation platforms at Gorgias.
+
+This document is vendor-agnostic. It defines the capabilities to validate, not the implementation details of the current stack.
+
+## Scope
+
+This document covers observability, prompt management, evaluation, and governance.
+
+Future revisions can add areas such as commercial terms, vendor support expectations, and compliance-specific constraints.
+
+## Evaluation principles
+
+- Define requirements in terms of capabilities, not current vendors, sinks, or internal instrumentation patterns.
+- State what must be validated during vendor evaluation, not how Gorgias must implement it internally.
+- Keep explicit non-requirements so the evaluation does not drift into adjacent tooling decisions.
+
+## Requirements
+
+### Native observability requirements
+
+| ID | Criterion | What we need to validate |
+|---|---|---|
+| R-101 | Full-fidelity LLM interaction capture | The platform captures each LLM interaction with enough detail for debugging, audit, and evaluation, including request and response payloads, errors, latency, token usage, model and inference parameters, tool calls, prompt version and label, and runtime metadata. Storage destination is not prescribed. |
+| R-102 | Searchable call inspection | Users can search, filter, and inspect LLM interactions by operationally relevant dimensions such as time range, application or domain, model, prompt, outcome, error type, tags, and custom metadata. A web UI is preferred, but practical inspectability is the requirement. |
+| R-103 | Conversation and workflow grouping | The platform can group related LLM interactions into larger units such as conversations, sessions, traces, workflows, or runs. Individual calls are not sufficient for many agent and support use cases. |
+| R-104 | Trace and application-context correlation | LLM interactions can be correlated with higher-level application context such as request IDs, conversation IDs, customer or domain context, and upstream or downstream workflow steps. Correlation is required; a specific APM vendor is not. |
+| R-105 | Operational metrics surface | The platform exposes operational metrics for LLM usage, including latency, success and error rate, token usage, and cost, at both call and aggregate levels. Aggregate views should support monitoring and investigation. |
+| R-106 | Retention and queryability | Observability data is retained long enough to support investigation and evaluation and remains queryable at the scale Gorgias needs. Retention limits and scale constraints must be explicit. |
+| R-107 | Access control and sensitive-data handling | The platform supports workspace or domain isolation, role-based access, and safe handling of sensitive data through redaction, masking, or equivalent controls. This is both an operational and governance requirement. |
+| R-108 | Operational dashboard | The platform provides a built-in dashboard for visualizing operational metrics such as latency, error rate, token usage, and cost over time. Dashboarding must be native to the platform, not limited to metric export for external tools. |
+| R-109 | Native alerting | The platform supports configuring alerts on operational metrics such as error rate, latency, or cost thresholds directly within the platform. This is a native platform capability, not a downstream concern delegated to an external system via metric export. |
+
+### Observability interoperability requirements
+
+| ID | Criterion | What we need to validate |
+|---|---|---|
+| R-201 | Exportable raw observability data | Raw LLM interaction data can be exported or synchronized to external systems for downstream analysis, warehousing, or internal tooling. Exportability matters, not a specific destination such as BigQuery. |
+| R-202 | Exportable metrics and aggregates | If a metric is defined in the platform, we need to understand whether it can also be made available elsewhere, such as an analytics database or external monitoring and alerting systems like Datadog, rather than only living inside the platform. This includes latency, cost, usage, and outcome-oriented metrics where available. |
+| R-203 | ADK and Temporal tracing compatibility | The platform's tracing and observability model is compatible with Gorgias's custom ADK on Temporal and with TypeScript-led instrumentation, so ADK-managed execution can be captured and correlated without replatforming the orchestration layer. Compatibility can come from a TypeScript SDK, manual instrumentation, or both. |
+
+### Prompt management requirements
+
+| ID | Criterion | What we need to validate |
+|---|---|---|
+| R-301 | Versioned prompt registry | The platform provides a server-side prompt registry with immutable versions so prompts can be managed, reviewed, and deployed in a controlled way. Reliable versioned retrieval matters more than the storage model. |
+| R-302 | Runtime prompt resolution | Applications can retrieve prompts at runtime using a stable identifier and an explicit deployment target such as an environment, label, release channel, or exact version. Controlled resolution matters, not a specific lookup API shape. |
+| R-303 | Prompt-linked execution configuration | A prompt definition can carry the runtime execution settings it needs, such as model selection, provider-specific options, inference parameters, tool configuration, and structured-output schema where applicable. This keeps the executable contract traceable in one place. |
+| R-304 | Prompt version targeting and release channels | Prompt versions can be selected, overridden, promoted, and moved across named deployment labels or equivalent release channels without mutating the underlying immutable versions. This must support testing, debugging, staged rollout, and rollback. |
+| R-305 | Expressive templating | Prompt definitions support the dynamic templating Gorgias needs, including variable substitution and basic control-flow constructs where necessary. Functional expressiveness matters, not Jinja2 compatibility by name. |
+| R-306 | Structured-output contract association | The platform can associate a structured-output contract with a prompt and make that contract retrievable at runtime for validation and typed execution flows when structured outputs are used. Validation may happen in the platform, SDK, or client. |
+| R-307 | Workspace and folder management | Prompts can be managed across separate workspaces, domains, or teams and grouped into folders or equivalent structures so large prompt sets remain manageable and independently owned sets do not interfere with one another. This is about operational manageability; security isolation is covered under governance. |
+
+### Prompt authoring and registry operations requirements
+
+| ID | Criterion | What we need to validate |
+|---|---|---|
+| R-401 | Prompt authoring and review UI | The platform provides a practical UI for creating, reviewing, diffing, and browsing prompt assets and their versions. Collaborative prompt operations matter, not a specific editor design. |
+| R-402 | Reusable prompt components | Prompt assets can reference shared reusable components with independent versioning or update semantics so common content can be managed centrally. The mechanism does not need to be called snippets. |
+| R-403 | Prompt organization and discoverability | Prompt assets can be organized and discovered using folders, namespaces, tags, or equivalent structures that scale beyond a flat list. The exact hierarchy model is not prescribed. |
+| R-404 | A/B testing | The platform supports A/B testing across prompt variants, such as traffic splitting, comparative analysis, or equivalent mechanisms. The experimentation model may vary by vendor. |
+| R-405 | Programmatic prompt-management interfaces | Prompt assets and related metadata can be created, retrieved, and managed through programmable interfaces suitable for applications, our own scripting and automation, and coding-agent-friendly tooling. API, SDK, CLI, MCP, or other agent-facing interfaces are all acceptable. |
+
+### Evaluation scoring requirements
+
+| ID | Criterion | What we need to validate |
+|---|---|---|
+| R-501 | Custom evaluation logic | The platform supports authoring and executing custom evaluation logic, including LLM-as-judge patterns and deterministic scorers, so Gorgias can express its own quality criteria. Scorer expressiveness matters, not a fixed catalog of built-in judges. |
+| R-502 | Conditional scorer application | Different scorers can be applied selectively based on runtime attributes such as scenario, turn type, channel, or other record-level predicates. This avoids forcing every example through the same scorer set. |
+| R-503 | Domain-metric expressiveness | The platform can express domain-specific outcome metrics and task-quality metrics even when they are not built-in score types. Examples include handover, automation, reopened rate, retrieval quality, tags, and other business-specific outcomes. |
+| R-504 | Live online evaluation | The platform can run custom scorers against live production traffic on each execution, not only in offline batch experiments. This allows continuous quality monitoring without a separate evaluation pipeline. |
+
+### Evaluation dataset requirements
+
+| ID | Criterion | What we need to validate |
+|---|---|---|
+| R-601 | Trace- and log-derived dataset creation | The platform can create datasets from historical production traces, logs, or imported records rather than requiring fully manual example entry. The source does not need to be native vendor traffic. |
+| R-602 | Filterable and query-driven dataset selection | Dataset construction supports expressive filtering or query-like selection over historical records so scenarios can be assembled systematically. Queryability matters, not SQL by name. |
+| R-603 | Parameterized sampling and slicing | Dataset creation supports reproducible sampling and slicing by attributes such as date range, intent, channel, count, or other metadata dimensions. This is important for controlled comparisons and refreshable benchmarks. |
+| R-604 | Ground-truth annotation workflows | The platform supports attaching reviewer annotations, labels, or expected outcomes to dataset items for later scoring and analysis. Human review may be native or integration-assisted. |
+| R-605 | Dataset versioning and reproducibility | Dataset definitions and contents can be versioned or snapshotted so evaluation runs remain reproducible over time, including preserving the selection logic or resolved membership used for a run. |
+
+### Evaluation UI requirements
+
+| ID | Criterion | What we need to validate |
+|---|---|---|
+| R-701 | Evaluation run browser | Users can browse evaluation runs and inspect the metadata needed to compare them, such as status, dataset, configuration, agent version, and scoring summary. The goal is useful run-level visibility. |
+| R-702 | Dataset browser and item inspection | Users can browse datasets and inspect the individual items, examples, or records they contain. This should support review, debugging, and auditability. |
+| R-703 | Record-level evaluation inspection | Users can drill into an evaluated interaction or conversation and inspect scores, judgments, traces, and supporting evidence. This complements aggregate reporting with case-level debugging. |
+| R-704 | Cross-entity lineage navigation | From a dataset item, evaluated conversation, or similar record, the UI can navigate to related entities such as the evaluation runs that used it, the underlying LLM calls, the prompts, and the scoring results. Connected investigation matters, not a specific deep-link implementation. |
+| R-705 | UI-configurable experiment parameters | Engineers can set or override the model, parameters, and hyperparameters for an offline evaluation experiment directly in the platform UI and launch the run without code changes. This supports prompt iteration without a code-deploy cycle. |
+| R-706 | Domain-expert review usability | Non-technical domain experts (mostly product managers) can use the platform to review evaluated records, compare outcomes, add annotations or feedback, and participate in evaluation workflows without writing code. Practical usability for non-engineering users is the requirement, not a specific UI design. |
+
+### Evaluation workflow boundary requirements
+
+| ID | Criterion | What we need to validate |
+|---|---|---|
+| R-801 | Programmatic and CI evaluation execution | Evaluations can be launched and managed from Gorgias-owned automation, including CI, the command line, and programmatic workflows, and runs, statuses, and outputs can be created, queried, and retrieved through an API, SDK, or equivalent. The platform does not need to replace GitHub Actions or local tooling. |
+| R-802 | Shopper simulator compatibility | The platform should support shopper simulation work, including building an equivalent of Gorgias's current shopper simulator against the platform and evaluating the resulting conversations at the end of each simulation, without the vendor needing to own the simulation harness. |
+
+### Governance and enterprise access requirements
+
+| ID | Criterion | What we need to validate |
+|---|---|---|
+| R-901 | Role-based access control | The platform supports granular roles and permissions across workspaces, domains, or equivalent boundaries so engineering, operations, and review access can be separated appropriately. The exact role taxonomy may vary by vendor. |
+| R-902 | Enterprise identity integration | The platform supports enterprise identity controls such as SSO or equivalent centralized authentication and user lifecycle management. This is required for operational rollout. |
+| R-903 | Workspace-scoped credentials and access boundaries | Programmatic access can be scoped to the appropriate workspace, domain, or environment boundary through separate credentials or equivalent access controls. This matters for blast-radius control and operational isolation. |
+
+## Explicitly out of scope
+
+- Choosing a dedicated LLM gateway
+- Duplicating application telemetry that already belongs to the broader Gorgias observability stack
+- Requiring a specific prompt-cache storage backend, refresh strategy, or snapshot implementation
+- Requiring the vendor to replace GitHub Actions, local developer runners, or Gorgias-owned storage conventions for evaluation artifacts
+- Requiring the vendor to provide its own conversation simulator or shopper simulator, as long as external simulation outputs can be evaluated
+
+## Related documents
+
+- [Crosswalk to prior inventory](llm-observability-and-evaluation-platform-requirements-crosswalk.md) — maps each criterion in this document back to the prior inventory.