docs: define LLM observability eval requirements#1

Open
diephil wants to merge 4 commits into main from diephil/con3-5-establish-clear-requirements-to-validate-against-for-the

Member

@diephil diephil commented Jun 2, 2026

Adds a vendor-agnostic requirements document for evaluating LLM observability and evaluation platforms at Gorgias. Adds a crosswalk mapping each new criterion back to the prior inventory and clarifying whether it is equivalent, broadened, generalized, merged, replaced, dropped, or newly introduced. The scope stays focused on capabilities to validate during vendor evaluation rather than implementation details from the current stack.

linear-code Bot commented Jun 2, 2026

CON3-5

@diephil diephil requested a review from r0ots June 2, 2026 13:21
@@ -0,0 +1,50 @@
# Crosswalk: LLM observability and evaluation platform requirements to prior inventory
Member Author

@r0ots this is the file I used to convert the features listed in your Notion doc into more agnostic requirements that are not too tied to Gorgias specifics.

For each entry you will find the "Legacy item" (from the Notion doc), the new criterion, and a note explaining whether it was expanded to be more generic, dropped, or changed in some other way.

The other file is the "final" list of criteria.

@diephil diephil force-pushed the diephil/con3-5-establish-clear-requirements-to-validate-against-for-the branch from 2842422 to f85dc87 Compare June 2, 2026 18:24
|---|---|
| Evaluation run browser | Users can browse evaluation runs and inspect the metadata needed to compare them, such as status, dataset, configuration, agent version, and scoring summary. The goal is useful run-level visibility. |
| Dataset browser and item inspection | Users can browse datasets and inspect the individual items, examples, or records they contain. This should support review, debugging, and auditability. |
| Record-level evaluation inspection | Users can drill into an evaluated interaction or conversation and inspect scores, judgments, traces, and supporting evidence. This complements aggregate reporting with case-level debugging. |
Member Author

This one sounds like a duplicate of the previous one (which says: "Users can browse datasets [...] inspect the individual items").

But here the focus is on a given evaluation of multiple scenarios, and on inspecting each evaluated conversation.

| Evaluation run browser | Users can browse evaluation runs and inspect the metadata needed to compare them, such as status, dataset, configuration, agent version, and scoring summary. The goal is useful run-level visibility. |
| Dataset browser and item inspection | Users can browse datasets and inspect the individual items, examples, or records they contain. This should support review, debugging, and auditability. |
| Record-level evaluation inspection | Users can drill into an evaluated interaction or conversation and inspect scores, judgments, traces, and supporting evidence. This complements aggregate reporting with case-level debugging. |
| Cross-entity lineage navigation | From a dataset item, evaluated conversation, or similar record, the UI can navigate to related entities such as the evaluation runs that used it, the underlying LLM calls, the prompts, and the scoring results. Connected investigation matters, not a specific deep-link implementation. |
Member Author

This is linked to the original MLA pain point of having to jump between tools to inspect prompts, annotations, scores, etc.

@diephil diephil marked this pull request as ready for review June 2, 2026 18:42
| Dataset browser and item inspection | Users can browse datasets and inspect the individual items, examples, or records they contain. This should support review, debugging, and auditability. |
| Record-level evaluation inspection | Users can drill into an evaluated interaction or conversation and inspect scores, judgments, traces, and supporting evidence. This complements aggregate reporting with case-level debugging. |
| Cross-entity lineage navigation | From a dataset item, evaluated conversation, or similar record, the UI can navigate to related entities such as the evaluation runs that used it, the underlying LLM calls, the prompts, and the scoring results. Connected investigation matters, not a specific deep-link implementation. |
| UI-configurable experiment parameters | Engineers can set or override the model, parameters, and hyperparameters for an offline evaluation experiment directly in the platform UI and launch the run without code changes. This supports prompt iteration without a code-deploy cycle. |
Member Author

@diephil diephil Jun 2, 2026

I added this one for the hyperparameter topic. AFAIK Braintrust is the only one providing a primitive for supplying those runtime experiment parameters, but that may have changed since.
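To make the criterion concrete, here is a minimal sketch of what "runtime experiment parameters" could mean in practice: defaults live in code, and the platform UI supplies overrides at launch time. It is not tied to any vendor API. `ExperimentParams`, `apply_overrides`, and the field names and default values are hypothetical and only illustrate the shape of the requirement.

```python
from dataclasses import dataclass, fields, replace
from typing import Any, Dict


@dataclass(frozen=True)
class ExperimentParams:
    """Hypothetical parameters for one offline evaluation run."""
    model: str = "model-a"      # placeholder model name, not a real model id
    temperature: float = 0.0
    max_tokens: int = 512


def apply_overrides(defaults: ExperimentParams,
                    overrides: Dict[str, Any]) -> ExperimentParams:
    """Return a copy of `defaults` with UI-supplied overrides applied.

    Unknown keys are rejected so that a typo in the UI does not
    silently fall back to the default value.
    """
    allowed = {f.name for f in fields(defaults)}
    unknown = set(overrides) - allowed
    if unknown:
        raise ValueError(f"unknown experiment parameters: {sorted(unknown)}")
    return replace(defaults, **overrides)


# Example: an engineer changes only the temperature from the UI.
params = apply_overrides(ExperimentParams(), {"temperature": 0.7})
```

During vendor evaluation, the thing to check is whether the platform exposes an equivalent of the `overrides` input in its UI and records the resolved parameters alongside the run.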


| Criterion | What we need to validate |
|---|---|
| Trace- and log-derived dataset creation | The platform can create datasets from historical production traces, logs, or imported records rather than requiring fully manual example entry. The source does not need to be native vendor traffic. |
Member Author

@diephil diephil Jun 2, 2026

This is where our current heuristic will migrate: instead of inspecting BQ, we'll inspect the traces and their metadata. For an example with Braintrust, see here at 1:35.
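As a rough illustration of that migration, the sketch below builds dataset items by filtering exported traces on their metadata. It assumes traces are available as plain records. `Trace`, `DatasetItem`, `build_dataset`, and the `resolved` metadata key are hypothetical names, not any vendor's schema or our actual heuristic.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List


@dataclass
class Trace:
    """Hypothetical exported trace record."""
    trace_id: str
    input: str
    output: str
    metadata: Dict[str, Any]


@dataclass
class DatasetItem:
    input: str
    observed_output: str   # production output, not a ground-truth label
    source_trace_id: str   # kept so the item can be traced back to its origin


def build_dataset(traces: Iterable[Trace],
                  keep: Callable[[Trace], bool]) -> List[DatasetItem]:
    """Select traces with a metadata-based predicate and turn them into items.

    Duplicate trace ids are skipped so re-exported traces do not
    produce duplicate dataset items.
    """
    seen = set()
    items: List[DatasetItem] = []
    for trace in traces:
        if trace.trace_id in seen or not keep(trace):
            continue
        seen.add(trace.trace_id)
        items.append(DatasetItem(trace.input, trace.output, trace.trace_id))
    return items


# Example heuristic (hypothetical): keep conversations that were not resolved.
def unresolved(trace: Trace) -> bool:
    return not trace.metadata.get("resolved", False)
```

The predicate is the part that replaces the current BQ query. The criterion to validate is whether the platform lets us express such a filter over traces and their metadata and save the result as a dataset.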


| New criterion | Legacy item(s) | Relationship | Change note |
|---|---|---|---|
| Conditional scorer application | Conditional judge selection | Equivalent | Keeps conditional activation as a direct requirement while broadening it beyond judges. |
Member Author

Add new columns with identifier
