docs: define LLM observability eval requirements by diephil · Pull Request #1 · gorgias/ai-agent-stack

diephil · 2026-06-02T13:19:21Z

Adds a vendor-agnostic requirements document for evaluating LLM observability and evaluation platforms at Gorgias. Adds a crosswalk mapping each new criterion back to the prior inventory and clarifying whether it is equivalent, broadened, generalized, merged, replaced, dropped, or newly introduced. The scope stays focused on capabilities to validate during vendor evaluation rather than implementation details from the current stack.

linear-code · 2026-06-02T13:19:25Z

CON3-5

diephil · 2026-06-02T18:21:54Z

@@ -0,0 +1,50 @@
+# Crosswalk: LLM observability and evaluation platform requirements to prior inventory


@r0ots this is the file I used to "convert" the features listed in your Notion doc into vendor-agnostic requirements that are not too tied to Gorgias specifics.

For each "Legacy item" (from the Notion doc) you will find the new criterion plus a note explaining whether it was broadened to be more generic, dropped, or otherwise changed.

The other file is the "final" list of criteria.
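To make the crosswalk structure concrete, here is a minimal sketch of how one row could be modelled and checked for coverage. The relationship labels come from the PR description; the class names, the `uncovered_legacy_items` helper, and the second legacy item in the usage note are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Relationship(Enum):
    # Labels taken from the PR description of the crosswalk.
    EQUIVALENT = "Equivalent"
    BROADENED = "Broadened"
    GENERALIZED = "Generalized"
    MERGED = "Merged"
    REPLACED = "Replaced"
    DROPPED = "Dropped"
    NEW = "New"


@dataclass(frozen=True)
class CrosswalkRow:
    # Hypothetical shape mirroring the four crosswalk table columns.
    new_criterion: Optional[str]   # None when a legacy item was dropped
    legacy_items: Tuple[str, ...]  # empty when the criterion is newly introduced
    relationship: Relationship
    change_note: str


def uncovered_legacy_items(rows, legacy_inventory):
    """Return legacy items that no crosswalk row accounts for, sorted by name."""
    covered = {item for row in rows for item in row.legacy_items}
    return sorted(set(legacy_inventory) - covered)
```

A check like `uncovered_legacy_items(rows, inventory)` would flag any item from the Notion doc that was neither mapped nor explicitly dropped. For example, with only the "Conditional judge selection" row present, a made-up second inventory entry would be reported as uncovered.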

diephil · 2026-06-02T18:40:42Z

+|---|---|
+| Evaluation run browser | Users can browse evaluation runs and inspect the metadata needed to compare them, such as status, dataset, configuration, agent version, and scoring summary. The goal is useful run-level visibility. |
+| Dataset browser and item inspection | Users can browse datasets and inspect the individual items, examples, or records they contain. This should support review, debugging, and auditability. |
+| Record-level evaluation inspection | Users can drill into an evaluated interaction or conversation and inspect scores, judgments, traces, and supporting evidence. This complements aggregate reporting with case-level debugging. |


This one sounds like a duplicate of the previous one (which says: "Users can browse datasets [...] inspect the individual items").

But here the focus is on a given evaluation run across multiple scenarios, and on inspecting each evaluated conversation.

diephil · 2026-06-02T18:41:38Z

+| Evaluation run browser | Users can browse evaluation runs and inspect the metadata needed to compare them, such as status, dataset, configuration, agent version, and scoring summary. The goal is useful run-level visibility. |
+| Dataset browser and item inspection | Users can browse datasets and inspect the individual items, examples, or records they contain. This should support review, debugging, and auditability. |
+| Record-level evaluation inspection | Users can drill into an evaluated interaction or conversation and inspect scores, judgments, traces, and supporting evidence. This complements aggregate reporting with case-level debugging. |
+| Cross-entity lineage navigation | From a dataset item, evaluated conversation, or similar record, the UI can navigate to related entities such as the evaluation runs that used it, the underlying LLM calls, the prompts, and the scoring results. Connected investigation matters, not a specific deep-link implementation. |


This is linked to the MLA pain point of having to jump between tools to inspect prompts, annotations, scores, etc. (see the original pain point).
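As a rough illustration of what "connected investigation" means for this criterion, the sketch below keeps bidirectional links between records so that any entity can reach its neighbours in one hop. The `LineageIndex` class, the entity kinds, and the IDs are all hypothetical; a vendor would expose this through its own UI and data model.

```python
from collections import defaultdict


class LineageIndex:
    """Bidirectional links between records, each identified by a (kind, id) tuple."""

    def __init__(self):
        self._links = defaultdict(set)

    def link(self, a, b):
        # Store the edge in both directions so navigation works from either side.
        self._links[a].add(b)
        self._links[b].add(a)

    def related(self, entity, kind):
        """Return the directly linked entities of the given kind, sorted."""
        return sorted(e for e in self._links.get(entity, ()) if e[0] == kind)
```

With `index.link(("dataset_item", "item-1"), ("eval_run", "run-7"))` recorded, `index.related(("dataset_item", "item-1"), "eval_run")` gives the runs that used the item, and the same call from the run side gives the items it covered. What matters during vendor evaluation is that this kind of hop is available without switching tools.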

diephil · 2026-06-02T18:47:49Z

+| Dataset browser and item inspection | Users can browse datasets and inspect the individual items, examples, or records they contain. This should support review, debugging, and auditability. |
+| Record-level evaluation inspection | Users can drill into an evaluated interaction or conversation and inspect scores, judgments, traces, and supporting evidence. This complements aggregate reporting with case-level debugging. |
+| Cross-entity lineage navigation | From a dataset item, evaluated conversation, or similar record, the UI can navigate to related entities such as the evaluation runs that used it, the underlying LLM calls, the prompts, and the scoring results. Connected investigation matters, not a specific deep-link implementation. |
+| UI-configurable experiment parameters | Engineers can set or override the model, parameters, and hyperparameters for an offline evaluation experiment directly in the platform UI and launch the run without code changes. This supports prompt iteration without a code-deploy cycle. |


I added this one for the hyperparameter topic. AFAIK Braintrust is the only vendor providing a primitive for those runtime experiment parameters, but that may have changed since.
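To show what "runtime experiment parameters" could look like on our side, here is a minimal, vendor-neutral sketch that merges UI-supplied overrides into the defaults declared by the experiment. It is not any vendor's API: `DEFAULT_PARAMS`, `resolve_params`, and the parameter names and values are assumptions made for illustration.

```python
# Hypothetical defaults declared once by the experiment code.
DEFAULT_PARAMS = {"model": "example-model-a", "temperature": 0.0, "max_tokens": 512}


def resolve_params(defaults, overrides):
    """Merge overrides (e.g. entered in a platform UI) into the declared defaults.

    Unknown keys are rejected so that a typo in the UI cannot silently
    launch a run with the default value.
    """
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise ValueError(f"unknown experiment parameters: {sorted(unknown)}")
    return {**defaults, **overrides}
```

Calling `resolve_params(DEFAULT_PARAMS, {"temperature": 0.7})` keeps the declared model and token limit and changes only the temperature, which is the kind of iteration the criterion wants to allow without a code-deploy cycle.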

diephil · 2026-06-02T18:52:48Z

+
+| Criterion | What we need to validate |
+|---|---|
+| Trace- and log-derived dataset creation | The platform can create datasets from historical production traces, logs, or imported records rather than requiring fully manual example entry. The source does not need to be native vendor traffic. |


This is where our current heuristic will be migrated to: instead of inspecting BQ, we'll inspect the traces and their metadata. See the example with Braintrust here at 1:35.
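A minimal sketch of that migration, assuming traces are available as plain dictionaries: the selection heuristic becomes a predicate over trace metadata, and matching traces are turned into dataset items. The field names (`id`, `input`, `output`, `metadata`) and the `escalated` flag are hypothetical and do not reflect our actual heuristic or any vendor's trace schema.

```python
def dataset_from_traces(traces, keep):
    """Build dataset items from the traces for which keep(trace) is true."""
    items = []
    for trace in traces:
        if keep(trace):
            items.append({
                "input": trace["input"],
                "output": trace["output"],
                # Copy the metadata so later edits to the item do not touch the trace.
                "metadata": dict(trace.get("metadata", {})),
                # Keep a pointer back to the source trace for auditability.
                "source_trace_id": trace["id"],
            })
    return items


def escalated(trace):
    # Hypothetical heuristic: keep only traces flagged as escalated.
    return trace.get("metadata", {}).get("escalated") is True
```

Keeping `source_trace_id` on every item is a deliberate choice in this sketch, since it lets a dataset item be traced back to the production interaction it came from.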

diephil · 2026-06-04T08:12:09Z

+
+| New criterion | Legacy item(s) | Relationship | Change note |
+|---|---|---|---|
+| Conditional scorer application | Conditional judge selection | Equivalent | Keeps conditional activation as a direct requirement while broadening it beyond judges. |


Add new columns with identifiers.
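For the "Conditional scorer application" row quoted above, the sketch below shows one way conditional activation could work: each scorer carries a predicate, and only scorers whose predicate matches a record are run. The `ConditionalScorer` class, the `intent` metadata key, and both example scorers are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ConditionalScorer:
    name: str
    applies_to: Callable[[dict], bool]  # decides whether the scorer runs on a record
    score: Callable[[dict], float]      # produces the score when it does run


def apply_scorers(record, scorers):
    """Run only the scorers whose condition matches the record."""
    return {s.name: s.score(record) for s in scorers if s.applies_to(record)}


# Hypothetical scorers: one always applies, one only applies to refund conversations.
non_empty = ConditionalScorer(
    name="non_empty_answer",
    applies_to=lambda r: True,
    score=lambda r: 1.0 if r["output"].strip() else 0.0,
)
mentions_refund = ConditionalScorer(
    name="mentions_refund",
    applies_to=lambda r: r.get("metadata", {}).get("intent") == "refund",
    score=lambda r: 1.0 if "refund" in r["output"].lower() else 0.0,
)
```

The predicate is independent of what produces the score, so the same mechanism covers LLM judges, code-based checks, or any other scorer type, which is the broadening the change note describes.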

diephil requested a review from r0ots June 2, 2026 13:21

docs: define LLM observability eval requirements

f85dc87

diephil force-pushed the diephil/con3-5-establish-clear-requirements-to-validate-against-for-the branch from 2842422 to f85dc87 June 2, 2026 18:24

docs: refine evaluation wording in readme

61fa443

diephil marked this pull request as ready for review June 2, 2026 18:42

diephil added 2 commits June 4, 2026 15:17

docs: add requirement IDs and refine wording

d1770c6

docs: add domain expert evaluation requirement

338ce35

diephil wants to merge 4 commits into main from diephil/con3-5-establish-clear-requirements-to-validate-against-for-the
