docs: define LLM observability eval requirements #1
| @@ -0,0 +1,50 @@ | |||
| # Crosswalk: LLM observability and evaluation platform requirements to prior inventory | |||
@r0ots this is the file I used to convert the features listed in your Notion doc into more agnostic requirements that are not too tied to Gorgias specifics.
For each "Legacy item" (coming from the Notion doc) you will find the new criterion, plus an explanation of whether it was expanded to be more generic, dropped, or handled some other way.
The other file is the "final" list of criteria.
2842422 to f85dc87
| |---|---| | ||
| | Evaluation run browser | Users can browse evaluation runs and inspect the metadata needed to compare them, such as status, dataset, configuration, agent version, and scoring summary. The goal is useful run-level visibility. | | ||
| | Dataset browser and item inspection | Users can browse datasets and inspect the individual items, examples, or records they contain. This should support review, debugging, and auditability. | | ||
| | Record-level evaluation inspection | Users can drill into an evaluated interaction or conversation and inspect scores, judgments, traces, and supporting evidence. This complements aggregate reporting with case-level debugging. | |
This one sounds like a duplicate of the previous one (which says: "users can browse dataset [...] inspect the individual items").
But here the focus is on a given evaluation of multiple scenarios, and on inspecting each evaluated conversation.
| | Cross-entity lineage navigation | From a dataset item, evaluated conversation, or similar record, the UI can navigate to related entities such as the evaluation runs that used it, the underlying LLM calls, the prompts, and the scoring results. Connected investigation matters, not a specific deep-link implementation. | |
This is linked to the MLA pain point of having to jump between tools to inspect prompts, annotations, scores, etc. (see the original pain point).
| | UI-configurable experiment parameters | Engineers can set or override the model, parameters, and hyperparameters for an offline evaluation experiment directly in the platform UI and launch the run without code changes. This supports prompt iteration without a code-deploy cycle. | |
I added this one for the hyperparameter topic. AFAIK Braintrust is the only vendor providing a primitive for those runtime experiment parameters, but that may have changed since.
| | Criterion | What we need to validate | | ||
| |---|---| | ||
| | Trace- and log-derived dataset creation | The platform can create datasets from historical production traces, logs, or imported records rather than requiring fully manual example entry. The source does not need to be native vendor traffic. | |
This is where our current heuristic will be migrated: instead of inspecting BQ, we'll inspect the traces and their metadata. See the Braintrust example here at 1:35.
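To make the idea concrete, here is a rough sketch of what "inspect the traces and their metadata" could look like once the heuristic no longer reads from BQ. Everything in it is an assumption for illustration: the `Trace` and `DatasetItem` shapes, the `build_dataset` helper, and the `outcome == "escalated"` rule are hypothetical and do not reflect any vendor's schema or our actual heuristic.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Trace:
    """Hypothetical trace record; field names are assumptions, not a vendor schema."""
    trace_id: str
    input: str
    output: str
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class DatasetItem:
    """Hypothetical dataset item that keeps a link back to its source trace."""
    source_trace_id: str
    input: str
    expected_output: Optional[str] = None  # left empty until someone labels it


def build_dataset(traces: List[Trace], keep: Callable[[Trace], bool]) -> List[DatasetItem]:
    """Turn the traces selected by the `keep` heuristic into dataset items."""
    return [
        DatasetItem(source_trace_id=t.trace_id, input=t.input)
        for t in traces
        if keep(t)
    ]


def is_escalated(trace: Trace) -> bool:
    """Placeholder heuristic: select traces whose metadata marks an escalation."""
    return trace.metadata.get("outcome") == "escalated"
```

Keeping `source_trace_id` on each item is what would later allow navigating from a dataset item back to the trace it came from.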
| | New criterion | Legacy item(s) | Relationship | Change note | | ||
| |---|---|---|---| | ||
| | Conditional scorer application | Conditional judge selection | Equivalent | Keeps conditional activation as a direct requirement while broadening it beyond judges. | |
Add new columns with identifier
Adds a vendor-agnostic requirements document for evaluating LLM observability and evaluation platforms at Gorgias. Adds a crosswalk mapping each new criterion back to the prior inventory and clarifying whether it is equivalent, broadened, generalized, merged, replaced, dropped, or newly introduced. The scope stays focused on capabilities to validate during vendor evaluation rather than implementation details from the current stack.