Open
Labels
- ai: Artificial Intelligence
- backend: Concerning any and all backend issues
- enhancement: New feature or request
- frontend: Concerning any and all frontend issues
Description
Summary
Add an automated evaluation system for ByteChef AI agents: test scenarios, LLM and deterministic judges, asynchronous run execution, and a dedicated UI panel, integrated with spring-ai-community/agent-judge.
Key Features
- Test Scenarios: Single-turn (message → response) and multi-turn (simulated conversation) scenarios
- Two-Level Judges: Agent-level judges (run on all scenarios) + scenario-level judges (scoped)
- Judge Types: LLM rule-based + deterministic (contains text, regex, response length, JSON schema, similarity)
- Async Execution: Runs execute asynchronously with progress tracking and cancellation
- Results & History: Score tracking, judge verdicts with explanations, conversation transcript storage
- UI Panel: New "Evals" tab in AI Agent Editor with Tests/Judges/Runs sub-tabs
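To make the deterministic judge types listed above concrete, here is a minimal Java sketch of three of them (contains text, regex, response length). The `Verdict` record, the `DeterministicJudges` class, and all method names are hypothetical and chosen for illustration only; they are not taken from ByteChef or from spring-ai-community/agent-judge, whose actual interfaces may look different.

```java
import java.util.regex.Pattern;

/** Hypothetical verdict shape: pass/fail plus a short explanation. */
record Verdict(boolean passed, String explanation) {}

/** Hypothetical deterministic checks; illustrative names, not a real ByteChef or agent-judge API. */
final class DeterministicJudges {

    private DeterministicJudges() {}

    /** Passes when the response contains the expected text (case-sensitive). */
    static Verdict containsText(String response, String expected) {
        boolean ok = response.contains(expected);
        return new Verdict(ok, (ok ? "found \"" : "missing \"") + expected + "\"");
    }

    /** Passes when the regex matches anywhere in the response. */
    static Verdict matchesRegex(String response, String regex) {
        boolean ok = Pattern.compile(regex).matcher(response).find();
        return new Verdict(ok, ok ? "regex matched" : "regex did not match");
    }

    /** Passes when the response length in characters lies within [min, max]. */
    static Verdict lengthBetween(String response, int min, int max) {
        int n = response.length();
        boolean ok = n >= min && n <= max;
        return new Verdict(ok, "length " + n + (ok ? " within " : " outside ") + "[" + min + ", " + max + "]");
    }
}
```

Each check returns an explanation alongside the pass/fail flag, mirroring the "judge verdicts with explanations" item above. JSON schema and similarity judges would likely need an external library and are left out of this sketch.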
Design Spec
docs/superpowers/specs/2026-03-15-agent-evaluations-design.md
Implementation Plan
docs/superpowers/plans/2026-03-15-agent-evaluations.md
Phase 1 Scope
- Agent-level evaluations only (workflow-level deferred)
- Sequential scenario execution (parallel deferred)
- Informational results only (save gating deferred)
- TOOL_USAGE judge deferred (requires structured tool event capture)
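The Phase 1 execution model (sequential scenarios, agent-level judges applied to every scenario plus scenario-scoped judges, cancellable runs) could be sketched roughly as follows. Everything here is an assumption made for illustration: the `Judge`, `Scenario`, `ScenarioResult`, and `EvalRun` types are hypothetical, the agent is modelled as a plain `Function<String, String>`, and the score is simply the fraction of judges that passed. Multi-turn scenarios and LLM judges are not covered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Function;
import java.util.function.Predicate;

/** Hypothetical judge: a name plus a pass/fail check on the agent's response. */
record Judge(String name, Predicate<String> check) {}

/** Hypothetical single-turn scenario with its own scoped judges. */
record Scenario(String name, String message, List<Judge> judges) {}

/** Per-scenario result; the score is the fraction of judges that passed. */
record ScenarioResult(String scenario, int passed, int total) {
    double score() {
        return total == 0 ? 0.0 : (double) passed / total;
    }
}

/** Hypothetical run: executes scenarios one at a time and can be cancelled between scenarios. */
final class EvalRun {

    private final AtomicBoolean cancelled = new AtomicBoolean(false);

    void cancel() {
        cancelled.set(true);
    }

    List<ScenarioResult> execute(
        Function<String, String> agent, List<Judge> agentJudges, List<Scenario> scenarios) {

        List<ScenarioResult> results = new ArrayList<>();

        for (Scenario scenario : scenarios) {
            // Cancellation is checked before each scenario, so a running scenario finishes first.
            if (cancelled.get()) {
                break;
            }

            String response = agent.apply(scenario.message());

            // Agent-level judges apply to every scenario; scenario-level judges are added on top.
            List<Judge> judges = new ArrayList<>(agentJudges);
            judges.addAll(scenario.judges());

            int passed = 0;
            for (Judge judge : judges) {
                if (judge.check().test(response)) {
                    passed++;
                }
            }

            results.add(new ScenarioResult(scenario.name(), passed, judges.size()));
        }

        return results;
    }
}
```

To run this off the request thread, a caller could wrap `execute` in something like `CompletableFuture.supplyAsync(...)`; progress tracking could then be derived from the number of results collected so far. How the actual implementation schedules runs is not specified in this issue.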
Tech Stack
- Backend: Spring Boot 4, Spring Data JDBC, Spring AI, spring-ai-community/agent-judge
- Frontend: React 19, TypeScript, Zustand, TanStack Query, GraphQL