
[feature] AI Agent Evaluations #4553

@ivicac

Description

Summary

Add an automated evaluation system for ByteChef AI agents with test scenarios, LLM/deterministic judges, async run execution, and a dedicated UI panel — integrated with spring-ai-community/agent-judge.

Key Features

  • Test Scenarios: Single-turn (message → response) and multi-turn (simulated conversation) scenarios
  • Two-Level Judges: Agent-level judges that run on every scenario, plus scenario-level judges scoped to individual scenarios
  • Judge Types: LLM rule-based judges plus deterministic checks (contains text, regex, response length, JSON schema, similarity)
  • Async Execution: Runs execute asynchronously with progress tracking and cancellation
  • Results & History: Score tracking, judge verdicts with explanations, conversation transcript storage
  • UI Panel: New "Evals" tab in AI Agent Editor with Tests/Judges/Runs sub-tabs
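The deterministic judge types listed above (contains text, regex, response length, and so on) can be sketched as small, pure checks over the agent's response. The interface and class names below are illustrative assumptions, not the actual spring-ai-community/agent-judge API:

```java
import java.util.regex.Pattern;

/** Hypothetical verdict type: pass/fail plus a human-readable explanation. */
record Verdict(boolean pass, String explanation) {}

/** Hypothetical deterministic judge: a pure check over the response text. */
interface DeterministicJudge {
    Verdict judge(String response);
}

/** Passes when the response matches a regular expression. */
class RegexJudge implements DeterministicJudge {
    private final Pattern pattern;

    RegexJudge(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    @Override
    public Verdict judge(String response) {
        boolean pass = pattern.matcher(response).find();
        return new Verdict(pass, pass
            ? "Response matched /" + pattern + "/"
            : "Response did not match /" + pattern + "/");
    }
}

/** Passes when the response length falls within inclusive bounds. */
class ResponseLengthJudge implements DeterministicJudge {
    private final int min;
    private final int max;

    ResponseLengthJudge(int min, int max) {
        this.min = min;
        this.max = max;
    }

    @Override
    public Verdict judge(String response) {
        int len = response.length();
        boolean pass = len >= min && len <= max;
        return new Verdict(pass, "Length " + len + " (expected " + min + ".." + max + ")");
    }
}
```

Because these judges are deterministic and side-effect free, they are cheap to run on every scenario and easy to unit-test, in contrast to LLM rule-based judges, which need model calls and whose verdicts come with explanations rather than guarantees.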

Design Spec

docs/superpowers/specs/2026-03-15-agent-evaluations-design.md

Implementation Plan

docs/superpowers/plans/2026-03-15-agent-evaluations.md

Phase 1 Scope

  • Agent-level evaluations only (workflow-level deferred)
  • Sequential scenario execution (parallel deferred)
  • Informational results only (save gating deferred)
  • TOOL_USAGE judge deferred (requires structured tool event capture)
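The Phase 1 execution model (sequential scenarios with progress tracking and cancellation) can be sketched roughly as below; the class and method names are hypothetical, chosen for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Function;

/** Hypothetical sketch of a Phase 1 evaluation run: scenarios execute
 *  one at a time, progress is visible to pollers, and cancellation is
 *  cooperative (checked between scenarios, not mid-scenario). */
class EvaluationRun {
    private final AtomicBoolean cancelled = new AtomicBoolean(false);
    private volatile int completed = 0;

    void cancel() {
        cancelled.set(true);
    }

    int completedScenarios() {
        return completed;
    }

    /** Runs each scenario through the agent sequentially, stopping early
     *  if the run is cancelled between scenarios. */
    List<String> execute(List<String> scenarios, Function<String, String> agent) {
        List<String> results = new ArrayList<>();
        for (String scenario : scenarios) {
            if (cancelled.get()) {
                break; // cooperative cancellation point
            }
            results.add(agent.apply(scenario));
            completed++; // progress counter read by the UI's run poller
        }
        return results;
    }
}
```

Keeping execution sequential keeps ordering and progress reporting trivial; a later parallel version would mainly need to replace the loop with a bounded executor while preserving the same cancellation and progress semantics.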

Tech Stack

  • Backend: Spring Boot 4, Spring Data JDBC, Spring AI, spring-ai-community/agent-judge
  • Frontend: React 19, TypeScript, Zustand, TanStack Query, GraphQL

Metadata

Assignees

Labels

  • ai: Artificial Intelligence
  • backend: Concerning any and all backend issues
  • enhancement: New feature or request
  • frontend: Concerning any and all frontend issues

Projects

Status

Quarterly Release

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
