Holdout tests are more rigorous evaluations and are intentionally hidden from the agent being tested.
We should add framework support for running these holdout tests. For each optimized task implementation, the framework would launch an independent third-party agent that evaluates the optimized code against the holdout tests and produces a final evaluation report. This gives a more robust, unbiased assessment of both correctness and performance.
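As a rough sketch of what the in-framework piece could look like (all names here, such as `HoldoutReport` and `evaluate_with_holdouts`, are hypothetical; a real integration would run the third-party evaluator agent in a separate, isolated process so the tests stay hidden from the optimizing agent):

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List

@dataclass
class HoldoutReport:
    """Summary the evaluator agent returns for one optimized task."""
    task_name: str
    passed: int = 0
    failed: int = 0
    failures: List[str] = field(default_factory=list)

    @property
    def correct(self) -> bool:
        # A task is only considered correct if every holdout test passes.
        return self.failed == 0

def evaluate_with_holdouts(
    task_name: str,
    implementation: Callable[..., Any],
    holdout_tests: List[Callable[[Callable[..., Any]], None]],
) -> HoldoutReport:
    """Run holdout tests (hidden from the optimizing agent) against an
    optimized implementation and summarize the results in a report."""
    report = HoldoutReport(task_name)
    for test in holdout_tests:
        try:
            # Each holdout test receives the implementation under test
            # and raises AssertionError if it misbehaves.
            test(implementation)
            report.passed += 1
        except AssertionError as exc:
            report.failed += 1
            report.failures.append(f"{test.__name__}: {exc}")
    return report
```

For example, an optimized `square` implementation would be checked against holdout tests it never saw during optimization, and the resulting report would feed into the final evaluation:

```python
def test_small(f):
    assert f(3) == 9

def test_zero(f):
    assert f(0) == 0

report = evaluate_with_holdouts("square", lambda x: x * x, [test_small, test_zero])
```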