Add Holdout Test Evaluation via Independent Agent #9

@irvineoy

Description

Holdout tests are more rigorous evaluations and are intentionally hidden from the agent being tested.

We should add support in the framework for running these holdout tests. For each optimized task implementation, the framework would launch an independent agent, separate from the agent that produced the optimization, to run the holdout tests against the optimized code and generate a final evaluation report. This helps ensure a more robust and unbiased assessment of both correctness and performance.
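
To make the proposal concrete, here is a rough sketch of what the evaluation step might look like. Everything in it is an assumption for illustration, not existing framework code: the names `run_holdout_evaluation` and `HoldoutResult`, the `test_*.py` layout for holdout tests, and the report fields are all hypothetical. A plain subprocess stands in for the independent agent.

```python
import os
import shutil
import subprocess
import sys
import tempfile
import time
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class HoldoutResult:
    """Outcome of one holdout test script (hypothetical report entry)."""
    test_file: str
    passed: bool
    seconds: float


def run_holdout_evaluation(impl_dir: Path, holdout_dir: Path) -> dict:
    """Run each holdout test script against a copy of the optimized code.

    Assumes holdout tests are standalone scripts named test_*.py that exit
    with a non-zero status on failure. This layout is an assumption.
    """
    results = []
    with tempfile.TemporaryDirectory() as sandbox:
        # Work on a copy so the evaluator never touches the tested agent's
        # own working tree, and the holdout tests are never copied into it.
        workspace = Path(sandbox) / "impl"
        shutil.copytree(impl_dir, workspace)
        # Make the optimized code importable from the test scripts.
        env = {**os.environ, "PYTHONPATH": str(workspace)}
        for test in sorted(holdout_dir.glob("test_*.py")):
            start = time.perf_counter()
            # A separate process stands in for the independent agent here.
            proc = subprocess.run(
                [sys.executable, str(test.resolve())],
                cwd=workspace,
                env=env,
                capture_output=True,
                text=True,
            )
            elapsed = time.perf_counter() - start
            results.append(HoldoutResult(test.name, proc.returncode == 0, elapsed))
    # The final report: pass counts plus per-test timing.
    return {
        "total": len(results),
        "passed": sum(r.passed for r in results),
        "results": [asdict(r) for r in results],
    }
```

The returned dictionary could be serialized as the final evaluation report. A real implementation would likely also need timeouts, stricter isolation than a temporary directory, and a performance baseline to compare timings against.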
