Holdout tests are more rigorous evaluations and are intentionally hidden from the agent being tested.
We should add framework support for running these holdout tests. For each optimized task implementation, the framework would launch an independent third-party agent that evaluates the optimized code against the holdout tests and produces a final evaluation report. This gives a more robust, unbiased assessment of both correctness and performance.
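As a rough sketch of what the in-framework piece could look like (all names here, such as `HoldoutReport` and `evaluate_with_holdouts`, are hypothetical; a real integration would run the third-party evaluator agent in a separate, isolated process so the tests stay hidden from the optimizing agent):

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List

@dataclass
class HoldoutReport:
    """Summary the evaluator agent returns for one optimized task."""
    task_name: str
    passed: int = 0
    failed: int = 0
    failures: List[str] = field(default_factory=list)

    @property
    def correct(self) -> bool:
        # A task is only considered correct if every holdout test passes.
        return self.failed == 0

def evaluate_with_holdouts(
    task_name: str,
    implementation: Callable[..., Any],
    holdout_tests: List[Callable[[Callable[..., Any]], None]],
) -> HoldoutReport:
    """Run holdout tests (hidden from the optimizing agent) against an
    optimized implementation and summarize the results in a report."""
    report = HoldoutReport(task_name)
    for test in holdout_tests:
        try:
            # Each holdout test receives the implementation under test
            # and raises AssertionError if it misbehaves.
            test(implementation)
            report.passed += 1
        except AssertionError as exc:
            report.failed += 1
            report.failures.append(f"{test.__name__}: {exc}")
    return report
```

For example, an optimized `square` implementation would be checked against holdout tests it never saw during optimization, and the resulting report would feed into the final evaluation:

```python
def test_small(f):
    assert f(3) == 9

def test_zero(f):
    assert f(0) == 0

report = evaluate_with_holdouts("square", lambda x: x * x, [test_small, test_zero])
```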