This folder documents eval authoring and benchmark execution in this repository.
For methodology details and scoring definitions, see ../paper/benchmark-methodology-whitepaper.tex.
Category status follows the top-level folders under evals/: any category without a folder there is currently considered work in progress (WIP).
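The folder rule above can be checked mechanically. The following is a minimal sketch, not part of the repository tooling: the helper names `active_categories` and `is_wip` are hypothetical, and skipping hidden folders is an assumption made here for illustration.

```python
from pathlib import Path


def active_categories(evals_dir="evals"):
    """Return the sorted names of top-level folders under evals_dir."""
    root = Path(evals_dir)
    # Only directories count as categories. Skipping hidden folders
    # (names starting with ".") is an assumption, not a documented rule.
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_dir() and not entry.name.startswith(".")
    )


def is_wip(category, evals_dir="evals"):
    """Treat a category with no top-level folder under evals_dir as WIP."""
    return category not in active_categories(evals_dir)
```

Run from the repository root, `active_categories()` would list the categories that currently have a folder, and `is_wip("some-category")` would report whether a given name is still missing one.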
If you are starting fresh, read these in order:
1. adding-new-eval.md for eval directory contract and requirements.yaml behavior.
2. starter-scaffold-contract.md for baseline app/ starter policy.
3. testing-your-evals.md for focused verification before opening a PR.
4. adding-new-category.md for category README and requirement-design workflow.
For contribution workflow, command examples, and PR conventions, see ../CONTRIBUTING.md and ../AGENTS.md.