This repository contains the code for multiple Preparedness evals that use nanoeval and alcatraz. It requires Python 3.11 (3.12 is untested; 3.13 will break chz). The evals are:
- PaperBench: based on https://openai.com/index/paperbench/
- SWELancer: based on https://openai.com/index/swe-lancer/
- MLE-bench (Forthcoming): based on https://openai.com/index/mle-bench/
- SWE-bench (Forthcoming): based on https://openai.com/index/introducing-swe-bench-verified/
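
The Python version note above can be checked before installing anything. The following is a minimal sketch, not part of this repository: `check_python_version` is a hypothetical helper, and its handling of versions the note does not mention (anything below 3.11 or above 3.13) is an assumption.

```python
import sys


def check_python_version(version_info=sys.version_info):
    """Return (ok, message) for the interpreter constraint noted in this README.

    Hypothetical helper for illustration; it is not provided by this repository.
    """
    major, minor = version_info[0], version_info[1]
    if (major, minor) == (3, 11):
        return True, "Python 3.11: supported"
    if (major, minor) == (3, 12):
        # The README only says 3.12 is untested, so allow it with a warning.
        return True, "Python 3.12: untested, may work"
    if (major, minor) >= (3, 13):
        # The README names 3.13; treating later versions the same is an assumption.
        return False, f"Python {major}.{minor}: expected to break chz"
    # Versions below 3.11 are not mentioned in the README; rejecting them is an assumption.
    return False, f"Python {major}.{minor}: not covered by the README"


if __name__ == "__main__":
    ok, message = check_python_version()
    print(message)
    sys.exit(0 if ok else 1)
```

Run as a script, this prints a one-line status and exits non-zero when the interpreter is outside the range it accepts, so it can gate a setup step.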