
BioMedArena

If you find BioMedArena useful, please give the repository a star on GitHub to follow the latest updates.


Building a deep research agent today is an exercise in glue code: the same backbone evaluated on the same benchmark can report different accuracies in different papers because the harness and tool registry differ, and integrating a new foundation model into a comparable evaluation surface can cost weeks of model-specific engineering. We call this the per-paper engineering tax and release BioMedArena, an open-source toolkit and arena for fair comparison of foundation models as biomedical deep-research agents.

BioMedArena decouples six layers of biomedical agent evaluation: benchmark loading, tool exposure, tool selection, harness mode, context management, and scoring. The current public code exposes 166 registered benchmark entries (155 canonical benchmarks plus 11 deprecated compatibility aliases), 76 tools across 9 biomedical functional families, 4 modes, and 9 registered model backbone IDs. Adding a new model, benchmark, or tool reduces to registering a small provider adapter, loader, or schema/handler pair.
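As a shape illustration of what such a registration might look like, the sketch below builds a toy schema/handler registry. The registry, decorator, and pubmed_search handler are hypothetical stand-ins chosen only to convey the pattern, not the actual BioMedArena API; the real entry points live in the harness package.

from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolSchema:
    """Hypothetical schema describing a tool's name and JSON-style parameters."""
    name: str
    description: str
    parameters: dict

# Illustrative registry: maps tool names to (schema, handler) pairs.
TOOL_REGISTRY: dict[str, tuple[ToolSchema, Callable]] = {}

def register_tool(schema: ToolSchema) -> Callable:
    """Register a handler under its schema so a harness can expose it to models."""
    def decorator(handler: Callable) -> Callable:
        TOOL_REGISTRY[schema.name] = (schema, handler)
        return handler
    return decorator

@register_tool(ToolSchema(
    name="pubmed_search",
    description="Search PubMed and return matching article titles.",
    parameters={"query": {"type": "string"}, "max_results": {"type": "integer"}},
))
def pubmed_search(query: str, max_results: int = 5) -> list[str]:
    # A real handler would call an external literature API here.
    return [f"placeholder result for {query!r}"] * max_results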

Overall Performance

BioMedArena overall benchmark performance across public and commercial LLMs. Scroll horizontally to view all benchmark columns.

| Type | Model | Setting | HealthBench Hard (1000) | MedXpertQA Text | MedXpertQA Image | MedXpertQA Total (2450) | ProteinLMBench (944) | Medbullets (308) | SuperChem (500) | BixBench (205) | HLE-Gold Bio | HLE-Gold Chem | HLE-Gold Total (149) | LAB-Bench 2 (821) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SOTA | - | - | 42.8 | 71.5 | 62.2 | 57.5 | 63.2 | - | 38.5 | 80.5 | 44.2 | 53.2 | 46.8 | 80.0 |
| Public LLMs | Trinity-Large-Thinking | Baseline | - | - | - | - | - | - | - | 5.4 | 15.0 | 7.1 | 12.8 | 43.5 |
| | Trinity-Large-Thinking | Ours | - | - | - | - | - | - | - | 26.3 | 19.6 | 19.0 | 19.5 | 61.0 |
| | NVIDIA Nemotron-3 Super 120B | Baseline | - | - | - | - | - | - | - | 13.2 | 13.1 | 4.8 | 10.7 | 43.6 |
| | NVIDIA Nemotron-3 Super 120B | Ours | - | - | - | - | - | - | - | 10.7 | 28.0 | 33.3 | 29.5 | 62.6 |
| | INTELLECT-3.1 | Baseline | - | - | - | - | - | - | - | 5.4 | 20.6 | 9.5 | 17.4 | 43.5 |
| | INTELLECT-3.1 | Ours | - | - | - | - | - | - | - | 20.0 | 26.2 | 19.0 | 24.2 | 57.9 |
| | GLM-4.5 | Baseline | - | - | - | - | - | - | - | 23.9 | 16.8 | 2.4 | 12.8 | 44.8 |
| | GLM-4.5 | Ours | - | - | - | - | - | - | - | 30.2 | 28.0 | 28.6 | 28.2 | 61.5 |
| | Qwen3.5-397B-A17B | Baseline | - | - | - | - | - | - | - | 22.4 | 15.0 | 14.3 | 14.8 | 42.5 |
| | Qwen3.5-397B-A17B | Ours | - | - | - | - | - | - | - | 35.1 | 28.0 | 14.3 | 24.2 | 57.9 |
| Commercial LLMs | GPT-5.4 | Baseline | - | - | - | - | 57.7 | 22.6 | 41.2 | 40.0 | 43.0 | 33.3 | 40.3 | 48.5 |
| | GPT-5.4 | Ours | - | - | - | - | 66.8 | 58.3 | 62.8 | 49.8 | 48.6 | 54.8 | 50.3 | 72.1 |
| | Gemini 3 Flash | Baseline | 59.7 | 52.9 | 65.7 | 64.9 | 49.1 | 10.2 | 30.8 | 38.5 | 39.3 | 28.6 | 36.2 | 51.9 |
| | Gemini 3 Flash | Ours | 80.7 | 63.6 | 72.1 | 85.4 | 54.0 | 34.5 | 44.8 | 69.3 | 51.4 | 47.6 | 50.3 | 59.8 |
| | Gemini 3.1 Pro | Baseline | 72.0 | 58.9 | 70.3 | 70.1 | 54.3 | 19.2 | 37.8 | 42.4 | 41.1 | 47.6 | 43.0 | 52.1 |
| | Gemini 3.1 Pro | Ours | 80.8 | 72.0 | 77.0 | 91.6 | 57.4 | 62.6 | 59.8 | 85.9 | 49.5 | 50.0 | 49.7 | 70.9 |
| | Claude Sonnet 4.5 | Baseline | 67.9 | 45.9 | 58.7 | 88.3 | 29.1 | 17.4 | 23.6 | 17.1 | 22.4 | 16.7 | 20.8 | 48.1 |
| | Claude Sonnet 4.5 | Ours | 75.3 | 60.1 | 72.8 | 91.2 | 39.6 | 31.9 | 35.4 | 42.4 | 42.1 | 40.5 | 41.6 | 71.5 |
| | Claude Sonnet 4.6 | Baseline | 69.3 | 51.0 | 60.3 | 86.0 | 40.4 | 24.7 | 33.0 | 40.5 | 24.3 | 21.4 | 23.5 | 49.3 |
| | Claude Sonnet 4.6 | Ours | 86.0 | 62.4 | 64.7 | 89.0 | 66.0 | 53.6 | 58.6 | 48.3 | 42.1 | 50.0 | 44.3 | 74.3 |
| | Claude Opus 4.5 | Baseline | 72.5 | 49.4 | 63.8 | 87.3 | 48.3 | 24.3 | 37.0 | 41.0 | 30.8 | 31.0 | 30.9 | 49.8 |
| | Claude Opus 4.5 | Ours | 78.9 | 62.0 | 69.8 | 90.3 | 59.2 | 41.3 | 50.8 | 50.2 | 49.5 | 50.0 | 49.7 | 73.8 |
| | Claude Opus 4.6 | Baseline | 76.2 | 55.8 | 64.4 | 89.9 | 62.3 | 30.2 | 47.2 | 42.0 | 36.4 | 42.9 | 38.3 | 51.2 |
| | Claude Opus 4.6 | Ours | 80.2 | 66.0 | 71.8 | 92.2 | 72.8 | 57.9 | 65.8 | 53.7 | 55.1 | 59.5 | 56.4 | 82.3 |

Documentation

The root README stays short on purpose. Detailed release information lives in docs/.

Quick Check

After installing dependencies, run the offline smoke suite:

python3 scripts/run_quick_suite.py

Expected healthy output:

  • 166 registered benchmarks
  • 76 registered tools
  • 4 registered modes
  • 20/20 scorer checks passed

For the stricter offline release gate:

python3 scripts/release_gate.py --strict

Prepare the full BixBench agent setting only when you need the official capsule-backed protocol. This downloads the large CapsuleFolder-{uuid}.zip files explicitly instead of during normal benchmark loading:

biomedarena prepare-bixbench --revision main --extract
docker build -t biomedarena/bixbench-sandbox:latest docker/bixbench
biomedarena run --benchmark bixbench --bixbench-form open --bixbench-capsules \
  --backbone gemini-3-flash-preview --tools biomed --reasoning-mode heavy

If you are running fully offline from a cache you prepared yourself, add --bixbench-offline-metadata to the run command. The open-form path is official-compatible: it uses the public BixBench rows, mounted data capsules, and eval_mode-specific scoring; note that the external FutureHouse evaluator is not vendored in this repository.
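For example, a fully offline capsule run composed from the flags above would look like this:

biomedarena run --benchmark bixbench --bixbench-form open --bixbench-capsules \
  --bixbench-offline-metadata \
  --backbone gemini-3-flash-preview --tools biomed --reasoning-mode heavy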

Installation

git clone https://github.com/AI-in-Health/BioMedArena.git
cd BioMedArena

python3.11 -m venv .venv
source .venv/bin/activate

python -m pip install -e ".[dev,eval,provider-gemini]"

cp .env.example .env

Fill at least one model provider key in .env:

OPENAI_API_KEY=<your-openai-api-key>
ANTHROPIC_API_KEY=<your-anthropic-api-key>
GEMINI_API_KEY=<your-gemini-api-key>
HF_TOKEN=<your-huggingface-token-for-gated-benchmarks>

Gated HuggingFace datasets also require accepting the dataset terms in the browser before HF_TOKEN can load them. See .env.example for optional domain-specific keys such as NCBI, OMIM, Serper, and Jina.

Basic Usage

List available resources:

biomedarena list-benchmarks
biomedarena list-backbones
biomedarena list-modes

The package name and command-line entry point are both biomedarena. Environment variables use the BIOMEDARENA_ prefix.

Run one benchmark cell:

biomedarena run \
  --benchmark medcalc \
  --backbone gemini-2.5-flash \
  --tools biomed --reasoning-mode light \
  --limit 5 \
  --output result.json
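To sanity-check a finished run, you can inspect the output file. The sketch below assumes only that --output writes a JSON document; it prints the top-level structure without assuming any particular schema:

import json

# Print the top-level structure of result.json without assuming a schema.
with open("result.json") as f:
    result = json.load(f)

if isinstance(result, dict):
    for key, value in result.items():
        print(f"{key}: {type(value).__name__}")
else:
    print(f"top-level type: {type(result).__name__}, items: {len(result)}")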

Run a small matrix cell:

python3 scripts/run_matrix.py \
  --config configs/matrix_default.yaml \
  --only medcalc,gemini,simple_llm \
  --limit-override 1

For the 7-setting quick-start experiment suite used to compare thinking, domain tools, web search, and combined tool use, see quick_run.sh.

Check official source accessibility before spending model budget:

python3 scripts/verify_benchmark_sources.py --benchmarks all

Execution Modes

The public CLI exposes four modes:

| Mode | Purpose |
|---|---|
| simple_llm | Pure model baseline, no tools. |
| deep_think | Native model reasoning/thinking path where supported. |
| light | Single-turn function/tool calling. |
| heavy | Multi-turn ReAct loop with tool retrieval. |

A unified set of CLI flags is also available via --tools / --reasoning-mode / --enable-thinking, which map to the modes above:

| --tools | --reasoning-mode | Internal mode | Thinking |
|---|---|---|---|
| off | (n/a) | deep_think | ON (default) |
| off + --enable-thinking 0 | (n/a) | simple_llm | OFF |
| biomed / search / all | light | light | OFF |
| biomed / search / all | heavy | heavy | ON |

The legacy --mode / --web-tools flags remain supported for backward compatibility. Add --self-consistency to wrap any mode with majority voting.
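For example, with a benchmark and backbone taken from the earlier run command, the four rows above correspond to invocations like:

biomedarena run --benchmark medcalc --backbone gemini-2.5-flash --tools off
biomedarena run --benchmark medcalc --backbone gemini-2.5-flash --tools off --enable-thinking 0
biomedarena run --benchmark medcalc --backbone gemini-2.5-flash --tools biomed --reasoning-mode light
biomedarena run --benchmark medcalc --backbone gemini-2.5-flash --tools biomed --reasoning-mode heavy

These map to deep_think, simple_llm, light, and heavy, respectively.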

Security

python_exec can execute model-supplied Python with a timeout and basic denylist checks. Treat this as a convenience guard, not a hardened sandbox. Run untrusted workloads in an isolated container or VM, keep secrets out of the working directory, and disable code-execution and web-search tools on private data unless you have reviewed the applicable data-handling policy.
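One illustrative isolation pattern (the base image, mount layout, and install command below are assumptions, not part of the toolkit) is to wrap a run in a disposable container so that model-generated code executes inside the container rather than directly on the host; mount only what the run needs:

docker run --rm -it \
  --env-file .env \
  -v "$(pwd)":/work -w /work \
  python:3.11-slim \
  bash -c "pip install -e '.[dev,eval]' && biomedarena run --benchmark medcalc --backbone gemini-2.5-flash --tools biomed --reasoning-mode heavy --limit 5"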

External tools may call third-party APIs and public databases. Review the benchmark and tool inventories before running sensitive workloads.

Testing

python3 scripts/run_quick_suite.py            # offline smoke suite
python3 scripts/release_gate.py --strict      # strict offline release gate
python3 -m pytest tests/unit -q               # unit tests
# Smoke tests against empty HuggingFace caches and blank tokens, to confirm offline behavior:
HF_HOME=/tmp/biomedarena_hf_empty \
HF_DATASETS_CACHE=/tmp/biomedarena_hf_datasets_empty \
HF_TOKEN= HUGGING_FACE_HUB_TOKEN= HUGGINGFACE_HUB_TOKEN= \
python3 -m pytest tests/smoke -q -m "not slow"

Citation

@article{wu2026biomedarena,
  title={BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents},
  author={Wu, J and Zhou, H and Zeng, M and Zhu, J and Wu, J and Pan, J and Wu, S and Wu, H and Liu, F and Clifton, D A},
  journal={arXiv preprint arXiv:2605.06177},
  year={2026}
}

License

See LICENSE. Ported life-science skill attribution is tracked in harness/tools/openai_ported/NOTICE.md.
