BioMedArena

BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents

If you find BioMedArena useful, please give us a star on GitHub for the latest updates.

Building a deep research agent today is an exercise in glue code: the same backbone evaluated on the same benchmark can report different accuracies in different papers because the harness and tool registry differ, and integrating a new foundation model into a comparable evaluation surface can cost weeks of model-specific engineering. We call this the per-paper engineering tax and release BioMedArena, an open-source toolkit and arena for fair comparison of foundation models as biomedical deep-research agents.

BioMedArena decouples six layers of biomedical agent evaluation: benchmark loading, tool exposure, tool selection, harness mode, context management, and scoring. The current public code exposes 166 registered benchmark entries (155 canonical benchmarks plus 11 deprecated compatibility aliases), 76 tools across 9 biomedical functional families, 4 modes, and 9 registered model backbone IDs. Adding a new model, benchmark, or tool reduces to registering a small provider adapter, loader, or schema/handler pair.

Overall Performance

Scroll horizontally to view all benchmark columns.

Type	Model	Setting	HealthBench Hard (1000)	MedXpertQA (2450)	ProteinLMBench (944)	Medbullets (308)	SuperChem (500)			BixBench (205)	HLE-Gold (149)			LAB-Bench 2 (821)
Type	Model	Setting	HealthBench Hard (1000)	MedXpertQA (2450)	ProteinLMBench (944)	Medbullets (308)	Text	Image	Total	BixBench (205)	Bio	Chem	Total	LAB-Bench 2 (821)
	SOTA	-	42.8	71.5	62.2	57.5	63.2	-	38.5	80.5	44.2	53.2	46.8	80.0
Public LLMs	Trinity-Large-Thinking	Baseline	-	-	-	-	-	-	-	5.4	15.0	7.1	12.8	43.5
	Trinity-Large-Thinking	Ours	-	-	-	-	-	-	-	26.3	19.6	19.0	19.5	61.0
	NVIDIA Nemotron-3 Super 120B	Baseline	-	-	-	-	-	-	-	13.2	13.1	4.8	10.7	43.6
	NVIDIA Nemotron-3 Super 120B	Ours	-	-	-	-	-	-	-	10.7	28.0	33.3	29.5	62.6
	INTELLECT-3.1	Baseline	-	-	-	-	-	-	-	5.4	20.6	9.5	17.4	43.5
	INTELLECT-3.1	Ours	-	-	-	-	-	-	-	20.0	26.2	19.0	24.2	57.9
	GLM-4.5	Baseline	-	-	-	-	-	-	-	23.9	16.8	2.4	12.8	44.8
	GLM-4.5	Ours	-	-	-	-	-	-	-	30.2	28.0	28.6	28.2	61.5
	Qwen3.5-397B-A17B	Baseline	-	-	-	-	-	-	-	22.4	15.0	14.3	14.8	42.5
	Qwen3.5-397B-A17B	Ours	-	-	-	-	-	-	-	35.1	28.0	14.3	24.2	57.9
Commercial LLMs	GPT-5.4	Baseline	-	-	-	-	57.7	22.6	41.2	40.0	43.0	33.3	40.3	48.5
	GPT-5.4	Ours	-	-	-	-	66.8	58.3	62.8	49.8	48.6	54.8	50.3	72.1
	Gemini 3 Flash	Baseline	59.7	52.9	65.7	64.9	49.1	10.2	30.8	38.5	39.3	28.6	36.2	51.9
	Gemini 3 Flash	Ours	80.7	63.6	72.1	85.4	54.0	34.5	44.8	69.3	51.4	47.6	50.3	59.8
	Gemini 3.1 Pro	Baseline	72.0	58.9	70.3	70.1	54.3	19.2	37.8	42.4	41.1	47.6	43.0	52.1
	Gemini 3.1 Pro	Ours	80.8	72.0	77.0	91.6	57.4	62.6	59.8	85.9	49.5	50.0	49.7	70.9
	Claude Sonnet 4.5	Baseline	67.9	45.9	58.7	88.3	29.1	17.4	23.6	17.1	22.4	16.7	20.8	48.1
	Claude Sonnet 4.5	Ours	75.3	60.1	72.8	91.2	39.6	31.9	35.4	42.4	42.1	40.5	41.6	71.5
	Claude Sonnet 4.6	Baseline	69.3	51.0	60.3	86.0	40.4	24.7	33.0	40.5	24.3	21.4	23.5	49.3
	Claude Sonnet 4.6	Ours	86.0	62.4	64.7	89.0	66.0	53.6	58.6	48.3	42.1	50.0	44.3	74.3
	Claude Opus 4.5	Baseline	72.5	49.4	63.8	87.3	48.3	24.3	37.0	41.0	30.8	31.0	30.9	49.8
	Claude Opus 4.5	Ours	78.9	62.0	69.8	90.3	59.2	41.3	50.8	50.2	49.5	50.0	49.7	73.8
	Claude Opus 4.6	Baseline	76.2	55.8	64.4	89.9	62.3	30.2	47.2	42.0	36.4	42.9	38.3	51.2
	Claude Opus 4.6	Ours	80.2	66.0	71.8	92.2	72.8	57.9	65.8	53.7	55.1	59.5	56.4	82.3

Documentation

The root README stays short on purpose. Detailed release information lives in docs/:

Quick Check

After installing dependencies, run the offline smoke suite:

python3 scripts/run_quick_suite.py

Expected healthy output:

166 registered benchmarks
76 registered tools
4 registered modes
20/20 scorer checks passed

For the stricter offline release gate:

python3 scripts/release_gate.py --strict

Prepare the full BixBench agent setting only when you need the official capsule-backed protocol. This downloads the large CapsuleFolder-{uuid}.zip files explicitly instead of during normal benchmark loading:

biomedarena prepare-bixbench --revision main --extract
docker build -t biomedarena/bixbench-sandbox:latest docker/bixbench
biomedarena run --benchmark bixbench --bixbench-form open --bixbench-capsules \
  --backbone gemini-3-flash-preview --tools biomed --reasoning-mode heavy

If you are running fully offline from a cache you prepared yourself, add --bixbench-offline-metadata to the run command. The open-form path is official-compatible: it uses the public BixBench rows, mounted data capsules, and eval_mode-specific scoring, while the external FutureHouse evaluator is not vendored in this repository.

Installation

git clone https://github.com/AI-in-Health/BioMedArena.git
cd BioMedArena

python3.11 -m venv .venv
source .venv/bin/activate

python -m pip install -e ".[dev,eval,provider-gemini]"

cp .env.example .env

Fill at least one model provider key in .env:

OPENAI_API_KEY=<your-openai-api-key>
ANTHROPIC_API_KEY=<your-anthropic-api-key>
GEMINI_API_KEY=<your-gemini-api-key>
HF_TOKEN=<your-huggingface-token-for-gated-benchmarks>

Gated HuggingFace datasets also require accepting the dataset terms in the browser before HF_TOKEN can load them. See .env.example for optional domain-specific keys such as NCBI, OMIM, Serper, and Jina.

Basic Usage

List available resources:

biomedarena list-benchmarks
biomedarena list-backbones
biomedarena list-modes

The package name and command-line entry point are both biomedarena. Environment variables use the BIOMEDARENA_ prefix.

Run one benchmark cell:

biomedarena run \
  --benchmark medcalc \
  --backbone gemini-2.5-flash \
  --tools biomed --reasoning-mode light \
  --limit 5 \
  --output result.json

Run a small matrix cell:

python3 scripts/run_matrix.py \
  --config configs/matrix_default.yaml \
  --only medcalc,gemini,simple_llm \
  --limit-override 1

For the 7-setting quick-start experiment suite used to compare thinking, domain tools, web search, and combined tool use, see quick_run.sh.

Check official source accessibility before spending model budget:

python3 scripts/verify_benchmark_sources.py --benchmarks all

Execution Modes

The public CLI exposes four modes:

Mode	Purpose
`simple_llm`	Pure model baseline, no tools.
`deep_think`	Native model reasoning/thinking path where supported.
`light`	Single-turn function/tool calling.
`heavy`	Multi-turn ReAct loop with tool retrieval.

A unified CLI interface is also available via --tools / --reasoning-mode / --enable-thinking flags, which map to the modes above:

`--tools`	`--reasoning-mode`	Internal mode	Thinking
`off`	(n/a)	`deep_think`	ON (default)
`off` + `--enable-thinking 0`	(n/a)	`simple_llm`	OFF
`biomed` / `search` / `all`	`light`	`light`	OFF
`biomed` / `search` / `all`	`heavy`	`heavy`	ON

The legacy --mode / --web-tools flags remain supported for backward compatibility. Add --self-consistency to wrap any mode with majority voting.

Security

python_exec can execute model-supplied Python with timeout and basic denylist checks. Treat this as a convenience guard, not a hardened sandbox. Run untrusted workloads in an isolated container or VM, keep secrets out of the working directory, and disable code-execution or web-search tools for private data unless you have reviewed the policy.

External tools may call third-party APIs and public databases. Review the benchmark and tool inventories before running sensitive workloads.

Testing

python3 scripts/run_quick_suite.py
python3 scripts/release_gate.py --strict
python3 -m pytest tests/unit -q
HF_HOME=/tmp/biomedarena_hf_empty \
HF_DATASETS_CACHE=/tmp/biomedarena_hf_datasets_empty \
HF_TOKEN= HUGGING_FACE_HUB_TOKEN= HUGGINGFACE_HUB_TOKEN= \
python3 -m pytest tests/smoke -q -m "not slow"

Citation

@article{wu2026biomedarena,
  title={BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents},
  author={Wu, J and Zhou, H and Zeng, M and Zhu, J and Wu, J and Pan, J and Wu, S and Wu, H and Liu, F and Clifton, D A},
  journal={arXiv preprint arXiv:2605.06177},
  year={2026}
}

License

See LICENSE. Ported life-science skill attribution is tracked in harness/tools/openai_ported/NOTICE.md.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
.github/workflows		.github/workflows
baselines/langgraph_react		baselines/langgraph_react
configs		configs
docker/bixbench		docker/bixbench
docs		docs
harness		harness
scripts		scripts
tests		tests
vendors		vendors
.env.example		.env.example
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
config.yaml		config.yaml
config_claude.yaml		config_claude.yaml
config_gemini.yaml		config_gemini.yaml
pyproject.toml		pyproject.toml
quick_run.sh		quick_run.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BioMedArena

BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents

If you find BioMedArena useful, please give us a star on GitHub for the latest updates.

Overall Performance

Documentation

Quick Check

Installation

Basic Usage

Execution Modes

Security

Testing

Citation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

BioMedArena

BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents

If you find BioMedArena useful, please give us a star on GitHub for the latest updates.

Overall Performance

Documentation

Quick Check

Installation

Basic Usage

Execution Modes

Security

Testing

Citation

License

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages