BioSpecDec

BioSpecDec is a benchmarking and analysis toolkit for speculative decoding on genomic and protein language models (DNAGPT, ProGen2, ProtGPT2). It provides multi-GPU sweep runners for large hyperparameter grids, offline likelihood/perplexity scoring to compare speculative vs. baseline decoding, and utilities to compute grouped statistics and generate paper-ready figures. The code is organized so you can easily reproduce the experiments from our manuscript, run new sweeps on your own prompts, or perform quick one-prompt sanity checks for any supported model.

This repository contains:

Multi-GPU sweep runners (sharded across GPUs) that write wide CSVs.
Offline likelihood/perplexity scoring that augments those CSVs.
Grouped statistics + plotting scripts used to generate paper-ready figures.

The “current implementation” is:

All wrappers in scripts/
dnagpt_hg38_sweep.py (DNAGPT hg38 sweep runner)
likelihood_scoring.py (DNAGPT CSV scoring)
scoring_specdec_beam_search.py (core benchmark runner used by the sweep)
pipeline/ (protein/progen2/protgpt2 utilities)

0) Environment (venv)

All shell scripts call python directly, so you must activate an environment first.

Create venv

cd /path/to/BioSpecDec

python -m venv .venv
source .venv/bin/activate

python -m pip install --upgrade pip
pip install -r requirements.txt

Install PyTorch (pick ONE option)

PyTorch is not pinned in requirements.txt because the correct build depends on your machine.

CPU-only:

pip install torch

CUDA (example for CUDA 12.1 wheels):

pip install torch --index-url https://download.pytorch.org/whl/cu121

Notes:

If you want ProGen2 model-sharding with --device_map auto, you may need pip install accelerate (already included in requirements.txt).

Optional: conda

If you prefer conda, an example environment file is provided in environment.yml.

1) DNAGPT dependency (code + weights)

You need the DNAGPT repository and checkpoints available relative to this folder. Clone the upstream repo.

git clone https://github.com/TencentAILabHealthcare/DNAGPT.git

Place it at ./DNAGPT (same level as the scripts here) and ensure checkpoints exist at:

DNAGPT/checkpoints/dna_gpt0.1b_h.pth
DNAGPT/checkpoints/dna_gpt0.1b_m.pth
DNAGPT/checkpoints/dna_gpt3b_m.pth

To download these checkpoints with gdown (from Google Drive):

cd DNAGPT
mkdir -p checkpoints
cd checkpoints

pip install gdown
gdown --id 15m6CH3zaMSqflOaf6ec5VPfiulg-Gh0u -O dna_gpt0.1b_h.pth
gdown --id 1C0BRXfz7RNtCSjSY1dKQeR1yP7I3wTyx -O dna_gpt0.1b_m.pth
gdown --id 1pQ3Ai7C-ObzKkKTRwuf6eshVneKHzYEg -O dna_gpt3b_m.pth

The DNAGPT runner imports DNAGPT via the repo-local path (see the code in scoring_specdec_beam_search.py).

2) hg38 input file (hg38_sequences.csv)

The DNAGPT hg38 sweep reads prompts from:

hg38_sequences.csv

Expected CSV columns:

seq (required): raw DNA sequence (bases)
id (optional): unique identifier used as prompt_id/hg_id in outputs
chrom, start, end (optional): genomic coordinates; included in output for provenance

How it is used:

dnagpt_hg38_sweep.py reads rows and constructs a prompt string: prompt_text = hg_prefix + seq
hg_prefix defaults to "" and is configurable via the sweep JSON.
The prompt is tokenized and optionally truncated to max_prompt_tokens.

3) DNAGPT hg38 sweep (multi-GPU)

Primary script you used:

bash scripts/run_dna_specdec_multigpu_and_score.sh \
  sweeps/dnagpt/dnagpt_hg38_sweep.json \
  results/dnagpt_hg38_wide.csv

What this does:

Runs a sharded sweep across GPUs via scripts/run_dna_specdec_multigpu.sh.
Merges the shard CSVs into a single wide CSV.
Runs DNAGPT likelihood scoring on the merged wide CSV and writes results/dnagpt_hg38_wide_scored.csv.

GPU selection:

If CUDA_VISIBLE_DEVICES is set, the wrapper uses those GPUs.
Otherwise it uses torch.cuda.device_count().

Sweep configs (JSON)

The DNAGPT sweep config is a JSON file with three main sections:

(Some people informally call these “YAML sweep files”; in this repo they are JSON.)
base: fixed parameters (models, weights, device/dtype, etc.)
grid: lists of hyperparameters to sweep (cartesian product)
prompt_source: how prompts are sampled from hg38_sequences.csv

Relevant prompt_source keys (see dnagpt_hg38_sweep.py):

hg_csv (default: hg38_sequences.csv)
hg_prefix (default: )
num_prompts
hg_row_indices (explicit row indices) or hg_ids (explicit ids)
seed

4) Protein sweeps (ProGen2 / ProtGPT2) (multi-GPU)

Protein sweeps run through pipeline/ and are launched with:

bash scripts/run_protein_specdec_multigpu.sh \
  sweeps/protein/protgpt2_speed_sweep.json \
  "" \
  results/protgpt2_wide.csv

To run sweep + likelihood scoring:

bash scripts/run_protein_specdec_multigpu_and_score.sh \
  sweeps/protein/protgpt2_speed_sweep.json \
  results/protgpt2_wide.csv \
  protgpt2 nferruz/ProtGPT2

5) Grouped statistics + plots

After you have wide CSVs (optionally scored), compute grouped stats:

bash scripts/run_compute_statistics.sh \
  results/dnagpt_hg38_wide_scored.csv \
  results/protgpt2_wide_scored.csv

Then generate plots:

bash scripts/run_plot_statistics.sh

6) One-prompt runs (instant sanity checks)

One-prompt runs use the unified runner in pipeline/run_generate.py.

It supports three model families:

protgpt2 (HuggingFace)
progen2 (HuggingFace, trust_remote_code)
dnagpt (local DNAGPT repo + checkpoint .pth files)

The unified CLI always writes:

output_jsonl: per-sample records (prompt ids, generated ids, speed, acceptance rate)
output_csv: one-row summary of averages

Protein one prompt (ProtGPT2 / ProGen2)

Convenience wrapper:

bash scripts/run_protein_one.sh protgpt2 "<|endoftext|>M" \
  results/protgpt2_one.jsonl results/protgpt2_one_summary.csv

Direct CLI (same underlying runner):

python -m pipeline.run_generate \
  --model_family protgpt2 \
  --method specdec \
  --prompt "<|endoftext|>M" \
  --num_tokens 256 \
  --num_samples 1 \
  --temperature 1.0 \
  --top_k 0 \
  --top_p 0.0 \
  --gamma 5 \
  --accept_mode prob \
  --target_model_name nferruz/ProtGPT2 \
  --draft_model_name "(truncated)" \
  --draft_layers 6 \
  --device cuda \
  --dtype float16 \
  --output_jsonl results/protgpt2_one.jsonl \
  --output_csv results/protgpt2_one_summary.csv

For ProGen2, set model_family/progen2 names, and optionally --device_map auto if you want HF sharding.

DNAGPT one prompt (custom prompt + temperature)

Convenience wrapper:

TEMPERATURE=1.0 TOPP=0.95 NUM_TOKENS=256 METHOD=specdec bash scripts/run_dna_generate.sh \
  "<R>ACGTACGTACGT" \
  results/dnagpt_one.jsonl results/dnagpt_one_summary.csv

Direct CLI:

python -m pipeline.run_generate \
  --model_family dnagpt \
  --method specdec \
  --prompt "<R>ACGTACGTACGT" \
  --num_tokens 256 \
  --num_samples 1 \
  --temperature 1.0 \
  --top_k 0 \
  --top_p 0.95 \
  --gamma 5 \
  --accept_mode prob \
  --target_model_name dna_gpt3b_m \
  --draft_model_name dna_gpt0.1b_m \
  --target_weight DNAGPT/checkpoints/dna_gpt3b_m.pth \
  --draft_weight DNAGPT/checkpoints/dna_gpt0.1b_m.pth \
  --device cuda \
  --dtype float16 \
  --output_jsonl results/dnagpt_one.jsonl \
  --output_csv results/dnagpt_one_summary.csv

7) Layer selection ablation

Compares three layer selection strategies for constructing truncated draft models: First-N (bottom layers), Last-N (top layers), and Mixed (evenly-spaced). Evaluated on ProtGPT2 and ProGen2-xlarge across 3, 4, and 6 draft layers.

Run the experiment

python scripts/layer_selection_experiment.py --model both --device cuda

This produces per-sample CSVs in results/layer_selection/.

Generate summary + plots

python scripts/create_layer_selection_summary.py
python scripts/plot_layer_selection.py

Outputs:

results/layer_selection/layer_selection_summary.csv — mean ± std per (model, strategy, n_draft_layers)
results/layer_selection_plots/layer_strategy_comparison.png — bar chart (average across layer counts)
results/layer_selection_plots/layer_strategy_lines.png — line plot (by number of layers)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BioSpecDec

0) Environment (venv)

Create venv

Install PyTorch (pick ONE option)

Optional: conda

1) DNAGPT dependency (code + weights)

2) hg38 input file (hg38_sequences.csv)

3) DNAGPT hg38 sweep (multi-GPU)

Sweep configs (JSON)

4) Protein sweeps (ProGen2 / ProtGPT2) (multi-GPU)

5) Grouped statistics + plots

6) One-prompt runs (instant sanity checks)

Protein one prompt (ProtGPT2 / ProGen2)

DNAGPT one prompt (custom prompt + temperature)

7) Layer selection ablation

Run the experiment

Generate summary + plots

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
DNAGPT		DNAGPT
pipeline		pipeline
scripts		scripts
sweeps		sweeps
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
dnagpt_hg38_sweep.py		dnagpt_hg38_sweep.py
environment.yml		environment.yml
hg38_sequences.csv		hg38_sequences.csv
likelihood_scoring.py		likelihood_scoring.py
requirements.txt		requirements.txt
scoring_specdec_beam_search.py		scoring_specdec_beam_search.py
specdec_progen2_truncated.py		specdec_progen2_truncated.py
specdec_protein.py		specdec_protein.py

Folders and files

Latest commit

History

Repository files navigation

BioSpecDec

0) Environment (venv)

Create venv

Install PyTorch (pick ONE option)

Optional: conda

1) DNAGPT dependency (code + weights)

2) hg38 input file (hg38_sequences.csv)

3) DNAGPT hg38 sweep (multi-GPU)

Sweep configs (JSON)

4) Protein sweeps (ProGen2 / ProtGPT2) (multi-GPU)

5) Grouped statistics + plots

6) One-prompt runs (instant sanity checks)

Protein one prompt (ProtGPT2 / ProGen2)

DNAGPT one prompt (custom prompt + temperature)

7) Layer selection ablation

Run the experiment

Generate summary + plots

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages