OPT identifies potential off-target binding of probe sequences against a reference transcriptome using nucleotide alignment (nucmer). The goal of OPT is to help evaluate probe specificity before experiments by detecting probes that may hybridize to unintended transcripts.
Hallinan et al., eLife 2025. https://elifesciences.org/reviewed-preprints/107070
OPT can be used via a point-and-click web interface (recommended) or directly from the command line.
OPT has been tested on Linux and macOS. The fastest way to install is with the provided script:
git clone https://github.com/JEFworks-Lab/off-target-probe-tracker.git
cd off-target-probe-tracker
bash install.sh
This script will:
- Create a conda environment named opt from environment.yml
- Install mummer4 (via conda on Linux, via Homebrew on macOS)
- Install the opt Python package
- Decompress example probes and all reference annotations
To install manually on Linux instead:
conda create --name opt pip python=3.9
conda activate opt
conda config --add channels bioconda
conda config --add channels conda-forge
conda install gffread bowtie2 samtools mummer4
git clone https://github.com/JEFworks-Lab/off-target-probe-tracker.git
cd off-target-probe-tracker
pip install .
On macOS, run the same steps but leave mummer4 out of the conda install:
conda create --name opt pip python=3.9
conda activate opt
conda config --add channels bioconda
conda config --add channels conda-forge
conda install gffread bowtie2 samtools
git clone https://github.com/JEFworks-Lab/off-target-probe-tracker.git
cd off-target-probe-tracker
pip install .
mummer4 must be installed via Homebrew on macOS (conda does not support it):
brew install autoconf automake libtool md5sha1sum
gem install yaggo
brew install mummer
Note: mummer version >= 4.0.1 is required. Run mummer -h to confirm a successful install.
The easiest way to use OPT is through the interactive web app:
conda activate opt
streamlit run app.py
Then open http://localhost:8501 in your browser.
- Run Configuration — set the output directory and number of threads.
- Input Files — provide paths to your probe FASTA and reference files. Use the Browse buttons or type paths directly.
- Select an annotation format preset: GENCODE, RefSeq, CHESS, or Other (custom schema). The correct GFF/GTF schema is applied automatically.
- Select All (GENCODE + CHESS + RefSeq) to run against all three reference annotations sequentially and merge the results into a unified off-target table.
- An optional gene synonyms CSV can be provided to remap gene names that differ between your probe FASTA and the reference (e.g. WARS→WARS1).
- Analysis Options:
- Pad length — number of bases at each probe end where mismatches are tolerated (default: off). Used in the original paper for Xenium probes, which are circular and can tolerate terminal mismatches.
- Max mismatches anywhere — allow up to N mismatches anywhere in the full probe sequence (default: off). Can be combined with pad length: when both are set, both conditions must be satisfied.
- Click Run OPT to run all three modules (flip → track → stat) and view results in the dashboard below.
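The way pad length and max mismatches combine can be sketched as follows. This is a minimal illustration, not OPT's implementation: it assumes an ungapped, equal-length comparison between a probe and the aligned target segment, treats pad_length=0 and max_mismatches=-1 as "off", and requires an exact match when both are off. The function names are hypothetical.

```python
def mismatch_positions(probe: str, target: str) -> list[int]:
    """Return 0-based positions where probe and target differ (ungapped)."""
    if len(probe) != len(target):
        raise ValueError("this sketch only compares equal-length sequences")
    return [i for i, (p, t) in enumerate(zip(probe.upper(), target.upper())) if p != t]


def accept_alignment(probe: str, target: str,
                     pad_length: int = 0, max_mismatches: int = -1) -> bool:
    """Decide whether an alignment passes the enabled mismatch conditions.

    pad_length > 0      : every mismatch must fall in the first or last
                          pad_length bases of the probe.
    max_mismatches >= 0 : at most that many mismatches anywhere in the probe.
    Both enabled        : both conditions must hold.
    Neither enabled     : assumed here to mean an exact match is required.
    """
    mismatches = mismatch_positions(probe, target)
    n = len(probe)
    conditions = []
    if pad_length > 0:
        conditions.append(
            all(i < pad_length or i >= n - pad_length for i in mismatches)
        )
    if max_mismatches >= 0:
        conditions.append(len(mismatches) <= max_mismatches)
    if not conditions:
        return not mismatches
    return all(conditions)
```

With both options set, an alignment with one mismatch at each end of the probe passes the pad condition but fails max_mismatches=1, so it is rejected.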
The results dashboard shows:
- Brief Summary - total genes with off-target binding, total genes with protein-coding off-targets, total probes with off-target binding
- Gene-level off-target table — one row per target gene → off-target gene pair, with biotype badges, CIGAR strings, and source annotation. Filterable by biotype and sortable by any column.
- Probe-level detail table (expandable) - one row per probe, showing off-target genes, biotypes, and CIGAR strings (consistent counts, |-delimited).
- Download buttons for all key output files.
To load results from a previous run without re-running OPT, set the Output Directory to your previous run folder and click Load previous results.
OPT consists of three modules — flip, track, stat — plus an all module that runs all three in sequence.
opt -o out_dir all -q probes.fa -t transcripts.fa -a transcripts.gff

| Argument | Description |
|---|---|
| -o, --out-dir | Output directory (required) |
| -p, --threads | Number of threads (default: 1) |
| --bam | Store alignments as BAM instead of SAM |
| -l, --min-exact-match | Minimum exact match length for nucmer (default: 20) |
| --schema | Comma-separated list of 5 GFF/GTF schema fields (see below) |
| --keep-dot | Keep version suffixes in gene IDs (e.g. ENSG00000.1) |
| --force | Recompute all steps, ignoring any cached results |
| --skip-index | Skip Bowtie2 index build step |
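To make the effect of --keep-dot concrete, here is a small sketch of version-suffix handling. It is an assumption about the intended behavior based only on the flag description (without the flag, a trailing numeric ".N" is dropped from gene IDs); the function name is hypothetical and not part of the opt package.

```python
def normalize_gene_id(gene_id: str, keep_dot: bool = False) -> str:
    """Drop a trailing numeric version suffix unless keep_dot is set."""
    if keep_dot:
        return gene_id
    base, dot, suffix = gene_id.rpartition(".")
    # Only strip when there is a dot followed purely by digits.
    if base and dot and suffix.isdigit():
        return base
    return gene_id
```

IDs without a numeric suffix pass through unchanged, so the helper is safe to apply to mixed ID styles.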
opt -o out_dir flip -q probes.fa -t transcripts.fa -a transcripts.gff
Probes are expected to be on the same strand as their target gene. flip detects probes that align to the reverse complement of their target and flips them. Output: fwd_oriented.fa.
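The idea behind flip can be illustrated with a short sketch. This is not OPT's code: an exact substring search stands in for the alignment step, and the function names are hypothetical.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def orient_probe(probe: str, transcript: str) -> str:
    """Return the probe in the orientation that matches the transcript.

    If the probe matches as given, it is returned unchanged. If only its
    reverse complement matches, the flipped sequence is returned. If neither
    matches exactly, the probe is left as-is.
    """
    if probe in transcript:
        return probe
    flipped = reverse_complement(probe)
    if flipped in transcript:
        return flipped
    return probe
```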
opt -o out_dir track -q fwd_oriented.fa -t transcripts.fa -a transcripts.gff

| Argument | Description |
|---|---|
| -q, --query | Query probe FASTA (required) |
| -t, --target | Target transcript FASTA (required) |
| -a, --annotation | Annotation GFF/GTF (required) |
| -pl, --pad-length | Tolerate mismatches in the terminal N bases of each probe end |
| -mm, --max-mismatches | Allow up to N mismatches anywhere in the full probe (-1 = disabled) |
| -1, --one-mismatch | Allow up to 1 mismatch using mummer exact-match extension |
Output: probe2targets.tsv (all probes) and probe2targets_offtargets.tsv (probes mapping to >1 gene).
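The "more than one gene" criterion can be sketched in a few lines. The example works on in-memory (probe, gene) pairs rather than on probe2targets.tsv itself, because the column layout of that file is not described here; the function name and the sample identifiers are hypothetical.

```python
from collections import defaultdict


def probes_with_multiple_genes(hits):
    """Map each probe that hits more than one distinct gene to its sorted genes.

    hits is an iterable of (probe_id, gene_name) pairs; repeated hits to the
    same gene (for example, to several of its transcripts) count once.
    """
    genes_by_probe = defaultdict(set)
    for probe_id, gene in hits:
        genes_by_probe[probe_id].add(gene)
    return {
        probe_id: sorted(genes)
        for probe_id, genes in genes_by_probe.items()
        if len(genes) > 1
    }
```

Collapsing to a set of genes first means a probe that matches several isoforms of its own target is not flagged.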
opt -o out_dir stat -i probe2targets.tsv -q probes.fa

| Argument | Description |
|---|---|
| -i, --in-file | probe2targets.tsv from the track module (required) |
| -q, --query | Query probe FASTA (required) |
| --exclude-pseudo | Exclude pseudogenes from off-target counts |
| --pc-only | Count only protein-coding genes as off-targets |
| -s, --syn-file | Gene synonyms CSV (two columns: old name, new name) |
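A rough sketch of how the --exclude-pseudo and --pc-only filters narrow the off-target list follows. It assumes biotype strings such as "protein_coding" and values containing "pseudogene", which may not match every annotation; the function name is hypothetical and this is not the stat module's code.

```python
def filter_off_targets(off_targets, exclude_pseudo=False, pc_only=False):
    """Return the gene names that still count as off-targets after filtering.

    off_targets is an iterable of (gene_name, biotype) pairs.
    """
    kept = []
    for gene, biotype in off_targets:
        if pc_only and biotype != "protein_coding":
            continue  # keep protein-coding genes only
        if exclude_pseudo and "pseudogene" in biotype:
            continue  # drop any pseudogene biotype
        kept.append(gene)
    return kept
```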
Probe FASTA headers must follow this format:
>gene_id|gene_name|unique_id
Example:
>ENSG00000170458|CD14|22f9405
ATCGATCGATCGATCGATCG...
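Before running OPT, it can help to check that every header splits into exactly three pipe-delimited fields. The parser below is a minimal sketch with hypothetical names; it is not part of the opt package.

```python
from typing import NamedTuple


class ProbeHeader(NamedTuple):
    gene_id: str
    gene_name: str
    unique_id: str


def parse_probe_header(line: str) -> ProbeHeader:
    """Parse a '>gene_id|gene_name|unique_id' FASTA header line."""
    if not line.startswith(">"):
        raise ValueError(f"not a FASTA header: {line!r}")
    fields = line[1:].strip().split("|")
    if len(fields) != 3 or not all(fields):
        raise ValueError(f"expected 3 non-empty '|'-separated fields: {line!r}")
    return ProbeHeader(*fields)
```

Running this over each header line of a probe file surfaces malformed entries as a ValueError that names the offending line.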
The transcript reference is a standard nucleotide FASTA of transcript sequences (.fa or .fasta). We recommend extracting these with gffread:
gffread -w transcripts.fa -g genome.fa annotation.gff
Note: The web app requires uncompressed .fa or .fasta files. The CLI accepts any format that nucmer/Bowtie2 supports.
The annotation must be in standard GFF3 or GTF format (.gff, .gff3, or .gtf). GENCODE, RefSeq, and CHESS formats are all supported. Select the matching preset in the web app, or use --schema on the command line for non-standard formats.
Note: The web app requires uncompressed annotation files. Gzip-compressed files (.gz) are supported via the CLI only.
The gene synonyms file is a two-column CSV mapping probe gene names to annotation gene names. No header row is required:
WARS,WARS1
CARS,CARS1
Use this when gene names in your probe FASTA differ from those in the reference annotation.
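The remapping amounts to a dictionary lookup that falls back to the original name. The sketch below reads the two-column format with Python's csv module; the helper names are hypothetical and this is not the code OPT itself uses.

```python
import csv
import io


def load_synonyms(handle) -> dict[str, str]:
    """Read 'old_name,new_name' rows (no header) into a dict."""
    mapping = {}
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        old, new = row[0].strip(), row[1].strip()
        if old and new:
            mapping[old] = new
    return mapping


def remap_gene_name(name: str, synonyms: dict[str, str]) -> str:
    """Return the annotation gene name, or the input name if no synonym exists."""
    return synonyms.get(name, name)


# Usage with the two example rows shown above.
synonyms = load_synonyms(io.StringIO("WARS,WARS1\nCARS,CARS1\n"))
```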
The --schema argument specifies five comma-separated field names used to parse the annotation. Built-in presets for common formats:
| Format | Schema string |
|---|---|
| GENCODE GFF | transcript,ID,Parent,gene_name,transcript_type |
| RefSeq GFF | transcript,ID,Parent,gene,gbkey |
| CHESS GFF | transcript,ID,Parent,gene_name,gene_type |
| GTF (general) | transcript,transcript_id,gene_id,gene_name,transcript_type |

Each position in the schema string has the following meaning:

| Position | Description |
|---|---|
| 1 | Feature type (column 3 of the GFF/GTF) |
| 2 | Transcript ID attribute |
| 3 | Parent gene attribute |
| 4 | Gene name attribute |
| 5 | Transcript type / biotype attribute |
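To show how the five positions are used, the sketch below parses a schema string and applies it to one GFF3 line. It is illustrative only: the names are hypothetical, the sample line is invented, and it handles GFF3 key=value attributes but not the GTF attribute syntax.

```python
from typing import NamedTuple, Optional


class Schema(NamedTuple):
    feature_type: str        # position 1: value expected in GFF column 3
    transcript_id_attr: str  # position 2
    parent_gene_attr: str    # position 3
    gene_name_attr: str      # position 4
    biotype_attr: str        # position 5


def parse_schema(schema_string: str) -> Schema:
    """Split a comma-separated schema string into its five named fields."""
    fields = [f.strip() for f in schema_string.split(",")]
    if len(fields) != 5 or not all(fields):
        raise ValueError("schema must have exactly 5 non-empty fields")
    return Schema(*fields)


def parse_gff3_attributes(column9: str) -> dict[str, str]:
    """Parse GFF3 'key=value;key=value' attributes into a dict."""
    attrs = {}
    for item in column9.strip().strip(";").split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def describe_transcript(gff_line: str, schema: Schema) -> Optional[dict]:
    """Extract schema-selected attributes from a matching feature line."""
    cols = gff_line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[2] != schema.feature_type:
        return None  # not a 9-column record of the requested feature type
    attrs = parse_gff3_attributes(cols[8])
    return {
        "transcript_id": attrs.get(schema.transcript_id_attr),
        "gene_id": attrs.get(schema.parent_gene_attr),
        "gene_name": attrs.get(schema.gene_name_attr),
        "biotype": attrs.get(schema.biotype_attr),
    }
```

Attributes missing from a line come back as None, which makes a mismatched schema easy to spot on the first few records.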
If you are unsure which schema to use, open a GitHub issue.
| File | Description |
|---|---|
| fwd_oriented.fa | Strand-corrected probe sequences (from flip) |
| flip_t2g.csv | Transcript-to-gene map built during flip |
| probe2targets.tsv | All probe alignments with gene and CIGAR info |
| probe2targets_offtargets.tsv | Probes mapping to more than one gene |
| collapsed_summary.tsv | Per-gene summary of all probe alignments |
| collapsed_summary_offtargets.tsv | Per-gene summary for off-target genes only |
| stat_off_target_probes.txt | List of off-target probe IDs |
| stat_off_target_genes.txt | List of off-target gene names |
| stat_missed_probes.txt | Probes that did not align to their target gene |
| stat_missed_genes.txt | Target genes with no aligned probes |
| track.unmapped.txt | Probes with no alignments |
| track.no_hit.txt | Probes that aligned but passed no acceptance threshold |
When running in All (GENCODE + CHESS + RefSeq) mode, each annotation runs in its own subdirectory (gencode/, chess/, refseq/) and results are merged into the base output directory with an added reference_annotation column.
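Conceptually, the merge tags each row with the annotation it came from and concatenates the tables. The sketch below shows that idea on in-memory rows; apart from reference_annotation, the column names are hypothetical, and it is not the code OPT uses.

```python
def merge_annotation_results(results_by_annotation):
    """Concatenate per-annotation rows, adding a reference_annotation column.

    results_by_annotation maps an annotation label (e.g. "gencode") to a list
    of row dicts. Input rows are copied, not modified.
    """
    merged = []
    for annotation, rows in results_by_annotation.items():
        for row in rows:
            merged.append({**row, "reference_annotation": annotation})
    return merged
```

Keeping the source label on every row lets the same target/off-target pair appear once per annotation that reports it.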
The data/ directory includes pre-formatted reference files for human (GRCh38):
| Source | Files |
|---|---|
| GENCODE v47 | data/gencode/gencode.v47.basic.annotation.fmted.fa / .gff |
| RefSeq v110 | data/refseq/refseq.v110.noAlt.noFix.filtered.fa.gz / .gff.gz |
| CHESS 3.1.3 | data/chess/chess3.1.3.GRCh38.primary.fa.gz / .gff.gz |
An example gene synonyms file is at data/gene_synonyms.csv.
The install.sh script automatically decompresses all .gz files in the data/ directory. To decompress manually:
find data/ -name "*.gz" -exec gunzip -k {} \;
Supported platforms:
- Linux (tested)
- macOS (tested; mummer4 requires Homebrew)
See LICENSE.md.