
The official code repository for SetBERT: the deep learning platform for contextualized embeddings and explainable predictions from high-throughput sequencing

Graphical Abstract

Quick Start

Installation from PyPI:

pip install dbtk-setbert

Download SetBERT pre-trained on the Qiita 16S platform (see Available Models for other options):

from setbert import SetBert

# Download the model
model = SetBert.from_pretrained("sirdavidludwig/setbert", revision="qiita-16s")

# Get the tokenizer
tokenizer = model.sequence_encoder.tokenizer

Example sample embedding

import torch

# Input sample
sequences = [
    "ACTGCAG",
    "TGACGTA",
    "ATGACGA"
]

# Tokenize sequences in the sample
sequence_tokens = torch.stack([torch.tensor(tokenizer(s)) for s in sequences])

# Compute embeddings
output = model(sequence_tokens)

# Sample level representation
sample_embedding = output["class"]

# Contextualized sequence representations
sequence_embeddings = output["sequences"]
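The sample-level embedding can be used directly for downstream comparisons between samples. As a minimal sketch (plain Python, with made-up vectors standing in for real `output["class"]` values), cosine similarity between two sample embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical sample embeddings; real ones come from output["class"]
emb_a = [0.1, 0.3, -0.2, 0.5]
emb_b = [0.2, 0.1, -0.1, 0.4]

similarity = cosine_similarity(emb_a, emb_b)
```

In practice you would convert the model's tensor output to a list (or use `torch.nn.functional.cosine_similarity` directly on the tensors).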

Available Models

Model Revision | Platform     | Pre-training Dataset Description
qiita-16s      | 16S Amplicon | ~280k 16S amplicon samples from the Qiita platform

Configuration

SetBERT embeds the DNA sequences in a sample in chunks using activation checkpointing. The chunk size is specified by the sequence_encoder_chunk_size parameter of the SetBert.Config class and can be adjusted freely at any point.

# Set chunk size
model.config.sequence_encoder_chunk_size = 256 # default

# Remove chunking and embed all sequences in parallel
model.config.sequence_encoder_chunk_size = None
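To illustrate what the chunk size controls (a sketch of the idea, not the library's internal implementation): the sample's sequences are split into groups of at most sequence_encoder_chunk_size and encoded one group at a time, trading parallelism for lower peak memory.

```python
def chunked(items, chunk_size):
    """Yield successive groups of at most chunk_size items.
    chunk_size=None means a single group (no chunking)."""
    if chunk_size is None:
        yield items
        return
    for i in range(0, len(items), chunk_size):
        yield items[i:i + chunk_size]

# e.g. 600 sequences with the default chunk size of 256
chunks = list(chunked(list(range(600)), 256))
sizes = [len(c) for c in chunks]  # three groups: 256, 256, 88
```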

Manual Installation

git clone https://github.com/DLii-Research/setbert
pip install -e ./setbert

Citation

@article{ludwig_setbert_2025,
	title = {{SetBERT}: the deep learning platform for contextualized embeddings and explainable predictions from high-throughput sequencing},
	volume = {41},
	issn = {1367-4811},
	doi = {10.1093/bioinformatics/btaf370},
	number = {7},
	journal = {Bioinformatics},
	author = {Ludwig, II, David W and Guptil, Christopher and Alexander, Nicholas R and Zhalnina, Kateryna and Wipf, Edi M -L and Khasanova, Albina and Barber, Nicholas A and Swingley, Wesley and Walker, Donald M and Phillips, Joshua L},
	month = jul,
	year = {2025},
}

Original Experiment Source Code

The original source code used to produce the models and experiments for the manuscript is available in the bioinformatics branch of this repository.

About

A repository of deep learning models for DNA samples and sequences
