GENATATOR-Caduceus-PS (Multispecies Gene Segmentation Model)

Overview

GENATATOR-Caduceus-PS is a DNA language model fine-tuned for gene segmentation directly from genomic DNA sequences.

The model performs nucleotide-level multilabel classification and predicts five gene structure classes:

Class	Description
5UTR	5′ untranslated region
exon	exon
intron	intron
3UTR	3′ untranslated region
CDS	coding sequence

The order of output classes in the model is:

["5UTR", "exon", "intron", "3UTR", "CDS"]

The model outputs one logit vector per nucleotide, allowing reconstruction of gene structures.
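Because the task is described as multilabel, one plausible way to read a logit vector is to apply an independent sigmoid to each class and keep every class whose probability clears a threshold. The sketch below shows this for a single nucleotide in plain Python. The 0.5 threshold and the helper names `sigmoid` and `labels_for_position` are illustrative assumptions, not part of the released model code.

```python
import math

# Class order as stated in this model card.
CLASS_NAMES = ["5UTR", "exon", "intron", "3UTR", "CDS"]


def sigmoid(x):
    """Numerically stable logistic function for a single float."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def labels_for_position(logit_vector, threshold=0.5):
    """Return the names of all classes whose sigmoid probability >= threshold.

    The threshold of 0.5 is an assumption chosen for illustration.
    """
    if len(logit_vector) != len(CLASS_NAMES):
        raise ValueError("expected one logit per class")
    return [
        name
        for name, logit in zip(CLASS_NAMES, logit_vector)
        if sigmoid(logit) >= threshold
    ]


# Hypothetical logits for one nucleotide (made-up values, not model output).
example_logits = [-3.0, 2.5, -4.0, -2.0, 1.2]
print(labels_for_position(example_logits))  # ['exon', 'CDS']
```

A nucleotide can carry several labels at once under this reading (here both exon and CDS). On a full logits tensor the same step can be done in one call with `torch.sigmoid(logits)`.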

Model

Model repository on Hugging Face:

shmelev/genatator-caduceus-ps-multispecies

Architecture properties:

backbone: Caduceus PS
layers: 16
hidden size: 512
tokenization: single nucleotide
output head: linear projection to 5 classes
maximum supported sequence length: 250,000 nucleotides
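
Inputs longer than the 250,000-nucleotide limit have to be split before inference. The following is a minimal sketch of fixed-size windowing with an optional overlap. The helper `split_into_windows` and its parameters are hypothetical, and since the model was fine-tuned on whole gene sequences, predictions near window boundaries that cut through a gene may be less reliable.

```python
MAX_LEN = 250_000  # maximum supported sequence length stated in this card


def split_into_windows(sequence, window=MAX_LEN, overlap=0):
    """Split a sequence into (start_offset, subsequence) windows.

    Consecutive windows share `overlap` nucleotides. The last window may be
    shorter than `window`. Offsets are 0-based positions in the input.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if not sequence:
        return []

    step = window - overlap
    windows = []
    start = 0
    while True:
        windows.append((start, sequence[start:start + window]))
        if start + window >= len(sequence):
            break  # this window already reaches the end of the sequence
        start += step
    return windows


# Small demonstration with a toy window size.
print(split_into_windows("ACGTACGTAC", window=4, overlap=1))
# [(0, 'ACGT'), (3, 'TACG'), (6, 'GTAC')]
```

The returned offsets make it possible to map per-window predictions back to coordinates in the original sequence.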

Training Data

This model was fine-tuned on gene sequences only, not on complete genomes.

Training data includes:

mRNA transcripts
lncRNA transcripts

Dataset characteristics:

one transcript per gene
no intergenic regions
multispecies training dataset

Each training sample corresponds to a single gene sequence.

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification

repo_id = "shmelev/genatator-caduceus-ps-multispecies"

tokenizer = AutoTokenizer.from_pretrained(
    repo_id,
    trust_remote_code=True,
)

model = AutoModelForTokenClassification.from_pretrained(
    repo_id,
    trust_remote_code=True,
)

model.eval()

Example Inference

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

repo_id = "shmelev/genatator-caduceus-ps-multispecies"

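tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForTokenClassification.from_pretrained(repo_id, trust_remote_code=True)
model.eval()  # inference mode: disables dropout

# Both sequences have the same length (28 nt), so their token ids stack into
# one rectangular tensor. Sequences of different lengths would need padding
# or separate forward passes.
sequences = [
    "ACGTACGTACGTACGTACGTACGTACGT",
    "TTGCGATCGATCGATCGATCGATCGATC",
]

enc = tokenizer(sequences)

input_ids = torch.tensor(enc["input_ids"])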

with torch.no_grad():
    outputs = model(input_ids=input_ids)

logits = outputs["logits"]

print("Input shape:", input_ids.shape)
print("Logits shape:", logits.shape)

Example output:

Input shape: torch.Size([2, sequence_length])
Logits shape: torch.Size([2, sequence_length, 5])

Each nucleotide receives 5 logits corresponding to the gene structure classes.
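
To turn per-nucleotide calls into gene structure segments, consecutive positions with the same label can be merged into intervals. The sketch below does this for a single class, given a 0/1 mask over positions. The helper name `mask_to_intervals` and the half-open interval convention are assumptions made for illustration.

```python
def mask_to_intervals(mask):
    """Convert a per-nucleotide 0/1 mask into half-open (start, end) intervals.

    Positions are 0-based. An interval (start, end) covers mask[start:end].
    """
    intervals = []
    start = None
    for i, flag in enumerate(mask):
        if flag and start is None:
            start = i  # a new run of positive positions begins
        elif not flag and start is not None:
            intervals.append((start, i))  # the current run ended at i - 1
            start = None
    if start is not None:
        intervals.append((start, len(mask)))  # run extends to the last position
    return intervals


# Hypothetical exon mask for an 8-nt sequence (made-up values).
exon_mask = [0, 1, 1, 1, 0, 0, 1, 1]
print(mask_to_intervals(exon_mask))  # [(1, 4), (6, 8)]
```

A mask for one class could be derived from the logits of the example above, for instance with `(torch.sigmoid(logits[0, :, 1]) >= 0.5).tolist()` for the exon class at index 1. The 0.5 threshold is again an assumption.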

Model size: 14M params (Safetensors; tensor types: I64, F32)