DeepChopper: Chimera Detection for Nanopore Direct RNA Sequencing

DeepChopper is a genomic language model designed to accurately detect and remove chimera artifacts in Nanopore direct RNA sequencing data. It uses a HyenaDNA backbone with a token classification head to identify artificial adapter sequences within reads.

Model Details

Model Description

DeepChopper leverages the HyenaDNA-small-32k backbone, a genomic foundation model, combined with a specialized token classification head to detect chimeric artifacts in nanopore direct RNA sequencing reads. The model processes both sequence information and base quality scores to make accurate predictions.

Model Architecture

  • Backbone: HyenaDNA-small-32k (256 dimensions)
  • Classification Head:
    • Linear Layer 1: 256 → 1024 dimensions
    • Linear Layer 2: 1024 → 1024 dimensions
    • Output Layer: 1024 → 2 classes (artifact/non-artifact)
    • Quality Score Integration: Identity layer for base quality incorporation
  • Input:
    • Tokenized DNA sequences (vocabulary size: 12)
    • Base quality scores
  • Output: Per-base classification (artifact vs. non-artifact)
  • Model Size: ~5.38M parameters (F32)
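To make the head's dimensions concrete, the sketch below (plain Python, no framework required) tallies the parameter count of the three linear layers listed above. The layer sizes come from this card; the weights-plus-biases accounting is the standard formula for fully connected layers.

```python
# Parameter bookkeeping for the classification head described above.
# A linear layer mapping n_in -> n_out has n_in * n_out weights + n_out biases.

def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases for one fully connected layer."""
    return n_in * n_out + n_out

head_layers = [
    (256, 1024),   # Linear Layer 1: backbone dim -> hidden
    (1024, 1024),  # Linear Layer 2: hidden -> hidden
    (1024, 2),     # Output layer: hidden -> 2 classes
]

head_total = sum(linear_params(i, o) for i, o in head_layers)
print(head_total)  # 1314818 parameters in the head alone
```

The remaining ~4M of the card's 5.38M total would sit in the HyenaDNA backbone and its embeddings; that split is an inference from the sizes above, not a stated breakdown.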

Uses

Direct Use

DeepChopper is designed for:

  • Detecting chimeric artifacts in Nanopore direct RNA sequencing data
  • Identifying adapter sequences within base-called reads
  • Preprocessing RNA-seq data before downstream transcriptomics analysis
  • Improving accuracy of transcript annotation and gene fusion detection

Downstream Use

The cleaned data can be used for:

  • Transcript isoform analysis
  • Gene expression quantification
  • Novel transcript discovery
  • Gene fusion detection
  • Alternative splicing analysis

Out-of-Scope Use

This model is NOT designed for:

  • DNA sequencing data (it's specifically trained on RNA sequences)
  • PacBio or Illumina sequencing platforms
  • Genome assembly or variant calling

Training Details

Training Data

The model was trained on Nanopore direct RNA sequencing data with manually curated annotations of chimeric artifacts and adapter sequences.

Training Procedure

  • Optimizer: Adam (lr=0.0002, weight_decay=0)
  • Learning Rate Scheduler: ReduceLROnPlateau (mode=min, factor=0.1, patience=10)
  • Loss Function: Continuous Interval Loss (cross-entropy with the interval penalty term set to zero)
  • Framework: PyTorch Lightning
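The ReduceLROnPlateau behavior above can be illustrated with a minimal, dependency-free reimplementation. The class below is illustrative, not DeepChopper code; it assumes the scheduler monitors validation loss (mode=min) and mirrors the stated factor=0.1 and patience=10.

```python
# Sketch of ReduceLROnPlateau(mode=min, factor=0.1, patience=10): if the
# monitored loss fails to improve for more than `patience` consecutive
# epochs, multiply the learning rate by `factor`.

class PlateauScheduler:
    def __init__(self, lr: float, factor: float = 0.1, patience: int = 10):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> float:
        if val_loss < self.best:          # improvement: reset the counter
            self.best = val_loss
            self.bad_epochs = 0
        else:                             # plateau: count stalled epochs
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor    # decay lr, start counting again
                self.bad_epochs = 0
        return self.lr

sched = PlateauScheduler(lr=0.0002)   # initial lr from this card
for loss in [1.0] + [0.9] * 12:       # one improvement, then a long plateau
    lr = sched.step(loss)
print(lr)  # ~2e-05 once the plateau exceeds patience
```

PyTorch's real scheduler adds a relative-improvement threshold and cooldown; this sketch keeps only the core plateau-then-decay logic.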

Training Hyperparameters

  • Learning Rate: 0.0002
  • Batch Size: Configured per experiment
  • Weight Decay: 0
  • Backbone: Fine-tuned (not frozen)

Evaluation

Testing Data & Metrics

The model is evaluated on held-out test sets using:

  • F1 Score (primary metric)
  • Precision
  • Recall

Results

DeepChopper significantly improves downstream analysis quality by accurately removing chimeric artifacts that would otherwise confound transcriptome analyses.

How to Use

Installation

pip install deepchopper

Python API

import deepchopper

# Load the pretrained model
model = deepchopper.DeepChopper.from_pretrained("yangliz5/deepchopper-rna004")

# The model is ready for inference
# Use with deepchopper's predict pipeline

Command Line Interface

# Step 1: Encode your FASTQ data
deepchopper encode input.fq

# Step 2: Predict chimeric artifacts
deepchopper predict input.parquet --output predictions

# Step 3: Remove artifacts and generate clean FASTQ
deepchopper chop predictions input.fq

For GPU acceleration:

deepchopper predict input.parquet --output predictions --gpus 1
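The three steps above can be chained from a small Python driver. The command lines below mirror the ones shown in this card; the file-naming convention (encode writing `<stem>.parquet` next to the input) is an assumption worth verifying against your deepchopper version.

```python
from pathlib import Path

def chimera_clean_pipeline(fastq: str, gpus: int = 0) -> list[list[str]]:
    """Build the encode -> predict -> chop command sequence.

    Assumes `deepchopper encode` writes a parquet file next to the
    input (check your version's docs). Returns argv lists rather than
    executing them, so the pipeline can be inspected or dry-run first.
    """
    fq = Path(fastq)
    parquet = fq.with_suffix(".parquet")
    predict = ["deepchopper", "predict", str(parquet), "--output", "predictions"]
    if gpus:
        predict += ["--gpus", str(gpus)]
    return [
        ["deepchopper", "encode", str(fq)],
        predict,
        ["deepchopper", "chop", "predictions", str(fq)],
    ]

# Dry run: print the commands instead of executing them.
for cmd in chimera_clean_pipeline("input.fq", gpus=1):
    print(" ".join(cmd))
# To execute for real: subprocess.run(cmd, check=True) for each cmd.
```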

Web Interface

Try DeepChopper online without installation via the hosted Hugging Face Space.

Limitations

  • Platform-specific: Optimized for Nanopore direct RNA sequencing
  • Read length: Best performance on reads up to 32k bases (model context window)
  • Species: Trained primarily on human RNA sequences
  • Computational requirements: GPU recommended for large datasets

Citation

If you use DeepChopper in your research, please cite:

@article{Li2024.10.23.619929,
    author = {Li, Yangyang and Wang, Ting-You and Guo, Qingxiang and Ren, Yanan and Lu, Xiaotong and Cao, Qi and Yang, Rendong},
    title = {A Genomic Language Model for Chimera Artifact Detection in Nanopore Direct RNA Sequencing},
    year = {2024},
    doi = {10.1101/2024.10.23.619929},
    journal = {bioRxiv}
}

Contact & Support

Model Card Authors

YLab Team

Model Card Contact

For questions about this model, please open an issue on the GitHub repository.
