DeepChopper: Chimera Detection for Nanopore Direct RNA Sequencing

DeepChopper is a genomic language model designed to accurately detect and remove chimera artifacts in Nanopore direct RNA sequencing data. It uses a HyenaDNA backbone with a token classification head to identify artificial adapter sequences within reads.

Model Details

Model Description

DeepChopper leverages the HyenaDNA-small-32k backbone, a genomic foundation model, combined with a specialized token classification head to detect chimeric artifacts in nanopore direct RNA sequencing reads. The model processes both sequence information and base quality scores to make accurate predictions.

Model Architecture

  • Backbone: HyenaDNA-small-32k (256 dimensions)
  • Classification Head:
    • Linear Layer 1: 256 → 1024 dimensions
    • Linear Layer 2: 1024 → 1024 dimensions
    • Output Layer: 1024 → 2 classes (artifact/non-artifact)
    • Quality Score Integration: Identity layer for base quality incorporation
  • Input:
    • Tokenized DNA sequences (vocabulary size: 12)
    • Base quality scores
  • Output: Per-base classification (artifact vs. non-artifact)
  • Model Size: ~5.38M parameters (F32)
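To make the head's dimensions concrete, the sketch below (plain Python, no framework required) tallies the parameter count of the three linear layers listed above. The layer sizes come from this card; the weights-plus-biases accounting is the standard formula for fully connected layers.

```python
# Parameter bookkeeping for the classification head described above.
# A linear layer mapping n_in -> n_out has n_in * n_out weights + n_out biases.

def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases for one fully connected layer."""
    return n_in * n_out + n_out

head_layers = [
    (256, 1024),   # Linear Layer 1: backbone dim -> hidden
    (1024, 1024),  # Linear Layer 2: hidden -> hidden
    (1024, 2),     # Output layer: hidden -> 2 classes
]

head_total = sum(linear_params(i, o) for i, o in head_layers)
print(head_total)  # 1314818 parameters in the head alone
```

The remaining ~4M of the card's 5.38M total would sit in the HyenaDNA backbone and its embeddings; that split is an inference from the sizes above, not a stated breakdown.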

Uses

Direct Use

DeepChopper is designed for:

  • Detecting chimeric artifacts in Nanopore direct RNA sequencing data
  • Identifying adapter sequences within base-called reads
  • Preprocessing RNA-seq data before downstream transcriptomics analysis
  • Improving accuracy of transcript annotation and gene fusion detection

Downstream Use

The cleaned data can be used for:

  • Transcript isoform analysis
  • Gene expression quantification
  • Novel transcript discovery
  • Gene fusion detection
  • Alternative splicing analysis

Out-of-Scope Use

This model is NOT designed for:

  • DNA sequencing data (it's specifically trained on RNA sequences)
  • PacBio or Illumina sequencing platforms
  • Genome assembly or variant calling

Training Details

Training Data

The model was trained on Nanopore direct RNA sequencing data with manually curated annotations of chimeric artifacts and adapter sequences.

Training Procedure

  • Optimizer: Adam (lr=0.0002, weight_decay=0)
  • Learning Rate Scheduler: ReduceLROnPlateau (mode=min, factor=0.1, patience=10)
  • Loss Function: Continuous Interval Loss (cross-entropy with the interval penalty term set to zero)
  • Framework: PyTorch Lightning
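The ReduceLROnPlateau behavior above can be illustrated with a minimal, dependency-free reimplementation. The class below is illustrative, not DeepChopper code; it assumes the scheduler monitors validation loss (mode=min) and mirrors the stated factor=0.1 and patience=10.

```python
# Sketch of ReduceLROnPlateau(mode=min, factor=0.1, patience=10): if the
# monitored loss fails to improve for more than `patience` consecutive
# epochs, multiply the learning rate by `factor`.

class PlateauScheduler:
    def __init__(self, lr: float, factor: float = 0.1, patience: int = 10):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> float:
        if val_loss < self.best:          # improvement: reset the counter
            self.best = val_loss
            self.bad_epochs = 0
        else:                             # plateau: count stalled epochs
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor    # decay lr, start counting again
                self.bad_epochs = 0
        return self.lr

sched = PlateauScheduler(lr=0.0002)   # initial lr from this card
for loss in [1.0] + [0.9] * 12:       # one improvement, then a long plateau
    lr = sched.step(loss)
print(lr)  # ~2e-05 once the plateau exceeds patience
```

PyTorch's real scheduler adds a relative-improvement threshold and cooldown; this sketch keeps only the core plateau-then-decay logic.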

Training Hyperparameters

  • Learning Rate: 0.0002
  • Batch Size: Configured per experiment
  • Weight Decay: 0
  • Backbone: Fine-tuned (not frozen)

Evaluation

Testing Data & Metrics

The model is evaluated on held-out test sets using:

  • F1 Score (primary metric)
  • Precision
  • Recall

Results

DeepChopper significantly improves downstream analysis quality by accurately removing chimeric artifacts that would otherwise confound transcriptome analyses.

How to Use

Installation

pip install deepchopper

Python API

import deepchopper

# Load the pretrained model
model = deepchopper.DeepChopper.from_pretrained("yangliz5/deepchopper-rna004")

# The model is ready for inference
# Use with deepchopper's predict pipeline

Command Line Interface

# Step 1: Encode your FASTQ data
deepchopper encode input.fq

# Step 2: Predict chimeric artifacts
deepchopper predict input.parquet --output predictions

# Step 3: Remove artifacts and generate clean FASTQ
deepchopper chop predictions input.fq

For GPU acceleration:

deepchopper predict input.parquet --output predictions --gpus 1
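The three steps above can be chained from a small Python driver. The command lines below mirror the ones shown in this card; the file-naming convention (encode writing `<stem>.parquet` next to the input) is an assumption worth verifying against your deepchopper version.

```python
from pathlib import Path

def chimera_clean_pipeline(fastq: str, gpus: int = 0) -> list[list[str]]:
    """Build the encode -> predict -> chop command sequence.

    Assumes `deepchopper encode` writes a parquet file next to the
    input (check your version's docs). Returns argv lists rather than
    executing them, so the pipeline can be inspected or dry-run first.
    """
    fq = Path(fastq)
    parquet = fq.with_suffix(".parquet")
    predict = ["deepchopper", "predict", str(parquet), "--output", "predictions"]
    if gpus:
        predict += ["--gpus", str(gpus)]
    return [
        ["deepchopper", "encode", str(fq)],
        predict,
        ["deepchopper", "chop", "predictions", str(fq)],
    ]

# Dry run: print the commands instead of executing them.
for cmd in chimera_clean_pipeline("input.fq", gpus=1):
    print(" ".join(cmd))
# To execute for real: subprocess.run(cmd, check=True) for each cmd.
```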

Web Interface

Try DeepChopper online without installation via the hosted Hugging Face Space.

Limitations

  • Platform-specific: Optimized for Nanopore direct RNA sequencing
  • Read length: Best performance on reads up to 32k bases (model context window)
  • Species: Trained primarily on human RNA sequences
  • Computational requirements: GPU recommended for large datasets

Citation

If you use DeepChopper in your research, please cite:

@article{Li2024.10.23.619929,
    author = {Li, Yangyang and Wang, Ting-You and Guo, Qingxiang and Ren, Yanan and Lu, Xiaotong and Cao, Qi and Yang, Rendong},
    title = {A Genomic Language Model for Chimera Artifact Detection in Nanopore Direct RNA Sequencing},
    year = {2024},
    doi = {10.1101/2024.10.23.619929},
    journal = {bioRxiv}
}

Contact & Support

Model Card Authors

YLab Team

Model Card Contact

For questions about this model, please open an issue on the GitHub repository.
