DeepChopper: Chimera Detection for Nanopore Direct RNA Sequencing
DeepChopper is a genomic language model designed to accurately detect and remove chimera artifacts in Nanopore direct RNA sequencing data. It uses a HyenaDNA backbone with a token classification head to identify artificial adapter sequences within reads.
Model Details
Model Description
DeepChopper leverages the HyenaDNA-small-32k backbone, a genomic foundation model, combined with a specialized token classification head to detect chimeric artifacts in nanopore direct RNA sequencing reads. The model processes both sequence information and base quality scores to make accurate predictions.
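To make the two input channels concrete, the sketch below parses a single four-line FASTQ record and decodes its quality string into integer Phred scores. It assumes the standard Phred+33 encoding. The function names are illustrative and are not part of the deepchopper API, whose own encoding step may work differently.

```python
from typing import List, Tuple


def decode_phred33(quality: str) -> List[int]:
    """Convert a FASTQ quality string to integer Phred scores (Phred+33)."""
    return [ord(ch) - 33 for ch in quality]


def parse_fastq_record(lines: List[str]) -> Tuple[str, str, List[int]]:
    """Parse one four-line FASTQ record into (read_id, sequence, qualities)."""
    header, sequence, separator, quality = (line.strip() for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a well-formed FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    read_id = header[1:].split()[0]  # drop '@' and any trailing description
    return read_id, sequence, decode_phred33(quality)


# Hypothetical record: one quality character per base.
record = ["@read1 example", "ACGT", "+", "!5?I"]
read_id, sequence, qualities = parse_fastq_record(record)
print(read_id, sequence, qualities)  # read1 ACGT [0, 20, 30, 40]
```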
- Developed by: YLab Team (Li et al., 2024)
- Model type: Token Classification
- Language(s): DNA sequences
- License: Apache 2.0
- Base Model: HyenaDNA-small-32k-seqlen
- Repository: DeepChopper GitHub
- Paper: A Genomic Language Model for Chimera Artifact Detection
Model Architecture
- Backbone: HyenaDNA-small-32k (256 dimensions)
- Classification Head:
  - Linear Layer 1: 256 → 1024 dimensions
  - Linear Layer 2: 1024 → 1024 dimensions
  - Output Layer: 1024 → 2 classes (artifact/non-artifact)
- Quality Score Integration: Identity layer for base quality incorporation
- Input:
  - Tokenized DNA sequences (vocabulary size: 12)
  - Base quality scores
- Output: Per-base classification (artifact vs. non-artifact)
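The layer sizes listed above can be sketched in PyTorch as follows. This is a minimal illustration, not the DeepChopper source: the class name, the ReLU activation, and the way quality scores are combined with the features (broadcast addition after the first layer) are all assumptions, since the card only gives the dimensions and mentions an identity layer for quality scores.

```python
import torch
from torch import nn


class TokenClassificationHead(nn.Module):
    """Illustrative per-base classifier on top of 256-dim backbone features."""

    def __init__(self, hidden_size: int = 256, inner_size: int = 1024, num_classes: int = 2):
        super().__init__()
        self.linear1 = nn.Linear(hidden_size, inner_size)  # 256 -> 1024
        self.linear2 = nn.Linear(inner_size, inner_size)   # 1024 -> 1024
        self.output = nn.Linear(inner_size, num_classes)   # 1024 -> 2
        self.qual_layer = nn.Identity()  # quality scores pass through unchanged
        self.activation = nn.ReLU()      # assumption: activation is not stated in the card

    def forward(self, hidden_states: torch.Tensor, quals: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, length, hidden_size); quals: (batch, length)
        x = self.activation(self.linear1(hidden_states))
        # Assumption: each base's quality score is broadcast-added to its features.
        x = x + self.qual_layer(quals.unsqueeze(-1))
        x = self.activation(self.linear2(x))
        return self.output(x)  # (batch, length, num_classes) logits


head = TokenClassificationHead()
features = torch.randn(2, 50, 256)             # stand-in for backbone output
quals = torch.randint(0, 41, (2, 50)).float()  # stand-in Phred scores
logits = head(features, quals)
print(logits.shape)  # torch.Size([2, 50, 2])
```

Taking the argmax over the last dimension of `logits` gives one artifact/non-artifact label per base.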
Uses
Direct Use
DeepChopper is designed for:
- Detecting chimeric artifacts in Nanopore direct RNA sequencing data
- Identifying adapter sequences within base-called reads
- Preprocessing RNA-seq data before downstream transcriptomics analysis
- Improving accuracy of transcript annotation and gene fusion detection
Downstream Use
The cleaned data can be used for:
- Transcript isoform analysis
- Gene expression quantification
- Novel transcript discovery
- Gene fusion detection
- Alternative splicing analysis
Out-of-Scope Use
This model is NOT designed for:
- DNA sequencing data (it's specifically trained on RNA sequences)
- PacBio or Illumina sequencing platforms
- Genome assembly or variant calling
Training Details
Training Data
The model was trained on Nanopore direct RNA sequencing data with manually curated annotations of chimeric artifacts and adapter sequences.
Training Procedure
- Optimizer: Adam (lr=0.0002, weight_decay=0)
- Learning Rate Scheduler: ReduceLROnPlateau (mode=min, factor=0.1, patience=10)
- Loss Function: Continuous Interval Loss (CrossEntropyLoss with no penalty)
- Framework: PyTorch Lightning
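The optimizer and scheduler settings above translate to plain PyTorch roughly as shown below. The `build_optimizer` helper and the placeholder model are hypothetical and exist only for illustration. The loop feeds a validation loss that never improves to show when the learning rate is reduced.

```python
import torch
from torch import nn


def build_optimizer(model: nn.Module):
    """Return an Adam optimizer and plateau scheduler with the listed settings."""
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, weight_decay=0)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.1, patience=10
    )
    return optimizer, scheduler


model = nn.Linear(256, 2)  # placeholder for the real network
optimizer, scheduler = build_optimizer(model)

# The first step records the best loss; once more than `patience` (10)
# further epochs pass without improvement, the learning rate is
# multiplied by `factor` (0.1), giving roughly 2e-5.
for epoch in range(12):
    scheduler.step(1.0)

print(optimizer.param_groups[0]["lr"])
```

Under PyTorch Lightning, the same pair would typically be returned from `configure_optimizers` together with the name of the monitored metric. Which metric DeepChopper monitors is not stated here.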
Training Hyperparameters
- Learning Rate: 0.0002
- Batch Size: Configured per experiment
- Weight Decay: 0
- Backbone: Fine-tuned (not frozen)
Evaluation
Testing Data & Metrics
The model is evaluated on held-out test sets using:
- F1 Score (primary metric)
- Precision
- Recall
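As a reminder of how these three metrics relate, the sketch below computes them from per-base labels. It assumes the artifact class is the positive class and that scoring is done per base. The card does not say whether the reported metrics are per base, per interval, or per read, so treat this as one possible reading.

```python
from typing import Sequence, Tuple


def per_base_prf(
    labels: Sequence[int], preds: Sequence[int], positive: int = 1
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for the positive (artifact) class."""
    if len(labels) != len(preds):
        raise ValueError("labels and predictions must have the same length")
    tp = sum(1 for y, p in zip(labels, preds) if y == positive and p == positive)
    fp = sum(1 for y, p in zip(labels, preds) if y != positive and p == positive)
    fn = sum(1 for y, p in zip(labels, preds) if y == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: the predicted artifact run is shifted one base to the left.
labels = [0, 0, 1, 1, 1, 1, 0, 0]
preds = [0, 1, 1, 1, 1, 0, 0, 0]
print(per_base_prf(labels, preds))  # (0.75, 0.75, 0.75)
```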
Results
By removing chimeric artifacts that would otherwise confound transcriptome analyses, DeepChopper improves the quality of downstream analysis. See the paper for quantitative results.
How to Use
Installation
pip install deepchopper
Python API
import deepchopper
# Load the pretrained model
model = deepchopper.DeepChopper.from_pretrained("yangliz5/deepchopper-rna004")
# The model is ready for inference
# Use with deepchopper's predict pipeline
Command Line Interface
# Step 1: Encode your FASTQ data
deepchopper encode input.fq
# Step 2: Predict chimeric artifacts
deepchopper predict input.parquet --output predictions
# Step 3: Remove artifacts and generate clean FASTQ
deepchopper chop predictions input.fq
For GPU acceleration:
deepchopper predict input.parquet --output predictions --gpus 1
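Conceptually, the chop step in Step 3 turns per-base predictions into artifact intervals and keeps what lies between them. The sketch below illustrates that idea only. It is not the `deepchopper chop` implementation, which may smooth predictions or apply its own length thresholds, and the function names and the `min_len` parameter are hypothetical.

```python
from typing import List, Sequence, Tuple


def artifact_intervals(labels: Sequence[int]) -> List[Tuple[int, int]]:
    """Return half-open [start, end) runs of consecutive artifact labels (1)."""
    intervals = []
    start = None
    for i, label in enumerate(labels):
        if label == 1 and start is None:
            start = i  # a new artifact run begins
        elif label != 1 and start is not None:
            intervals.append((start, i))  # the run ended just before i
            start = None
    if start is not None:
        intervals.append((start, len(labels)))  # run extends to the read end
    return intervals


def chop_read(
    seq: str, qual: str, labels: Sequence[int], min_len: int = 1
) -> List[Tuple[str, str]]:
    """Cut out artifact runs and return the remaining (sequence, quality) pieces."""
    if not (len(seq) == len(qual) == len(labels)):
        raise ValueError("sequence, quality, and labels must have equal length")
    pieces = []
    cursor = 0
    # A zero-length sentinel interval at the end flushes the final piece.
    for start, end in artifact_intervals(labels) + [(len(seq), len(seq))]:
        if start - cursor >= min_len:
            pieces.append((seq[cursor:start], qual[cursor:start]))
        cursor = end
    return pieces


# Toy chimeric read: two fragments joined by a 4-base artifact.
seq = "AAAACCCCGGGG"
qual = "I" * len(seq)
labels = [0] * 4 + [1] * 4 + [0] * 4
print(chop_read(seq, qual, labels))  # [('AAAA', 'IIII'), ('GGGG', 'IIII')]
```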
Web Interface
Try DeepChopper online without installation:
- Hugging Face Space
- Or run locally:
deepchopper web
Limitations
- Platform-specific: Optimized for Nanopore direct RNA sequencing
- Read length: Best performance on reads up to 32k bases (model context window)
- Species: Trained primarily on human RNA sequences
- Computational requirements: GPU recommended for large datasets
Citation
If you use DeepChopper in your research, please cite:
@article{Li2024.10.23.619929,
author = {Li, Yangyang and Wang, Ting-You and Guo, Qingxiang and Ren, Yanan and Lu, Xiaotong and Cao, Qi and Yang, Rendong},
title = {A Genomic Language Model for Chimera Artifact Detection in Nanopore Direct RNA Sequencing},
year = {2024},
doi = {10.1101/2024.10.23.619929},
journal = {bioRxiv}
}
Contact & Support
- Issues: GitHub Issues
- Documentation: Full Tutorial
- Repository: GitHub
Model Card Authors
YLab Team
Model Card Contact
For questions about this model, please open an issue on the GitHub repository.