Nodule-AI: A Deep Learning Model for Nodule-Specific Gene Identification
Model Description
This model is a fine-tuned version of zhangtaolab/plant-dnamamba-BPE, specialized for identifying nodule-specific genes from promoter DNA sequences. The base model was pretrained on plant genomic sequences using a Mamba-based architecture with Byte Pair Encoding (BPE) tokenization; it is adapted here for promoter analysis through targeted fine-tuning.
How to Use
An NVIDIA GPU is required.
Installation
conda create -n llms python=3.11
conda activate llms
pip install 'torch<2.4' 'mambapy<=1.2.0' 'transformers<4.46' 'causal-conv1d<=1.3' 'mamba-ssm<2'
A fresh install may take about 15 minutes.
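After installation, it can help to confirm that the five packages from the pip command are present in the active environment. The following is a minimal sketch using only the Python standard library; the helper name `installed_versions` is hypothetical and not part of the model repository.

```python
from importlib import metadata

# Distribution names copied from the pip command above.
REQUIRED = ["torch", "mambapy", "transformers", "causal-conv1d", "mamba-ssm"]


def installed_versions(packages=REQUIRED):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name:15s} {version or 'MISSING'}")
```

Any package reported as MISSING points to an install step that did not complete. The check only reads package metadata, so it does not verify that the CUDA kernels actually load.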
Basic Inference
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# Load model and tokenizer
model_name = "lhui2010/nodule-AI"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True)
# Prepare input (3000bp promoter sequence)
promoter_sequence = "ATGCGTCTCA"*300 # your promoter here
# Tokenize and predict
inputs = tokenizer(
    promoter_sequence,
    return_tensors="pt",
    max_length=3000,
    truncation=True,
    padding="max_length"
)
with torch.no_grad():
    outputs = model(**inputs)
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
nodule_prob = probs[0][1].item()
print(f"Probability of nodule-specific regulation: {nodule_prob:.4f}")
The output should look like this:
Probability of nodule-specific regulation: 0.0021
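Real promoter sequences often arrive with line breaks, lowercase letters, or unexpected characters. Below is a minimal sketch of a sanity check to run before tokenization. The helper `clean_promoter` is hypothetical. Accepting `N` as a valid character and requiring exactly 3000 bp are assumptions based on the example above, not documented requirements of the model.

```python
import re

EXPECTED_LENGTH = 3000  # promoter length used in the example above


def clean_promoter(seq, expected_length=EXPECTED_LENGTH):
    """Uppercase a DNA string, drop whitespace, and validate it.

    Raises ValueError on characters other than A, C, G, T, N
    (allowing N is an assumption) or on a length mismatch.
    """
    cleaned = re.sub(r"\s+", "", seq).upper()
    bad = set(cleaned) - set("ACGTN")
    if bad:
        raise ValueError(f"unexpected characters: {sorted(bad)}")
    if len(cleaned) != expected_length:
        raise ValueError(f"expected {expected_length} bp, got {len(cleaned)}")
    return cleaned
```

The cleaned string can then be passed to the tokenizer in place of `promoter_sequence`.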
Calculation of Shapley scores
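As a generic illustration, and not necessarily the procedure used for Nodule-AI, Shapley values for regions of a promoter can be estimated by permutation sampling: windows of the sequence are revealed in random order, and each window is credited with the change in model score it causes. In the sketch below, `shapley_by_sampling`, `mask_windows`, and the choice to mask absent windows with `N` are all assumptions made for illustration. In practice `score_fn` would build the masked sequence and return the model's nodule probability.

```python
import random


def mask_windows(seq, present, window):
    """Replace every window whose index is not in `present` with N's."""
    chunks = [seq[i:i + window] for i in range(0, len(seq), window)]
    return "".join(
        chunk if idx in present else "N" * len(chunk)
        for idx, chunk in enumerate(chunks)
    )


def shapley_by_sampling(score_fn, n_features, n_permutations=200, seed=0):
    """Estimate Shapley values by averaging marginal contributions
    over random orderings of the features.

    score_fn takes a frozenset of 'present' feature indices
    and returns a float score.
    """
    rng = random.Random(seed)
    values = [0.0] * n_features
    order = list(range(n_features))
    for _ in range(n_permutations):
        rng.shuffle(order)
        present = set()
        previous = score_fn(frozenset(present))
        for idx in order:
            present.add(idx)
            current = score_fn(frozenset(present))
            values[idx] += current - previous
            previous = current
    return [v / n_permutations for v in values]


# Toy stand-in for the model: an additive score over four windows.
toy_weights = [0.5, -0.25, 0.0, 1.0]


def toy_score(present):
    return sum(toy_weights[i] for i in present)


estimates = shapley_by_sampling(toy_score, n_features=4)
```

For an additive score such as `toy_score`, the estimates equal the window weights. With a real model, each permutation costs one forward pass per window, so the window size and number of permutations trade resolution against run time.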
Training Data
The model was fine-tuned on a large dataset of promoter sequences from nodule-induced genes and matching background genes, compiled from 14 plant genomes in the nitrogen-fixing clade:
| Data Category | Samples | Species Included |
|---|---|---|
| Nodule-specific promoters | 175,365 | Aeschynomene evenia, Alnus trabeculosa, Arachis hypogaea, Chamaecrista pumila, Coriaria nepalensis, Datisca glomerata, Elaeagnus umbellata, Glycine max, Hippophae rhamnoides, Lotus japonicus, Medicago truncatula, Mimosa pudica, Parasponia andersonii, Phaseolus vulgaris |
| Non-nodule promoters | 170,912 | Matching species background sets |
Training Procedure
Fine-tuning Parameters:
- Epochs: 5
- Batch size: 8
- Learning rate: 1e-5
- Hardware: 1 × Tesla V100 32GB GPU
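The parameters above can be written down as keyword arguments. This is a minimal sketch: the key names follow `transformers.TrainingArguments`, but whether the original fine-tuning used the Hugging Face `Trainer` is an assumption, and the helper `optimizer_steps` is hypothetical.

```python
import math

# Values from the list above; key names follow transformers.TrainingArguments.
finetune_kwargs = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": 8,
    "learning_rate": 1e-5,
}


def optimizer_steps(n_train_samples, batch_size, epochs):
    """Total optimizer steps on one GPU, assuming no gradient
    accumulation and that the last partial batch is kept."""
    return math.ceil(n_train_samples / batch_size) * epochs


# Illustrative sample count only, not the real training-set size.
example_steps = optimizer_steps(
    n_train_samples=1000,
    batch_size=finetune_kwargs["per_device_train_batch_size"],
    epochs=finetune_kwargs["num_train_epochs"],
)
```

If the `Trainer` API is used, the dictionary could be passed as `TrainingArguments(output_dir=..., **finetune_kwargs)`, with the output directory chosen by the user.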
Evaluation
Performance on the evaluation set (n = 43,285 sequences):
| Metric | Value |
|---|---|
| Accuracy | 0.90 |
| F1 Score | 0.90 |
| Precision | 0.85 |
| Recall | 0.96 |
| Matthews correlation | 0.80 |
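For reference, the metrics in the table follow the standard definitions for binary classification. The sketch below computes them from confusion-matrix counts; the function names are hypothetical and the counts in the example are illustrative, not the model's actual confusion matrix. It also shows that the reported F1 is consistent with the reported precision and recall.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Standard binary metrics from confusion-matrix counts.

    Assumes all denominators are non-zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn) / mcc_den,
    }


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only.
example = classification_metrics(tp=8, fp=2, tn=9, fn=1)

# Consistency check on the reported values: precision 0.85, recall 0.96.
reported_f1 = round(f1_from(0.85, 0.96), 2)
```

A recall higher than precision, as in the table, means the model misses few nodule-specific promoters at the cost of more false positives.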
Citation
Model card last updated: July 12, 2025