Nodule-AI: A Deep Learning Model for Nodule-Specific Gene Identification

Model Description

This model is a fine-tuned version of zhangtaolab/plant-dnamamba-BPE, specialized for identifying nodule-specific genes from their promoter DNA sequences. The base model was pretrained on plant genomic sequences using a Mamba-based architecture with Byte Pair Encoding (BPE) tokenization; we adapted it to promoter analysis through targeted fine-tuning.

How to Use

An NVIDIA GPU is required.

Installation

conda create -n llms python=3.11
conda activate llms
pip install 'torch<2.4' 'mambapy<=1.2.0' 'transformers<4.46' 'causal-conv1d<=1.3' 'mamba-ssm<2'

A fresh install may take about 15 minutes.

Basic Inference

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer; use the GPU when one is available
# (the CPU fallback is untested and may be slow or unsupported)
model_name = "lhui2010/nodule-AI"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True).to(device)
model.eval()

# Prepare input (3000 bp promoter sequence)
promoter_sequence = "ATGCGTCTCA" * 300  # your promoter here

# Tokenize, then move the tensors to the same device as the model
inputs = tokenizer(
    promoter_sequence,
    return_tensors="pt",
    max_length=3000,
    truncation=True,
    padding="max_length"
).to(device)

# Predict without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    nodule_prob = probs[0, 1].item()  # class index 1 = nodule-specific

print(f"Probability of nodule-specific regulation: {nodule_prob:.4f}")

The output should look like this:

Probability of nodule-specific regulation: 0.0021
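Promoters pulled from a genome rarely arrive as clean 3000 bp strings, so it can help to normalize them before tokenization. The following is a minimal sketch, not the authors' preprocessing: the helper name `prepare_promoter`, the choice to keep the 3' end when trimming, and the rejection of anything outside A/C/G/T/N are all assumptions made for illustration. The model card does not describe how the training promoters were extracted or cleaned.

```python
VALID_BASES = set("ACGTN")


def prepare_promoter(seq: str, length: int = 3000) -> str:
    """Uppercase, validate, and trim a promoter to at most `length` bp.

    Hypothetical helper: keeps the last `length` bases, assuming the
    sequence is written 5'->3' and ends next to the gene start.
    """
    cleaned = "".join(seq.split()).upper()  # drop whitespace and newlines
    unexpected = set(cleaned) - VALID_BASES
    if unexpected:
        raise ValueError(f"Unexpected characters: {sorted(unexpected)}")
    return cleaned[-length:]  # shorter inputs are returned unchanged


# A 4000 bp lowercase/multi-line input is cleaned and trimmed to 3000 bp
promoter_sequence = prepare_promoter("atgc\nGTCTCA" * 400)
print(len(promoter_sequence))
```

The returned string can be passed to the tokenizer in place of `promoter_sequence` in the inference example above.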

Calculation of Shapley scores

Training Data

The model was fine-tuned on a large dataset of plant promoter sequences, including those of nodule-induced genes, compiled from 14 plant genomes of the nitrogen-fixing clade:

| Data Category | Samples | Species Included |
| --- | --- | --- |
| Nodule-specific promoters | 175,365 | Aeschynomene evenia, Alnus trabeculosa, Arachis hypogaea, Chamaecrista pumila, Coriaria nepalensis, Datisca glomerata, Elaeagnus umbellata, Glycine max, Hippophae rhamnoides, Lotus japonicus, Medicago truncatula, Mimosa pudica, Parasponia andersonii, Phaseolus vulgaris |
| Non-nodule promoters | 170,912 | Matching species background sets |

Training Procedure

Fine-tuning Parameters:

  • Epochs: 5
  • Batch size: 8
  • Learning rate: 1e-5
  • Hardware: 1 × Tesla V100 32GB GPU
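For anyone planning a similar run, the hyperparameters above can be recorded in a small config object that also estimates the number of optimizer steps. This is only a sketch: `FineTuneConfig` is a hypothetical name, the step count assumes a single GPU with no gradient accumulation, and `n_train` is a placeholder because the card does not report the size of the training split.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    epochs: int = 5              # from the model card
    batch_size: int = 8          # from the model card
    learning_rate: float = 1e-5  # from the model card

    def total_steps(self, n_train: int) -> int:
        """Optimizer steps, assuming one GPU and no gradient accumulation."""
        return math.ceil(n_train / self.batch_size) * self.epochs


config = FineTuneConfig()
# Placeholder training-set size, chosen only to illustrate the arithmetic
print(config.total_steps(n_train=100_000))
```

If the Hugging Face `Trainer` is used (the card does not say whether it was), these values would map onto the `num_train_epochs`, `per_device_train_batch_size`, and `learning_rate` arguments of `TrainingArguments`.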

Evaluation

Performance on the evaluation set (n = 43,285 sequences):

| Metric | Value |
| --- | --- |
| Accuracy | 0.90 |
| F1 Score | 0.90 |
| Precision | 0.85 |
| Recall | 0.96 |
| Matthews correlation | 0.80 |
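As a reminder of how these metrics relate, the reported F1 score is consistent with the precision and recall: 2 × 0.85 × 0.96 / (0.85 + 0.96) ≈ 0.90. The sketch below computes all five metrics from a binary confusion matrix. The function name `binary_metrics` is hypothetical, and the counts are toy values chosen for illustration; they are not the model's actual confusion matrix, which the card does not report.

```python
import math


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "f1": f1,
        "precision": precision,
        "recall": recall,
        "mcc": mcc,
    }


# Toy counts for illustration only; NOT the model's confusion matrix
metrics = binary_metrics(tp=96, fp=17, fn=4, tn=83)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A high recall paired with lower precision, as reported here, means few nodule-specific promoters are missed while a larger share of positive calls are false positives.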

Citation


Model card last updated: July 12, 2025
