Nodule-AI: A deep learning model for nodule-specific gene identification

Model Description

This model is a fine-tuned version of zhangtaolab/plant-dnamamba-BPE, specialized for identifying nodule-specific genes from their promoter DNA sequences. The base model uses a Mamba architecture with Byte Pair Encoding (BPE) tokenization and was pretrained on plant genomic sequences; we adapted it to promoter analysis through targeted fine-tuning.

How to Use

An NVIDIA GPU is required.

Installation

conda create -n llms python=3.11
conda activate llms
pip install 'torch<2.4' 'mambapy<=1.2.0' 'transformers<4.46' 'causal-conv1d<=1.3' 'mamba-ssm<2'

A fresh install may take about 15 minutes.
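After installation, it can help to confirm that every pinned package actually landed in the environment. The following is a minimal sketch using only the Python standard library; the helper name `installed_versions` is hypothetical, and the distribution names are copied from the pip command above.

```python
from importlib import metadata

# Distribution names copied from the pip command above.
REQUIRED = ["torch", "mambapy", "transformers", "causal-conv1d", "mamba-ssm"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as missing points to a failed step in the install. This check only reads package metadata, so it does not verify that the GPU itself is usable.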

Basic Inference

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "lhui2010/nodule-AI"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True)

# Prepare input (3000bp promoter sequence)
promoter_sequence = "ATGCGTCTCA"*300  # your promoter here

# Tokenize and predict
inputs = tokenizer(
    promoter_sequence,
    return_tensors="pt",
    max_length=3000,
    truncation=True,
    padding="max_length"
)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
    nodule_prob = probs[0][1].item()

print(f"Probability of nodule-specific regulation: {nodule_prob:.4f}")

The output should look like this:

Probability of nodule-specific regulation: 0.0021
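To make the last step of the example concrete, here is a minimal pure-Python sketch of how two class logits become the reported probability. It mirrors the `softmax` call and the `probs[0][1]` lookup above, which treats class index 1 as the nodule-specific class. The function names and the logits are hypothetical and chosen only for illustration; they are not model output.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nodule_probability(logits):
    """Probability of class index 1, read as 'nodule-specific' in the example above."""
    return softmax(logits)[1]


# Hypothetical logits: equal scores for both classes give an even split.
print(f"{nodule_probability([2.0, 2.0]):.4f}")  # 0.5000
```

A probability near zero, as in the sample output, therefore means the logit for class 0 is much larger than the logit for class 1.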

Calculation of Shapley scores

Training Data

The model was fine-tuned on a large dataset of promoter sequences from nodule-induced genes and matched background genes, compiled from 14 plant genomes in the nitrogen-fixing clade:

Data Category	Samples	Species Included
Nodule-specific promoters	175,365	Aeschynomene evenia, Alnus trabeculosa, Arachis hypogaea, Chamaecrista pumila, Coriaria nepalensis, Datisca glomerata, Elaeagnus umbellata, Glycine max, Hippophae rhamnoides, Lotus japonicus, Medicago truncatula, Mimosa pudica, Parasponia andersonii, Phaseolus vulgaris
Non-nodule promoters	170,912	Matching species background sets
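The two sample counts in the table can be combined to check how balanced the classes are, which matters when reading accuracy figures. A short worked calculation, using only the numbers from the table (the variable names are illustrative):

```python
# Sample counts copied from the table above.
n_nodule = 175_365
n_background = 170_912

total = n_nodule + n_background
positive_fraction = n_nodule / total

print(f"Total promoters: {total:,}")
print(f"Nodule-specific fraction: {positive_fraction:.4f}")
```

The two classes are close to evenly split, so a trivial classifier that always predicts one class would score only about 0.5 accuracy on data with this composition.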

Training Procedure

Fine-tuning Parameters:

Epochs: 5
Batch size: 8
Learning rate: 1e-5
Hardware: 1 × Tesla V100 32GB GPU
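For readers who want to reproduce a similar setup, the hyperparameters above can be collected in a small config object. This is only a sketch: the card does not say which training framework was used, so mapping the values onto Hugging Face `TrainingArguments` keywords is an assumption, and `FineTuneConfig` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters listed in this model card."""
    epochs: int = 5
    batch_size: int = 8
    learning_rate: float = 1e-5

    def as_trainer_kwargs(self):
        """Rename fields to the matching transformers.TrainingArguments keywords.

        Assumption: training used (or will use) the Hugging Face Trainer.
        """
        return {
            "num_train_epochs": self.epochs,
            "per_device_train_batch_size": self.batch_size,
            "learning_rate": self.learning_rate,
        }


config = FineTuneConfig()
print(config.as_trainer_kwargs())
```

With a single GPU, the per-device batch size equals the overall batch size of 8. The resulting dictionary could be unpacked into `TrainingArguments` together with an output directory of your choice.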

Evaluation

Performance on the evaluation set (n = 43,285 sequences):

Metric	Value
Accuracy	0.90
F1 Score	0.90
Precision	0.85
Recall	0.96
Matthews correlation	0.80
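For reference, the sketch below shows how these five metrics follow from the four cells of a binary confusion matrix. The counts passed in are toy values chosen for illustration, not the model's confusion matrix, which the card does not report. The last lines check that the reported F1 is consistent with the reported precision and recall.

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Standard binary metrics from confusion-matrix counts (no zero-division guard)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical toy counts, NOT taken from the evaluation above.
toy = classification_metrics(tp=8, fp=2, fn=1, tn=9)
print({name: round(value, 3) for name, value in toy.items()})

# Consistency check on the reported table: F1 is the harmonic mean of
# precision (0.85) and recall (0.96).
f1_from_reported = 2 * 0.85 * 0.96 / (0.85 + 0.96)
print(round(f1_from_reported, 2))  # 0.9
```

In the table, recall is higher than precision, meaning the model misses few nodule-specific promoters at the cost of some false positives.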

Citation

Model card last updated: July 12, 2025

Model size: 96.7M parameters (Safetensors, tensor type F32)
