Protein2Text: Resampling Mechanism to Translate Protein Sequences into Human-Interpretable Text

[NAACL 2025]

Model Details

Model Description

Protein2Text is a multimodal transformer-based model that generates human-interpretable text from protein sequences. It couples a protein sequence encoder (ESM2) with a large language model (LLaMA 3.1-8B Instruct) through a resampling mechanism that bridges the two modalities (an illustrative sketch follows the details below). The model was pretrained and fine-tuned on the Protein2Text-QA dataset, which contains question-answer (QA) pairs generated from biomedical literature.

  • Developed by: TumorAI Lab
  • Model Type: Multimodal Instruction-Tuned Transformer
  • Language(s) (NLP): English (Biomedical Domain)
  • License: Apache 2.0
  • Finetuned from model: meta-llama/Meta-Llama-3.1-8B-Instruct
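
The resampling mechanism itself is described in the paper; the snippet below is only a minimal, illustrative sketch of a generic cross-attention resampler that compresses variable-length ESM2 per-residue embeddings into a fixed number of tokens projected into the LLaMA embedding space. All module names and dimensions here are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionResampler(nn.Module):
    """Illustrative sketch (not the released code): compress variable-length
    per-residue ESM2 states into a fixed set of learned query tokens and
    project them into the language model's embedding space."""

    def __init__(self, protein_dim=1280, llm_dim=4096, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, protein_dim))
        self.attn = nn.MultiheadAttention(protein_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(protein_dim, llm_dim)  # map to the LLM hidden size

    def forward(self, protein_states, key_padding_mask=None):
        # protein_states: (batch, num_residues, protein_dim) from the ESM2 encoder
        batch = protein_states.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        out, _ = self.attn(q, protein_states, protein_states,
                           key_padding_mask=key_padding_mask)
        return self.proj(out)  # (batch, num_queries, llm_dim)

# Example: 2 proteins of 350 residues, ESM2-650M hidden size 1280
resampler = CrossAttentionResampler()
print(resampler(torch.randn(2, 350, 1280)).shape)  # torch.Size([2, 32, 4096])
```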

Model Sources

  • Repository: GitHub Repository
  • Paper: [More Information Needed]
  • Demo: [More Information Needed]

Uses

Direct Use

  • Generating textual descriptions of protein functions from protein sequences (a loading sketch follows this list).
  • Biomedical research and explainable AI applications in genomics and proteomics.
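
A minimal loading sketch with the Hugging Face `transformers` library is given below. It assumes the checkpoint ships custom multimodal modeling code loadable via `trust_remote_code=True`, and the prompt layout shown is hypothetical; consult the GitHub repository for the exact inference interface (in particular, how the amino-acid sequence is routed through the ESM2 encoder).

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumption: the checkpoint exposes a custom multimodal class via trust_remote_code.
model_id = "tumorailab/protein2text-llama3.1-8B-instruct-esm2-650M"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="bfloat16", device_map="auto"
)

# Hypothetical prompt layout: a question followed by the raw amino-acid sequence.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR"
prompt = f"What is the function of this protein?\n{sequence}"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```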

Downstream Use

  • Can be fine-tuned for specific protein annotation tasks.
  • Can be adapted for biomedical question-answering related to proteins.

Out-of-Scope Use

  • Not designed for general NLP tasks outside of biomedical research.
  • Should not be used for clinical decision-making without expert validation.

Bias, Risks, and Limitations

  • The model relies on automatically generated QA pairs, which may introduce hallucinated or inaccurate information.
  • Some rare proteins may not have sufficient training data, leading to unreliable outputs.
  • Always verify outputs with domain experts.
  • Further fine-tuning may be required for specific biomedical applications.

Training Details

Training Data

The model was fine-tuned on the Protein2Text-QA dataset, which includes:

  • Protein-related abstracts retrieved from PubMed Central (PMC).
  • QA pairs generated with LLaMA 3, conditioned on specific protein mentions.

Training Procedure

Preprocessing

  • Abstract cleaning: removal of redundant sections (e.g., "Methods", "Conclusion").
  • QA filtering: removal of uninformative pairs, e.g., answers containing phrases like "no information found" (illustrated in the sketch below).
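
The exact cleaning rules and phrase lists are defined in the paper and repository; the sketch below only illustrates the idea with hypothetical patterns.

```python
import re

# Hypothetical patterns; the authors' actual filtering rules may differ.
SECTION_HEADER = re.compile(r"^(methods|conclusions?)\b[:.]?", re.IGNORECASE)
UNINFORMATIVE_PHRASES = ("no information found", "not mentioned in the abstract")

def clean_abstract(text: str) -> str:
    """Drop paragraphs that start with redundant section headers."""
    kept = [p for p in text.split("\n") if not SECTION_HEADER.match(p.strip())]
    return "\n".join(kept).strip()

def keep_qa_pair(question: str, answer: str) -> bool:
    """Discard QA pairs whose answer signals missing information."""
    answer_lower = answer.lower()
    return not any(phrase in answer_lower for phrase in UNINFORMATIVE_PHRASES)
```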

Training Hyperparameters

| Hyperparameter              | Pretraining            | Fine-tuning            |
|-----------------------------|------------------------|------------------------|
| Global batch size           | 256                    | 128                    |
| Learning rate               | 2 × 10⁻³               | 8 × 10⁻⁶               |
| Epochs                      | 1                      | 5                      |
| Max sequence length         | 2048                   | 2048                   |
| Weight decay                | 0                      | 0                      |
| Precision                   | bf16 (mixed precision) | bf16 (mixed precision) |
| Optimizer                   | AdamW                  | AdamW                  |
| Gradient accumulation steps | 1                      | 1                      |
| Warmup ratio                | 0.03                   | 0.03                   |
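
For readers reproducing the fine-tuning phase with the Hugging Face Trainer, the table maps onto `transformers.TrainingArguments` roughly as sketched below; the per-device batch size is an assumption chosen so the global batch size works out to 128 on the 2-GPU setup listed under Compute Infrastructure.

```python
from transformers import TrainingArguments

# Fine-tuning phase from the table above. per_device_train_batch_size is an
# assumption: 2 GPUs × 64 × 1 gradient-accumulation step = 128 global batch size.
training_args = TrainingArguments(
    output_dir="protein2text-finetune",
    per_device_train_batch_size=64,
    gradient_accumulation_steps=1,
    learning_rate=8e-6,
    num_train_epochs=5,
    weight_decay=0.0,
    warmup_ratio=0.03,
    bf16=True,                 # bfloat16 mixed precision
    optim="adamw_torch",       # AdamW optimizer
)
```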

Evaluation

Metrics

  • BLEU-2, BLEU-4 (for text quality).
  • ROUGE-1, ROUGE-2, ROUGE-L (for relevance).
  • METEOR (for fluency).
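
These metrics can be computed with the Hugging Face `evaluate` library, as in the sketch below (the prediction/reference pair is illustrative, not from the paper; ROUGE and METEOR pull in the `rouge_score` and `nltk` packages).

```python
import evaluate

# Illustrative prediction/reference pair, not taken from the paper.
predictions = ["This protein catalyzes the hydrolysis of ATP."]
references = ["This enzyme hydrolyzes ATP during membrane transport."]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")

print("BLEU-2:", bleu.compute(predictions=predictions, references=references, max_order=2)["bleu"])
print("BLEU-4:", bleu.compute(predictions=predictions, references=references, max_order=4)["bleu"])
print("ROUGE:", rouge.compute(predictions=predictions, references=references))
print("METEOR:", meteor.compute(predictions=predictions, references=references)["meteor"])
```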

Compute Infrastructure

  • Hardware: 2 × NVIDIA H100 PCIe (80 GB)
  • Training time: 12–15 hours

Citation

BibTeX:

@inproceedings{jararweh2025protein2text,
  title={Protein2Text: Resampling Mechanism to Translate Protein Sequences into Human-Interpretable Text},
  author={Jararweh, Ala and Macaulay, Oladimeji and Arredondo, David and Hu, Yue and Tafoya, Luis E and Virupakshappa, Kushal and Sahu, Avinash},
  booktitle={Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)},
  pages={918--937},
  year={2025}
}