🧬 Biomedical NER — Experimental Model (BC5CDR)
This model is a fine-tuned version of microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract on the bc5cdr dataset. It achieves the following results on the evaluation set:
- Loss: 0.0835
- Precision: 0.8582
- Recall: 0.8977
- F1: 0.8775
- Accuracy: 0.9727
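These are entity-level scores: a predicted entity counts as correct only when its full span and type match the gold annotation, the usual seqeval-style convention for BC5CDR NER (an assumption here; the card does not state the metric implementation). A minimal sketch with made-up tag sequences:

```python
# Entity-level scoring sketch with seqeval; the tag sequences are
# illustrative, not outputs of this model or the BC5CDR evaluation set.
from seqeval.metrics import precision_score, recall_score, f1_score

y_true = [["B-Chemical", "O", "O", "B-Disease", "I-Disease", "O"]]
y_pred = [["B-Chemical", "O", "O", "B-Disease", "O", "O"]]

# The partially matched Disease span counts as a miss, so all three
# scores are 0.5 even though most tokens are labeled correctly.
print(precision_score(y_true, y_pred))  # 0.5
print(recall_score(y_true, y_pred))     # 0.5
print(f1_score(y_true, y_pred))         # 0.5
```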
This repository hosts a testing version of a biomedical named entity recognition (NER) model fine-tuned on the BC5CDR dataset, which contains PubMed abstracts annotated for diseases and chemicals.
⚠️ Note: This model is shared for educational and testing purposes only. It is not intended for clinical or production use and may not generalize beyond the BC5CDR dataset.
Model overview
- Base model: PubMedBERT (abstracts) (domain-specific transformer)
- Task: Named Entity Recognition (token classification)
- Entities:
  - Chemical
  - Disease
  - O (non-entity tokens)
The model predicts entity spans following the standard BIO tagging scheme, e.g. B-Chemical, I-Disease, O.
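As a hand-made illustration (not model output), a tagged sentence looks like this:

```python
# Word-level BIO tags for one sentence; the labels here are illustrative.
tokens = ["Aspirin", "may", "cause", "gastric", "bleeding", "."]
tags   = ["B-Chemical", "O", "O", "B-Disease", "I-Disease", "O"]
```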
Training notebook
The full training and preprocessing pipeline, including dataset preparation, sentence splitting, BIO tagging, and evaluation, is documented in this Kaggle notebook.
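As a rough sketch of the BIO-tagging step (the helper function and label mapping below are assumptions for illustration; the notebook is the authoritative reference), word-level labels can be aligned to PubMedBERT's subword tokens via the fast tokenizer's word_ids():

```python
# Hedged sketch: align word-level BIO labels to subword tokens.
# One common choice is shown (copy a word's label to all of its subwords);
# the notebook may handle non-first subwords differently.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"
)

def align_labels(words, word_labels, label2id):
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    enc["labels"] = [
        -100 if wid is None else label2id[word_labels[wid]]  # -100: ignored by the loss
        for wid in enc.word_ids()
    ]
    return enc

label2id = {"O": 0, "B-Chemical": 1, "I-Chemical": 2, "B-Disease": 3, "I-Disease": 4}
example = align_labels(
    ["Aspirin", "may", "cause", "gastric", "bleeding"],
    ["B-Chemical", "O", "O", "B-Disease", "I-Disease"],
    label2id,
)
```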
Example usage
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "Francesco-A/BiomedNLP-PubMedBERT-base-uncased-abstract-bc5cdr-ner-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# Aggregate subword predictions into whole-word entity spans.
ner = pipeline(
    "token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="first"
)

demo_texts = [
    "Aspirin is often used to treat inflammation, but may cause gastric bleeding.",
    "Naloxone reverses the antihypertensive effect of clonidine.",
]

for t in demo_texts:
    print("\nText:", t)
    for p in ner(t):
        print(f' - {p["word"]}: {p["entity_group"]} (score {p["score"]:.2f})')
```
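aggregation_strategy="first" merges subword pieces back into whole words and assigns each word the label of its first subword; the pipeline also accepts "simple", "average", and "max" if you prefer a different way of resolving disagreements between subwords.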
Training procedure
Training hyperparameters
The following hyperparameters were used during training (a sketch of the matching TrainingArguments follows the list):
- learning_rate: 2e-5
- train_batch_size: 64
- eval_batch_size: 64
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 256
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 8
- mixed_precision_training: Native AMP
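For reference, these settings map onto Hugging Face TrainingArguments roughly as below. This is a sketch, not the notebook's exact configuration, and output_dir is a hypothetical name; note that 64 × 4 gradient-accumulation steps gives the effective batch size of 256 listed above.

```python
# Sketch of TrainingArguments matching the listed hyperparameters (assumed mapping).
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="pubmedbert-bc5cdr-ner",  # hypothetical
    learning_rate=2e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    gradient_accumulation_steps=4,  # 64 * 4 = 256 effective train batch size
    num_train_epochs=8,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    seed=42,
    fp16=True,  # native AMP mixed precision
)
```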
Training results
| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|---|
| No log | 0.9836 | 15 | 0.5775 | 0.0299 | 0.0002 | 0.0004 | 0.8627 |
| No log | 1.9672 | 30 | 0.2332 | 0.7252 | 0.5753 | 0.6416 | 0.9284 |
| No log | 2.9508 | 45 | 0.1237 | 0.7610 | 0.8265 | 0.7924 | 0.9606 |
| No log | 4.0 | 61 | 0.0937 | 0.8364 | 0.8859 | 0.8605 | 0.9690 |
| No log | 4.9836 | 76 | 0.0860 | 0.8510 | 0.8927 | 0.8714 | 0.9714 |
| No log | 5.9672 | 91 | 0.0833 | 0.8502 | 0.8995 | 0.8741 | 0.9724 |
| 0.3513 | 6.9508 | 106 | 0.0859 | 0.8494 | 0.8992 | 0.8736 | 0.9714 |
| 0.3513 | 7.8689 | 120 | 0.0835 | 0.8582 | 0.8977 | 0.8775 | 0.9727 |
Framework versions
- Transformers 4.45.0
- PyTorch 2.6.0+cu124
- Datasets 3.6.0
- Tokenizers 0.20.3
Evaluation results (self-reported, BC5CDR validation set)
- Precision: 0.858
- Recall: 0.898
- F1: 0.877
- Accuracy: 0.973