ota-mdeberta-v3-base-ner (Ottoman Turkish NER)

This model is a state-of-the-art Named Entity Recognition (NER) model for Ottoman Turkish, fine-tuned from enesyila/ota-mdeberta-v3-base.
It recognizes PERSON (PER), LOCATION (LOC), ORGANIZATION (ORG), and MISC entities in Ottoman Turkish texts.


Model Details

  • Developed by: Enes Yılandiloğlu
  • Model type: Token classification (NER)
  • Language(s): Ottoman Turkish (ota)
  • License: cc-by-nc-4.0
  • Finetuned from: enesyila/ota-mdeberta-v3-base

How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model = AutoModelForTokenClassification.from_pretrained("enesyila/ota-mdeberta-v3-base-ner")
tokenizer = AutoTokenizer.from_pretrained("enesyila/ota-mdeberta-v3-base-ner")

nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="average")

text = "Aḥmed Paşanın yerine Edrinedeki Meḥmed Efendi Medresesinden Meḥmed Efendi mevsûl oldu."
print(nlp(text))
# [{'entity_group': 'PER', 'score': 0.9800526, 'word': 'Aḥmed Paşa', 'start': 0, 'end': 10},
#  {'entity_group': 'LOC', 'score': 0.95372033, 'word': 'Edrine', 'start': 21, 'end': 27},
#  {'entity_group': 'ORG', 'score': 0.8995747, 'word': 'Meḥmed Efendi Medresesinden', 'start': 32, 'end': 59},
#  {'entity_group': 'PER', 'score': 0.9849827, 'word': 'Meḥmed Efendi', 'start': 60, 'end': 73}]
```
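With aggregation_strategy="average", subword scores are averaged per word and adjacent words sharing an entity type are merged into a single span, which is why the output above contains whole spans like "Aḥmed Paşa" rather than per-token labels. A minimal conceptual sketch (merge_entities is a hypothetical helper for illustration; the real logic lives inside transformers):

```python
def merge_entities(token_preds):
    """Group consecutive word-level predictions that share an entity type,
    averaging their scores; 'O' (outside) predictions are dropped."""
    spans = []
    for word, ent, score in token_preds:
        if ent == "O":
            spans.append(None)  # break adjacency across outside tokens
            continue
        if spans and spans[-1] is not None and spans[-1]["entity_group"] == ent:
            spans[-1]["words"].append(word)
            spans[-1]["scores"].append(score)
        else:
            spans.append({"entity_group": ent, "words": [word], "scores": [score]})
    return [
        {
            "entity_group": s["entity_group"],
            "word": " ".join(s["words"]),
            "score": sum(s["scores"]) / len(s["scores"]),
        }
        for s in spans
        if s is not None
    ]

# Hypothetical word-level predictions for the start of the example sentence:
preds = [("Aḥmed", "PER", 0.99), ("Paşa", "PER", 0.97),
         ("yerine", "O", 1.0), ("Edrine", "LOC", 0.95)]
print(merge_entities(preds))
```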

Training Procedure

  • Loss: Cross-entropy loss
  • Batch size: 16 (train), 16 (eval)
  • Optimizer: AdamW
  • Learning rate: 3e-5
  • Learning rate scheduler: Linear
  • Warmup ratio: 0.06
  • Epochs: 10
  • Gradient checkpointing: Enabled
  • Mixed precision: Enabled (fp16)
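The linear schedule with warmup listed above ramps the learning rate from 0 to 3e-5 over the first 6% of training steps, then decays it linearly back to 0. A small sketch of that shape (total_steps here is hypothetical; in practice it is steps-per-epoch × 10 epochs):

```python
def lr_at_step(step, total_steps, base_lr=3e-5, warmup_ratio=0.06):
    """Linear warmup to base_lr, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

total = 1000  # hypothetical number of optimizer steps
print(lr_at_step(30, total))    # halfway through warmup
print(lr_at_step(60, total))    # warmup ends at the base learning rate
print(lr_at_step(1000, total))  # decays to 0 by the final step
```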

Training Data

The model was fine-tuned on a manually annotated corpus of six classical Ottoman Turkish works, in both prose and verse, transliterated with the IJMES alphabet. The corpus consists of 9,960 NER spans with the labels PER, LOC, ORG, and MISC. The following works were used as training data:

  • Kitâb-ı Fâhir Kıssa-i Anter bin Şeddâd bin Kırâd el-Absî (15th century)
  • Ḳıṣâṣ-i Enbiyâ (16th century)
  • Zeyl-i Şakâʾik (17th century)
  • Veḳâyiʿü'l-Fużala (1731)
  • Neticetü'l-Fikriyye (18th century)
  • Silkü'l-Leʾal-i ʿÂl-i Os̱mân (18th century)

Named entity distribution by dataset split (roughly 80/10/10):

| Split | LOC  | MISC | ORG  | PER  | Total |
|-------|------|------|------|------|-------|
| Train | 1651 | 777  | 1374 | 4152 | 7954  |
| Dev   | 165  | 123  | 167  | 556  | 1011  |
| Test  | 223  | 127  | 147  | 498  | 995   |
| Total | 2039 | 1027 | 1688 | 5206 | 9960  |

Evaluation Results

Span-level results on the test set:

| Label | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| LOC   | 0.8857    | 0.9073 | 0.8964   | 205     |
| MISC  | 0.7027    | 0.7761 | 0.7376   | 134     |
| ORG   | 0.8571    | 0.8344 | 0.8456   | 151     |
| PER   | 0.8985    | 0.9415 | 0.9195   | 564     |

Span-level (micro avg):

  • Precision: 0.8641
  • Recall: 0.8985
  • F1: 0.8809

Span-level (macro avg):

  • Precision: 0.8360
  • Recall: 0.8648
  • F1: 0.8498
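The macro averages are the unweighted mean of the four per-label span-level scores in the table above (each label counts equally, regardless of support). A quick sanity check:

```python
# Per-label (precision, recall, F1) from the span-level test-set table.
per_label = {
    "LOC":  (0.8857, 0.9073, 0.8964),
    "MISC": (0.7027, 0.7761, 0.7376),
    "ORG":  (0.8571, 0.8344, 0.8456),
    "PER":  (0.8985, 0.9415, 0.9195),
}
# Macro average = unweighted mean across labels, per metric.
macro_p, macro_r, macro_f1 = (
    sum(scores[i] for scores in per_label.values()) / len(per_label)
    for i in range(3)
)
print(f"macro P={macro_p:.4f} R={macro_r:.4f} F1={macro_f1:.4f}")
```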

Token-level results on the test set (excluding the O label):

| Label  | Precision | Recall | F1-score | Support |
|--------|-----------|--------|----------|---------|
| B-PER  | 0.9286    | 0.9669 | 0.9474   | 242     |
| I-PER  | 0.9800    | 0.9770 | 0.9785   | 4254    |
| B-LOC  | 0.9459    | 0.9211 | 0.9333   | 76      |
| I-LOC  | 0.9073    | 0.9375 | 0.9221   | 720     |
| B-ORG  | 0.9667    | 0.7632 | 0.8529   | 38      |
| I-ORG  | 0.9690    | 0.9265 | 0.9473   | 1115    |
| B-MISC | 0.8571    | 0.8136 | 0.8348   | 59      |
| I-MISC | 0.8716    | 0.8768 | 0.8742   | 852     |
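The B-/I- prefixes in the table follow the BIO scheme: B- marks the first token of an entity span, I- its continuation, and O any token outside an entity. A hypothetical illustration using the opening words of the example sentence:

```python
# Convert a span annotation to BIO tags (illustrative helper, not model code).
tokens = ["Aḥmed", "Paşanın", "yerine"]
tags = ["O"] * len(tokens)

start, end, label = 0, 2, "PER"  # assume "Aḥmed Paşanın" is annotated as PER
tags[start] = f"B-{label}"       # first token of the span
for i in range(start + 1, end):
    tags[i] = f"I-{label}"       # continuation tokens

print(tags)  # ['B-PER', 'I-PER', 'O']
```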

Model Card Author

Enes Yılandiloğlu

Model Card Contact

[email protected]
