This model is the state-of-the-art Named Entity Recognition (NER) model for Ottoman Turkish, fine-tuned from enesyila/ota-mdeberta-v3-base.
It recognizes PERSON, LOCATION, ORGANIZATION, and MISC entities in Ottoman Turkish texts.
Use the code below to get started with the model.
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model = AutoModelForTokenClassification.from_pretrained("enesyila/ota-mdeberta-v3-base-ner")
tokenizer = AutoTokenizer.from_pretrained("enesyila/ota-mdeberta-v3-base-ner")
nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="average")

text = "Aḥmed Paşanın yerine Edrinedeki Meḥmed Efendi Medresesinden Meḥmed Efendi mevsûl oldu."
print(nlp(text))
```
Output:

```python
[{'entity_group': 'PER', 'score': 0.9800526, 'word': 'Aḥmed Paşa', 'start': 0, 'end': 10},
 {'entity_group': 'LOC', 'score': 0.95372033, 'word': 'Edrine', 'start': 21, 'end': 27},
 {'entity_group': 'ORG', 'score': 0.8995747, 'word': 'Meḥmed Efendi Medresesinden', 'start': 32, 'end': 59},
 {'entity_group': 'PER', 'score': 0.9849827, 'word': 'Meḥmed Efendi', 'start': 60, 'end': 73}]
```
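If you need the predictions grouped by entity type rather than as a flat list, a small post-processing step is enough. The sketch below reuses the `nlp` pipeline and `text` from the snippet above; the `group_entities` helper is just an illustrative name, not part of the model or the transformers API.

```python
from collections import defaultdict

def group_entities(predictions):
    """Collect the recognized surface forms under their entity labels."""
    grouped = defaultdict(list)
    for ent in predictions:
        grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)

print(group_entities(nlp(text)))
# {'PER': ['Aḥmed Paşa', 'Meḥmed Efendi'], 'LOC': ['Edrine'], 'ORG': ['Meḥmed Efendi Medresesinden']}
```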
The model was fine-tuned on a manually annotated corpus of six classical Ottoman Turkish works, in both prose and verse, transliterated with the IJMES transliteration alphabet and comprising 9,960 NER spans labeled PER, LOC, ORG, and MISC.
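For context, fine-tuning a token-classification model on span annotations like these usually involves projecting the character-offset spans onto tokens as BIO tags. The snippet below is only a minimal illustration of that idea, using whitespace tokenization and a hypothetical `spans_to_bio` helper; it is not the preprocessing actually used for this corpus.

```python
def spans_to_bio(text, spans):
    """Project (start, end, label) character spans onto whitespace tokens as BIO tags."""
    tokens, tags, pos = [], [], 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        pos = end
        tag = "O"
        for s, e, label in spans:
            if start < e and end > s:  # token overlaps the annotated span
                tag = ("B-" if start == s else "I-") + label
                break
        tokens.append(token)
        tags.append(tag)
    return list(zip(tokens, tags))

example = "Aḥmed Paşanın yerine Edrinedeki Meḥmed Efendi geldi."
print(spans_to_bio(example, [(0, 10, "PER"), (21, 27, "LOC"), (32, 45, "PER")]))
# [('Aḥmed', 'B-PER'), ('Paşanın', 'I-PER'), ('yerine', 'O'), ('Edrinedeki', 'B-LOC'),
#  ('Meḥmed', 'B-PER'), ('Efendi', 'I-PER'), ('geldi.', 'O')]
```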
The following works were used as training data:
Named entity distribution by dataset split (roughly 80/10/10):
| Split | LOC | MISC | ORG | PER | TOTAL |
|---|---|---|---|---|---|
| Train | 1651 | 777 | 1374 | 4152 | 7954 |
| Dev | 165 | 123 | 167 | 556 | 1011 |
| Test | 223 | 127 | 147 | 498 | 995 |
| Total | 2039 | 1027 | 1688 | 5206 | 9960 |
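As a quick check, the rough 80/10/10 proportions can be recomputed from the span totals in the table above:

```python
# Split proportions derived from the span totals listed above.
totals = {"Train": 7954, "Dev": 1011, "Test": 995}
overall = sum(totals.values())  # 9960
for split, count in totals.items():
    print(f"{split}: {count / overall:.1%}")
# Train: 79.9%, Dev: 10.2%, Test: 10.0%
```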
Span-level results on the test set:
| Label | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| LOC | 0.8857 | 0.9073 | 0.8964 | 205 |
| MISC | 0.7027 | 0.7761 | 0.7376 | 134 |
| ORG | 0.8571 | 0.8344 | 0.8456 | 151 |
| PER | 0.8985 | 0.9415 | 0.9195 | 564 |
Span-level (micro avg):
Span-level (macro avg):
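Span-level scores of this kind are commonly produced with the seqeval library, whose `classification_report` also prints the micro and macro averages. The snippet below only sketches that evaluation setup on toy tag sequences; it is not the exact script behind the numbers above.

```python
from seqeval.metrics import classification_report

# Toy gold and predicted BIO sequences, one inner list per sentence.
y_true = [["B-PER", "I-PER", "O", "B-LOC", "O"], ["B-ORG", "I-ORG", "O", "B-PER", "I-PER"]]
y_pred = [["B-PER", "I-PER", "O", "B-LOC", "O"], ["B-ORG", "I-ORG", "O", "B-PER", "O"]]

# Per-label precision/recall/F1 at the span level, plus micro and macro averages.
print(classification_report(y_true, y_pred, digits=4))
```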
Results per BIO tag:

| Label | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| B-PER | 0.9286 | 0.9669 | 0.9474 | 242 |
| I-PER | 0.9800 | 0.9770 | 0.9785 | 4254 |
| B-LOC | 0.9459 | 0.9211 | 0.9333 | 76 |
| I-LOC | 0.9073 | 0.9375 | 0.9221 | 720 |
| B-ORG | 0.9667 | 0.7632 | 0.8529 | 38 |
| I-ORG | 0.9690 | 0.9265 | 0.9473 | 1115 |
| B-MISC | 0.8571 | 0.8136 | 0.8348 | 59 |
| I-MISC | 0.8716 | 0.8768 | 0.8742 | 852 |
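The table above reports per-tag scores in the BIO scheme. Metrics of this shape can be obtained, for example, with scikit-learn's `classification_report` over flattened tag sequences; the snippet below is a hedged illustration on toy data, not the original evaluation code.

```python
from sklearn.metrics import classification_report

# Flattened gold and predicted BIO tags for a toy example.
true_tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]
pred_tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "B-PER"]

# Precision/recall/F1 per individual tag (including "O").
print(classification_report(true_tags, pred_tags, digits=4))
```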
Author: Enes Yılandiloğlu

Base model: microsoft/mdeberta-v3-base