# Model Card for enesyila/ota-roberta-base-kadisicilleri
This is a fill-mask model for Ottoman Turkish, fine-tuned on Ottoman Turkish court records (kadı sicilleri) and therefore optimized for legal/administrative texts. The data consists of 15,060,44 words and 26,260,597 tokens from 63,608 court records from Istanbul (1513-1813) and Diyarbakır (1654-1919).
## Model Details
### Model Description
- Developed by: Enes Yılandiloğlu
- Shared by: Enes Yılandiloğlu
- Model type: fill-mask
- Language(s) (NLP): Ottoman Turkish (1500-1928)
- License: cc-by-nc-4.0
- Fine-tuned from model: FacebookAI/xlm-roberta-base
## Uses
### Direct Use
Mask filling and completion of Ottoman Turkish sentences. The model was built specifically for Ottoman Turkish court records.
### Downstream Use
The model can serve as a base for further fine-tuning on downstream tasks such as the following (a minimal token-classification sketch follows the list):
- Named Entity Recognition
- UD-style annotation
- Translation
- Classification
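As an illustration of downstream use, the sketch below loads the checkpoint for token classification (e.g., NER). The label set is a hypothetical example, and the dataset and training loop are left to the user; none of this is part of the released model.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "enesyila/ota-roberta-base-kadisicilleri"

# Hypothetical label scheme for illustration; replace with your own.
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# Fine-tune from here with transformers.Trainer on a labeled dataset.
```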
### Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.
The training data does not follow the IJMES transliteration system. For example, the word ﻗﺎﺿﻰ was transliterated as kâdî rather than the IJMES form ḳāḍī. It is therefore recommended not to feed the model IJMES-transliterated data. However, the characters â, î, ô, and û are present in the training data, so there is no need to convert â into a to use the model properly.
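As an illustration, the snippet below maps a few IJMES characters to the convention used in the training data. The mapping covers only the characters attested in the kâdî/ḳāḍī example (plus the analogous ū → û), so treat it as a partial, assumed starting point rather than an official conversion table.

```python
# Illustrative, non-exhaustive mapping from IJMES transliteration to the
# training-data convention (e.g., ḳāḍī -> kâdî). Extend as needed.
IJMES_TO_TRAINING = str.maketrans({
    "ḳ": "k",
    "ḍ": "d",
    "ā": "â",
    "ī": "î",
    "ū": "û",
})

def normalize(text: str) -> str:
    return text.translate(IJMES_TO_TRAINING)

print(normalize("ḳāḍī"))  # -> kâdî
```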
## How to Get Started with the Model
Use the code below to get started with the model.
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# 1. Load the fine-tuned model and tokenizer
model_name = "enesyila/ota-roberta-base-kadisicilleri"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# 2. Create a mask-filling pipeline
unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# 3. Run it on an Ottoman Turkish sentence
sequence = "Mehmed <mask> mahkemeye geldi"
results = unmasker(sequence)

# 4. Print the top 5 predictions
for r in results:
    print(f"{r['sequence']} (score: {r['score']:.4f})")
```
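Each prediction returned by the fill-mask pipeline is a dictionary containing the completed sequence, the predicted token, and a confidence score; by default the pipeline returns the top five candidates.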
## Training Details
### Training Data
The data consists of 15,060,44 words and 26,260,597 tokens from 63,608 court records from Istanbul (1513-1813) and Diyarbakır (1654-1919). Data for the Istanbul records can be found here, and data for the Diyarbakır records here.
### Training Procedure
#### Training Hyperparameters
- Chunk size: 512 tokens

Batching:
- per_device_train_batch_size: 32
- per_device_eval_batch_size: 64

Optimizer & Schedule:
- Optimizer: AdamW
- Learning rate: 2 × 10⁻⁵
- LR scheduler: cosine
- Weight decay: 0.01
- Warmup ratio: 0.1
- Number of epochs: 9
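For reference, here is a minimal sketch of how these hyperparameters map onto transformers.TrainingArguments; the output directory and any options not listed above are assumptions, not the exact training script.

```python
from transformers import TrainingArguments

# Hyperparameters from the list above; output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="ota-roberta-base-kadisicilleri",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    weight_decay=0.01,
    warmup_ratio=0.1,
    num_train_epochs=9,
)
```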
## Evaluation
- Evaluation loss: 0.4429
- Perplexity: 1.56
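The reported perplexity is simply the exponential of the evaluation (cross-entropy) loss:

```python
import math

eval_loss = 0.4429
perplexity = math.exp(eval_loss)  # ≈ 1.5572, reported as 1.56
print(f"{perplexity:.2f}")
```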
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{xlm-roberta-base-ota-kadisicilleri,
  author       = {Yılandıloğlu, Enes},
  title        = {XLM-RoBERTa Base for Ottoman Turkish Court Records},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/enesyila/ota-roberta-base-kadisicilleri}}
}
```
## Model Card Authors
Enes Yılandiloğlu
## Model Card Contact
[email protected]