# Model Card for ota-roberta-base-kadisicilleri

This is a fill-mask model for Ottoman Turkish, finetuned on Ottoman Turkish court records (kadı sicilleri) and therefore optimized for legal/administrative texts. The data consists of 15,060,44 words and 26,260,597 tokens from 63,608 court records from Istanbul (1513-1813) and Diyarbakır (1654-1919).

## Model Details

### Model Description

- Developed by: Enes Yılandiloğlu
- Shared by: Enes Yılandiloğlu
- Model type: fill-mask
- Language(s) (NLP): Ottoman Turkish (1500-1928)
- License: cc-by-nc-4.0
- Finetuned from model: FacebookAI/xlm-roberta-base

## Uses

### Direct Use

Mask filling and completion of Ottoman Turkish sentences. The model was built specifically for Ottoman Turkish court records.

### Downstream Use

- Named Entity Recognition (see the sketch below)
- UD-style annotation
- Translation
- Classification
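
The checkpoint can serve as a backbone for such tasks. As a starting point for NER, the hedged sketch below loads the model with a freshly initialized token-classification head; the tag set is a hypothetical placeholder, and the head must be finetuned on annotated data before use.

```python
# Minimal sketch (not from the model card) of adapting this checkpoint to
# token classification, e.g. NER on court records. The label set below is
# a hypothetical placeholder; the classification head is newly initialized
# and needs training on annotated data.
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "enesyila/ota-roberta-base-kadisicilleri"
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]  # placeholder tag set

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# From here, train with transformers.Trainer (or your own loop) on an
# annotated corpus.
```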

## Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.

The training data does not follow the IJMES transliteration system. For example, the word ﻗﺎﺿﻰ was transliterated as kâdî rather than its IJMES form, ḳāḍī. IJMES-transliterated input is therefore not recommended. However, the characters â, î, ô, and û are present in the training data, so there is no need to convert â to a before using the model.
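
If your input is already in IJMES transliteration, a simple character mapping can bring it closer to the style of the training data. The sketch below is illustrative only; the mapping is a partial assumption and should be checked against your own material.

```python
# Illustrative, unofficial mapping from common IJMES characters to the
# style seen in the training data (e.g. ḳāḍī -> kâdî, as in the example
# above). Partial and assumption-based; verify before relying on it.
IJMES_TO_TRAINING = str.maketrans({
    "ḳ": "k", "ḍ": "d", "ṭ": "t", "ṣ": "s", "ẓ": "z", "ḥ": "h", "ġ": "g",
    "ā": "â", "ī": "î", "ū": "û",  # long vowels keep their circumflex forms
})

def normalize_ijmes(text: str) -> str:
    return text.translate(IJMES_TO_TRAINING)

print(normalize_ijmes("ḳāḍī"))  # -> kâdî
```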

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# 1. Load the finetuned model and tokenizer
model_name = "enesyila/ota-roberta-base-kadisicilleri"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# 2. Create a mask-filling pipeline
unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# 3. Run it on an Ottoman Turkish sentence; <mask> is the model's mask token
sequence = "Mehmed <mask> mahkemeye geldi"
results = unmasker(sequence)

# 4. Print the top 5 predictions (the pipeline returns 5 by default)
for r in results:
    print(f"{r['sequence']} (score: {r['score']:.4f})")
```

## Training Details

### Training Data

The data consists of 15,060,44 words and 26,260,597 tokens from 63,608 court records from Istanbul (1513-1813) and Diyarbakır (1654-1919). The Istanbul records can be found here, and the Diyarbakır records here.

### Training Procedure

#### Training Hyperparameters

- Training regime: [More Information Needed]
- Chunk size: 512 tokens
- Batching:
  - per_device_train_batch_size=32
  - per_device_eval_batch_size=64
- Optimizer & schedule:
  - Optimizer: AdamW
  - Learning rate: 2 × 10⁻⁵
  - Learning rate scheduler: cosine
  - Weight decay: 0.01
  - Warmup ratio: 0.1
- Training schedule:
  - Number of epochs: 9
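
For reference, the settings above correspond roughly to the following `transformers.TrainingArguments`. This is a reconstruction from the list, not the original training script; values not stated above (such as `output_dir`) are placeholders.

```python
from transformers import TrainingArguments

# Reconstructed from the hyperparameters listed above; output_dir is a
# placeholder, and AdamW is the transformers default optimizer.
training_args = TrainingArguments(
    output_dir="ota-roberta-base-kadisicilleri",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,            # 2 × 10⁻⁵
    lr_scheduler_type="cosine",
    weight_decay=0.01,
    warmup_ratio=0.1,
    num_train_epochs=9,
)
# Note: the 512-token chunk size is applied when preprocessing the corpus,
# not via TrainingArguments.
```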

## Evaluation

- Evaluation loss: 0.4429
- Perplexity: 1.56
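
The reported perplexity is consistent with the exponential of the evaluation loss:

```python
import math
print(math.exp(0.4429))  # ≈ 1.5572, reported above as 1.56
```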

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{xlm-roberta-base-ota-kadisicilleri,
  author = {Yılandıloğlu, Enes},
  title = {XLM-RoBERTa Base for Ottoman Turkish Court Records},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/enesyila/ota-roberta-base-kadisicilleri}}
}
```
## Model Card Authors

Enes Yılandiloğlu

## Model Card Contact

[email protected]