# Model Card for ota-roberta-base-kadisicilleri

This is a fill-mask model for Ottoman Turkish, finetuned on Ottoman Turkish court records (kadı sicilleri) and therefore optimized for legal/administrative texts. The data consists of 15,060,44 words and 26,260,597 tokens from 63,608 court records from Istanbul (1513-1813) and Diyarbakır (1654-1919).

## Model Details

### Model Description

- Developed by: Enes Yılandiloğlu
- Shared by: Enes Yılandiloğlu
- Model type: fill-mask
- Language(s) (NLP): Ottoman Turkish (1500-1928)
- License: cc-by-nc-4.0
- Finetuned from model: FacebookAI/xlm-roberta-base

## Uses

### Direct Use

Mask filling and completion of Ottoman Turkish sentences. The model was built specifically for Ottoman Turkish court records.

### Downstream Use

- Named Entity Recognition (see the sketch below)
- UD-style annotation
- Translation
- Classification
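
The checkpoint can serve as a backbone for such tasks. As a starting point for NER, the hedged sketch below loads the model with a freshly initialized token-classification head; the tag set is a hypothetical placeholder, and the head must be finetuned on annotated data before use.

```python
# Minimal sketch (not from the model card) of adapting this checkpoint to
# token classification, e.g. NER on court records. The label set below is
# a hypothetical placeholder; the classification head is newly initialized
# and needs training on annotated data.
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "enesyila/ota-roberta-base-kadisicilleri"
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]  # placeholder tag set

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# From here, train with transformers.Trainer (or your own loop) on an
# annotated corpus.
```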

## Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.

The training data does not follow the IJMES transliteration system. For example, the word ﻗﺎﺿﻰ was transliterated as kâdî rather than its IJMES form, ḳāḍī. IJMES-transliterated input is therefore not recommended. However, the characters â, î, ô, and û are present in the training data, so there is no need to convert â to a before using the model.
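
If your input is already in IJMES transliteration, a simple character mapping can bring it closer to the style of the training data. The sketch below is illustrative only; the mapping is a partial assumption and should be checked against your own material.

```python
# Illustrative, unofficial mapping from common IJMES characters to the
# style seen in the training data (e.g. ḳāḍī -> kâdî, as in the example
# above). Partial and assumption-based; verify before relying on it.
IJMES_TO_TRAINING = str.maketrans({
    "ḳ": "k", "ḍ": "d", "ṭ": "t", "ṣ": "s", "ẓ": "z", "ḥ": "h", "ġ": "g",
    "ā": "â", "ī": "î", "ū": "û",  # long vowels keep their circumflex forms
})

def normalize_ijmes(text: str) -> str:
    return text.translate(IJMES_TO_TRAINING)

print(normalize_ijmes("ḳāḍī"))  # -> kâdî
```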

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# 1. Load the finetuned model and tokenizer
model_name = "enesyila/ota-roberta-base-kadisicilleri"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# 2. Create a mask-filling pipeline
unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# 3. Run it on an Ottoman Turkish sentence; <mask> is the model's mask token
sequence = "Mehmed <mask> mahkemeye geldi"
results = unmasker(sequence)

# 4. Print the top 5 predictions (the pipeline returns 5 by default)
for r in results:
    print(f"{r['sequence']} (score: {r['score']:.4f})")
```

## Training Details

### Training Data

The data consists of 15,060,44 words and 26,260,597 tokens from 63,608 court records from Istanbul (1513-1813) and Diyarbakır (1654-1919). The Istanbul records can be found here, and the Diyarbakır records here.

### Training Procedure

#### Training Hyperparameters

- Training regime: [More Information Needed]
- Chunk size: 512 tokens
- Batching:
  - per_device_train_batch_size=32
  - per_device_eval_batch_size=64
- Optimizer & schedule:
  - Optimizer: AdamW
  - Learning rate: 2 × 10⁻⁵
  - Learning rate scheduler: cosine
  - Weight decay: 0.01
  - Warmup ratio: 0.1
- Training schedule:
  - Number of epochs: 9
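
For reference, the settings above correspond roughly to the following `transformers.TrainingArguments`. This is a reconstruction from the list, not the original training script; values not stated above (such as `output_dir`) are placeholders.

```python
from transformers import TrainingArguments

# Reconstructed from the hyperparameters listed above; output_dir is a
# placeholder, and AdamW is the transformers default optimizer.
training_args = TrainingArguments(
    output_dir="ota-roberta-base-kadisicilleri",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,            # 2 × 10⁻⁵
    lr_scheduler_type="cosine",
    weight_decay=0.01,
    warmup_ratio=0.1,
    num_train_epochs=9,
)
# Note: the 512-token chunk size is applied when preprocessing the corpus,
# not via TrainingArguments.
```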

## Evaluation

- Evaluation loss: 0.4429
- Perplexity: 1.56
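
The reported perplexity is consistent with the exponential of the evaluation loss:

```python
import math
print(math.exp(0.4429))  # ≈ 1.5572, reported above as 1.56
```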

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{xlm-roberta-base-ota-kadisicilleri,
  author = {Yılandıloğlu, Enes},
  title = {XLM-RoBERTa Base for Ottoman Turkish Court Records},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/enesyila/ota-roberta-base-kadisicilleri}}
}
```
## Model Card Authors

Enes Yılandiloğlu

## Model Card Contact

[email protected]