Model Card for enesyila/ota-ud-style
This model provides UD-style annotation of Ottoman Turkish sentences: lemma, UPOS, XPOS, morphological features, and dependency relations. It was trained with the MaChAmp toolkit (van der Goot et al., 2021).
Model Details
Model Description
- Developed by: Enes Yılandiloğlu
- Shared by: Enes Yılandiloğlu
- Model type: token classification
- Language(s) (NLP): Ottoman Turkish (1500-1928)
- License: cc-by-nc-4.0
- Finetuned from model: FacebookAI/xlm-roberta-base
Uses
The model can be used to jointly annotate Ottoman Turkish sentences for lemma, UPOS, XPOS, morphological features, and dependency.
Bias, Risks, and Limitations
Due to the wide variation in Ottoman Turkish language use, the model may fail to annotate some sentences correctly, such as those consisting mostly of Arabic praise formulas.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. Its output is not ground truth and should be checked manually.
How to Get Started with the Model
Use the command below to get started with the model.

```bash
python3 predict.py enesyila/ota-ud-style/model.pt data.conll prediction.out --device 0
```

See further instructions at https://github.com/machamp-nlp/machamp. It is also possible to extract the model as a plain transformer model, as explained on the MaChAmp GitHub page.
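As a minimal sketch (not part of the official MaChAmp workflow), the snippet below builds a `data.conll` input file from pre-tokenized sentences, assuming the standard 10-column CoNLL-U layout with only the ID and FORM columns filled and `_` placeholders elsewhere; adjust it to whatever input format your MaChAmp version expects.

```python
# Sketch: write pre-tokenized sentences to a CoNLL-U file for prediction.
# Assumes the standard 10-column CoNLL-U layout; columns other than ID and
# FORM are left as "_" placeholders to be filled in by the model.

def write_conllu(sentences, path):
    """sentences: list of lists of tokens (already split)."""
    with open(path, "w", encoding="utf-8") as f:
        for tokens in sentences:
            for i, form in enumerate(tokens, start=1):
                # ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
                f.write("\t".join([str(i), form] + ["_"] * 8) + "\n")
            f.write("\n")  # a blank line ends a sentence

write_conllu([["örnek", "cümle", "."]], "data.conll")
```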
Training Details
Training Data
The model was trained on the UD_Ottoman_Turkish-DUDU Universal Dependencies treebank.
The dataset contains morphologically annotated Ottoman Turkish text in CoNLL-U format, including UPOS, XPOS, morphological features, lemmas, and syntactic dependencies.
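For readers unfamiliar with the format, the sketch below shows how the annotation layers map to CoNLL-U columns (field positions follow the Universal Dependencies specification; this is illustrative and not part of the treebank's tooling).

```python
# Sketch: read a CoNLL-U file and collect the annotation layers used by this
# model (lemma, UPOS, XPOS, morphological features, head, dependency relation).
def read_conllu(path):
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                            # blank line ends a sentence
                if current:
                    sentences.append(current)
                    current = []
                continue
            if line.startswith("#"):                # sentence-level comments
                continue
            cols = line.split("\t")
            if "-" in cols[0] or "." in cols[0]:    # skip multiword/empty tokens
                continue
            current.append({
                "form": cols[1], "lemma": cols[2], "upos": cols[3],
                "xpos": cols[4], "feats": cols[5], "head": cols[6],
                "deprel": cols[7],
            })
    if current:
        sentences.append(current)
    return sentences
```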
Training Procedure
Preprocessing
- Input format: CoNLL-U, tokenized and morphologically annotated.
- No additional normalization beyond dataset defaults.
- Tokenization handled by the underlying enesyila/ota-roberta-base SentencePiece tokenizer.
- Max input length: 128 subword tokens.
- Tokens were pre-split for sequence labeling tasks (`tok.pre_split=true` in config); see the sketch after this list.
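The following sketch illustrates how the pre-split and max-length settings interact, using the Hugging Face `transformers` tokenizer API; it is not MaChAmp's internal preprocessing code.

```python
# Sketch: tokenize pre-split tokens with the enesyila/ota-roberta-base
# SentencePiece tokenizer, truncating to 128 subword pieces as in training.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("enesyila/ota-roberta-base")
tokens = ["örnek", "cümle", "."]          # pre-split tokens (tok.pre_split=true)
encoding = tokenizer(
    tokens,
    is_split_into_words=True,             # keep the given token boundaries
    truncation=True,
    max_length=128,                       # max input length used in training
)
print(encoding.word_ids())                # map subwords back to original tokens
```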
Training Hyperparameters
- Base transformer model: `enesyila/ota-roberta-base`
- Encoder dropout: 0.2
- Update encoder weights: Yes
- Random seed: 8446
- Epochs: 60 (best epoch: 59)
- Batch size: 8
- Max tokens per batch: 1024
- Optimizer: AdamW (`betas=(0.9, 0.99)`, `lr=2e-5`, `weight_decay=0.01`); see the sketch after this list
- Learning rate schedule:
  - Discriminative fine-tuning: enabled
  - Gradual unfreezing: enabled
  - `cut_frac=0.3`, `decay_factor=0.38`
- Loss weights per task:
- Dependency parsing: 1.0 (LAS metric)
- Lemmatization: 0.8 (accuracy metric)
- Morphological features: 1.0 (accuracy metric)
- UPOS: 0.5 (accuracy metric)
- XPOS: 0.5 (accuracy metric)
- Layers used for decoding: last 3 layers (`[-1, -2, -3]`)
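To make the optimizer settings concrete, here is a minimal PyTorch sketch of AdamW with the listed hyperparameters and layer-wise (discriminative) learning-rate decay using `decay_factor=0.38`. MaChAmp configures this through its own config files, so this only illustrates the idea, not the exact implementation.

```python
# Sketch: AdamW with the listed hyperparameters plus layer-wise learning-rate
# decay, illustrating discriminative fine-tuning (decay applied per layer).
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("enesyila/ota-roberta-base")
base_lr, decay_factor = 2e-5, 0.38

# Lower layers get smaller learning rates: lr * decay_factor ** distance_from_top.
num_layers = model.config.num_hidden_layers
param_groups = []
for i, layer in enumerate(model.encoder.layer):
    lr = base_lr * decay_factor ** (num_layers - 1 - i)
    param_groups.append({"params": layer.parameters(), "lr": lr})
param_groups.append({"params": model.embeddings.parameters(),
                     "lr": base_lr * decay_factor ** num_layers})

optimizer = torch.optim.AdamW(param_groups, lr=base_lr,
                              betas=(0.9, 0.99), weight_decay=0.01)
```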
Speeds, Sizes, Times
- Max GPU memory used: 5.67 GB
- CPU RAM usage: ~2.14 GB
- Average epoch time: ~28 seconds
- Total training time: ~29 minutes 31 seconds
- Final checkpoint size: ~1.06 GB (`model.safetensors`)
Evaluation
| Task | Dev Score |
|---|---|
| Dependency (LAS) | 0.6370 |
| Lemma | 0.8618 |
| Morph | 0.7689 |
| UPOS | 0.9145 |
| XPOS | 0.8984 |
Metrics
The model was evaluated on the UD_Ottoman_Turkish-DUDU dev set using task-appropriate metrics for each annotation type (a computation sketch follows the list):
- Labeled Attachment Score (LAS) for dependency parsing — measures the percentage of tokens that are assigned both the correct head and dependency relation label.
- Accuracy for UPOS (Universal POS) tagging — proportion of tokens with correctly predicted universal POS tags.
- Accuracy for XPOS (Language-specific POS) tagging — proportion of tokens with correctly predicted language-specific POS tags.
- Accuracy for Morphological Features — proportion of tokens with all morphological features predicted exactly.
- Accuracy for Lemmatization — proportion of tokens with the correct base form predicted.
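As a sanity-check sketch, LAS and the per-column accuracies can be computed from aligned gold and predicted CoNLL-U files as below, using the `read_conllu()` helper sketched earlier and assuming identical tokenization. This mirrors the metric definitions above rather than MaChAmp's own evaluation code.

```python
# Sketch: compute LAS and token-level accuracies from aligned gold/predicted
# sentences as returned by read_conllu() above (identical tokenization assumed).
def evaluate(gold_sents, pred_sents):
    total = las = 0
    correct = {"lemma": 0, "upos": 0, "xpos": 0, "feats": 0}
    for gold, pred in zip(gold_sents, pred_sents):
        for g, p in zip(gold, pred):
            total += 1
            if g["head"] == p["head"] and g["deprel"] == p["deprel"]:
                las += 1                      # correct head AND relation label
            for key in correct:
                if g[key] == p[key]:
                    correct[key] += 1
    scores = {k: v / total for k, v in correct.items()}
    scores["las"] = las / total
    return scores
```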
Citation
Yılandiloğlu, E., & Siewert, J. (2025). DUDU: A Treebank for Ottoman Turkish in UD Style. In Š. A. Holdt, N. Ilinykh, B. Scalvini, M. Bruton, I. N. Debess, & C. M. Tudor (Eds.), The Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL 2025) (pp. 74–79). University of Tartu Library.
Model Card Authors
Enes Yılandiloğlu