Model Card for enesyila/ota-ud-style
This model provides UD-style annotation of Ottoman Turkish sentences: lemma, UPOS, XPOS, morphological features, and dependency relations. It was trained with the MaChAmp toolkit (van der Goot et al., 2021).
Model Details
Model Description
- Developed by: Enes Yılandiloğlu
- Shared by: Enes Yılandiloğlu
- Model type: token classification
- Language(s) (NLP): Ottoman Turkish (1500-1928)
- License: cc-by-nc-4.0
- Finetuned from model: FacebookAI/xlm-roberta-base
Uses
The model can be used to jointly annotate Ottoman Turkish sentences for lemma, UPOS, XPOS, morphological features, and dependency.
Bias, Risks, and Limitations
Due to the wide variation in Ottoman Turkish language use, the model may fail to annotate some sentences correctly, such as those consisting mostly of Arabic praise formulas.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. Its output is not ground truth and should be checked manually.
How to Get Started with the Model
Use the command below to get started with the model.

```bash
python3 predict.py enesyila/ota-ud-style/model.pt data.conll prediction.out --device 0
```

See further instructions at https://github.com/machamp-nlp/machamp. It is also possible to extract the model as a plain transformer model, as explained on the MaChAmp GitHub page.
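As a minimal sketch (not part of the official MaChAmp workflow), the snippet below builds a `data.conll` input file from pre-tokenized sentences, assuming the standard 10-column CoNLL-U layout with only the ID and FORM columns filled and `_` placeholders elsewhere; adjust it to whatever input format your MaChAmp version expects.

```python
# Sketch: write pre-tokenized sentences to a CoNLL-U file for prediction.
# Assumes the standard 10-column CoNLL-U layout; columns other than ID and
# FORM are left as "_" placeholders to be filled in by the model.

def write_conllu(sentences, path):
    """sentences: list of lists of tokens (already split)."""
    with open(path, "w", encoding="utf-8") as f:
        for tokens in sentences:
            for i, form in enumerate(tokens, start=1):
                # ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
                f.write("\t".join([str(i), form] + ["_"] * 8) + "\n")
            f.write("\n")  # a blank line ends a sentence

write_conllu([["örnek", "cümle", "."]], "data.conll")
```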
Training Details
Training Data
The model was trained on the UD_Ottoman_Turkish-DUDU Universal Dependencies treebank.
The dataset contains morphologically annotated Ottoman Turkish text in CoNLL-U format, including UPOS, XPOS, morphological features, lemmas, and syntactic dependencies.
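For readers unfamiliar with the format, the sketch below shows how the annotation layers map to CoNLL-U columns (field positions follow the Universal Dependencies specification; this is illustrative and not part of the treebank's tooling).

```python
# Sketch: read a CoNLL-U file and collect the annotation layers used by this
# model (lemma, UPOS, XPOS, morphological features, head, dependency relation).
def read_conllu(path):
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                            # blank line ends a sentence
                if current:
                    sentences.append(current)
                    current = []
                continue
            if line.startswith("#"):                # sentence-level comments
                continue
            cols = line.split("\t")
            if "-" in cols[0] or "." in cols[0]:    # skip multiword/empty tokens
                continue
            current.append({
                "form": cols[1], "lemma": cols[2], "upos": cols[3],
                "xpos": cols[4], "feats": cols[5], "head": cols[6],
                "deprel": cols[7],
            })
    if current:
        sentences.append(current)
    return sentences
```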
Training Procedure
Preprocessing
- Input format: CoNLL-U, tokenized and morphologically annotated.
- No additional normalization beyond dataset defaults.
- Tokenization handled by the underlying enesyila/ota-roberta-base SentencePiece tokenizer.
- Max input length: 128 subword tokens.
- Tokens were pre-split for sequence labeling tasks (`tok.pre_split=true` in config); see the sketch after this list.
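The following sketch illustrates how the pre-split and max-length settings interact, using the Hugging Face `transformers` tokenizer API; it is not MaChAmp's internal preprocessing code.

```python
# Sketch: tokenize pre-split tokens with the enesyila/ota-roberta-base
# SentencePiece tokenizer, truncating to 128 subword pieces as in training.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("enesyila/ota-roberta-base")
tokens = ["örnek", "cümle", "."]          # pre-split tokens (tok.pre_split=true)
encoding = tokenizer(
    tokens,
    is_split_into_words=True,             # keep the given token boundaries
    truncation=True,
    max_length=128,                       # max input length used in training
)
print(encoding.word_ids())                # map subwords back to original tokens
```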
Training Hyperparameters
- Base transformer model: `enesyila/ota-roberta-base`
- Encoder dropout: 0.2
- Update encoder weights: Yes
- Random seed: 8446
- Epochs: 60 (best epoch: 59)
- Batch size: 8
- Max tokens per batch: 1024
- Optimizer: AdamW (`betas=(0.9, 0.99)`, `lr=2e-5`, `weight_decay=0.01`); see the sketch after this list
- Learning rate schedule:
  - Discriminative fine-tuning: enabled
  - Gradual unfreezing: enabled
  - `cut_frac=0.3`, `decay_factor=0.38`
- Loss weights per task:
- Dependency parsing: 1.0 (LAS metric)
- Lemmatization: 0.8 (accuracy metric)
- Morphological features: 1.0 (accuracy metric)
- UPOS: 0.5 (accuracy metric)
- XPOS: 0.5 (accuracy metric)
- Layers used for decoding: last 3 layers (`[-1, -2, -3]`)
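To make the optimizer settings concrete, here is a minimal PyTorch sketch of AdamW with the listed hyperparameters and layer-wise (discriminative) learning-rate decay using `decay_factor=0.38`. MaChAmp configures this through its own config files, so this only illustrates the idea, not the exact implementation.

```python
# Sketch: AdamW with the listed hyperparameters plus layer-wise learning-rate
# decay, illustrating discriminative fine-tuning (decay applied per layer).
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("enesyila/ota-roberta-base")
base_lr, decay_factor = 2e-5, 0.38

# Lower layers get smaller learning rates: lr * decay_factor ** distance_from_top.
num_layers = model.config.num_hidden_layers
param_groups = []
for i, layer in enumerate(model.encoder.layer):
    lr = base_lr * decay_factor ** (num_layers - 1 - i)
    param_groups.append({"params": layer.parameters(), "lr": lr})
param_groups.append({"params": model.embeddings.parameters(),
                     "lr": base_lr * decay_factor ** num_layers})

optimizer = torch.optim.AdamW(param_groups, lr=base_lr,
                              betas=(0.9, 0.99), weight_decay=0.01)
```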
Speeds, Sizes, Times
- Max GPU memory used: 5.67 GB
- CPU RAM usage: ~2.14 GB
- Average epoch time: ~28 seconds
- Total training time: ~29 minutes 31 seconds
- Final checkpoint size: ~1.06 GB (`model.safetensors`)
Evaluation
| Task | Dev Score |
|---|---|
| Dependency (LAS) | 0.6370 |
| Lemma | 0.8618 |
| Morph | 0.7689 |
| UPOS | 0.9145 |
| XPOS | 0.8984 |
Metrics
The model was evaluated on the UD_Ottoman_Turkish-DUDU dev set using task-appropriate metrics for each annotation type (a computation sketch follows the list):
- Labeled Attachment Score (LAS) for dependency parsing — measures the percentage of tokens that are assigned both the correct head and dependency relation label.
- Accuracy for UPOS (Universal POS) tagging — proportion of tokens with correctly predicted universal POS tags.
- Accuracy for XPOS (Language-specific POS) tagging — proportion of tokens with correctly predicted language-specific POS tags.
- Accuracy for Morphological Features — proportion of tokens with all morphological features predicted exactly.
- Accuracy for Lemmatization — proportion of tokens with the correct base form predicted.
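As a sanity-check sketch, LAS and the per-column accuracies can be computed from aligned gold and predicted CoNLL-U files as below, using the `read_conllu()` helper sketched earlier and assuming identical tokenization. This mirrors the metric definitions above rather than MaChAmp's own evaluation code.

```python
# Sketch: compute LAS and token-level accuracies from aligned gold/predicted
# sentences as returned by read_conllu() above (identical tokenization assumed).
def evaluate(gold_sents, pred_sents):
    total = las = 0
    correct = {"lemma": 0, "upos": 0, "xpos": 0, "feats": 0}
    for gold, pred in zip(gold_sents, pred_sents):
        for g, p in zip(gold, pred):
            total += 1
            if g["head"] == p["head"] and g["deprel"] == p["deprel"]:
                las += 1                      # correct head AND relation label
            for key in correct:
                if g[key] == p[key]:
                    correct[key] += 1
    scores = {k: v / total for k, v in correct.items()}
    scores["las"] = las / total
    return scores
```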
Citation
Yılandiloğlu, E., & Siewert, J. (2025). DUDU: A Treebank for Ottoman Turkish in UD Style. In Š. A. Holdt, N. Ilinykh, B. Scalvini, M. Bruton, I. N. Debess, & C. M. Tudor (Eds.), The Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL 2025) (pp. 74–79). University of Tartu Library.
Model Card Authors
Enes Yılandiloğlu