Model Card for enesyila/ota-ud-style

This model performs UD-style annotation of Ottoman Turkish sentences, jointly predicting lemmas, UPOS and XPOS tags, morphological features, and syntactic dependencies. It was trained with the MaChAmp multi-task architecture (van der Goot et al., 2021).

Model Details

Model Description

  • Developed by: Enes Yılandiloğlu
  • Shared by: Enes Yılandiloğlu
  • Model type: token classification
  • Language(s) (NLP): Ottoman Turkish (1500-1928)
  • License: cc-by-nc-4.0
  • Finetuned from model: FacebookAI/xlm-roberta-base

Uses

The model jointly annotates Ottoman Turkish sentences for lemmas, UPOS and XPOS tags, morphological features, and syntactic dependencies.

Bias, Risks, and Limitations

Due to the wide variation in Ottoman Turkish language use, the model may fail to annotate some sentences correctly, for example those consisting mostly of Arabic praise formulas.

Recommendations

Users (both direct and downstream) should be made aware of the model's risks, biases, and limitations. Model output is not guaranteed to be correct and should be checked manually.

How to Get Started with the Model

Use the command below to get started with the model:

```bash
python3 predict.py enesyila/ota-ud-style/model.pt data.conll prediction.out --device 0
```

See https://github.com/machamp-nlp/machamp for further instructions. It is also possible to extract the underlying transformer model from the MaChAmp checkpoint, as explained on the same GitHub page.
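
Once extracted, the encoder can be loaded like any other Transformers checkpoint. A minimal sketch, assuming the extracted model was written to a local directory (the path below is a placeholder, not a published repo id):

```python
from transformers import AutoModel, AutoTokenizer

# "exported-ota-ud-style" is a hypothetical local directory holding the
# encoder extracted from the MaChAmp checkpoint, not a published repo id.
tokenizer = AutoTokenizer.from_pretrained("exported-ota-ud-style")
model = AutoModel.from_pretrained("exported-ota-ud-style")
```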

Training Details

Training Data

The model was trained on the UD_Ottoman_Turkish-DUDU Universal Dependencies treebank.
The dataset contains morphologically annotated Ottoman Turkish text in CoNLL-U format, including UPOS, XPOS, morphological features, lemmas, and syntactic dependencies.
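
For reference, CoNLL-U stores one token per line in ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The snippet below is a schematic illustration of the format with placeholder values, not a sentence from the treebank:

```text
# sent_id = example-1
# text = <sentence text>
1	<form>	<lemma>	NOUN	<xpos>	Case=Nom|Number=Sing	2	nsubj	_	_
2	<form>	<lemma>	VERB	<xpos>	Polarity=Pos|Tense=Past	0	root	_	_
```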

Training Procedure

Preprocessing

  • Input format: CoNLL-U, tokenized and morphologically annotated.
  • No additional normalization beyond dataset defaults.
  • Tokenization handled by the underlying enesyila/ota-roberta-base SentencePiece tokenizer.
  • Max input length: 128 subword tokens.
  • Tokens were pre-split for sequence labeling tasks (tok.pre_split=true in config).
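
A minimal sketch of what a MaChAmp dataset configuration for this task mix might look like, assuming standard CoNLL-U column indices. The dataset name, file paths, and the per-task loss_weight values shown here are illustrative (the pre_split option is omitted); consult the MaChAmp documentation for the authoritative schema:

```json
{
    "OTA_DUDU": {
        "train_data_path": "data/ota-train.conllu",
        "dev_data_path": "data/ota-dev.conllu",
        "word_idx": 1,
        "tasks": {
            "lemma": {"task_type": "string2string", "column_idx": 2, "loss_weight": 0.8},
            "upos": {"task_type": "seq", "column_idx": 3, "loss_weight": 0.5},
            "xpos": {"task_type": "seq", "column_idx": 4, "loss_weight": 0.5},
            "feats": {"task_type": "seq", "column_idx": 5},
            "dependency": {"task_type": "dependency", "column_idx": 6}
        }
    }
}
```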

Training Hyperparameters

  • Base transformer model: enesyila/ota-roberta-base
  • Encoder dropout: 0.2
  • Update encoder weights: Yes
  • Random seed: 8446
  • Epochs: 60 (best epoch: 59)
  • Batch size: 8
  • Max tokens per batch: 1024
  • Optimizer: AdamW (betas=(0.9, 0.99), lr=2e-5, weight_decay=0.01); see the sketch after this list
  • Learning rate schedule:
    • Discriminative fine-tuning: enabled
    • Gradual unfreezing: enabled
    • cut_frac=0.3, decay_factor=0.38
  • Loss weights per task:
    • Dependency parsing: 1.0 (LAS metric)
    • Lemmatization: 0.8 (accuracy metric)
    • Morphological features: 1.0 (accuracy metric)
    • UPOS: 0.5 (accuracy metric)
    • XPOS: 0.5 (accuracy metric)
  • Layers used for decoding: last 3 layers [-1, -2, -3]
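
MaChAmp applies all of the above internally from its configuration files, so no extra code is needed to reproduce them. Purely as an illustration of how discriminative fine-tuning (per-layer learning-rate decay with decay_factor=0.38) and per-task loss weights are typically combined, here is a hedged Python sketch outside MaChAmp; it assumes an XLM-R-style encoder, and task_losses is a hypothetical mapping:

```python
import torch
from transformers import AutoModel

# Illustrative only: MaChAmp handles this internally via its config files.
encoder = AutoModel.from_pretrained("enesyila/ota-roberta-base")
base_lr, decay_factor = 2e-5, 0.38

# Discriminative fine-tuning: the top layer keeps the base learning rate and
# each layer below it is scaled down by decay_factor (embeddings omitted).
param_groups = []
for depth, layer in enumerate(reversed(list(encoder.encoder.layer))):
    param_groups.append({"params": layer.parameters(),
                         "lr": base_lr * decay_factor ** depth})

optimizer = torch.optim.AdamW(param_groups, lr=base_lr,
                              betas=(0.9, 0.99), weight_decay=0.01)

# Per-task loss weighting as listed above; task_losses is a hypothetical
# mapping from task name to that task's loss tensor for the current batch.
LOSS_WEIGHTS = {"dependency": 1.0, "lemma": 0.8, "feats": 1.0,
                "upos": 0.5, "xpos": 0.5}

def total_loss(task_losses):
    return sum(LOSS_WEIGHTS[name] * loss for name, loss in task_losses.items())
```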

Speeds, Sizes, Times

  • Max GPU memory used: 5.67 GB
  • CPU RAM usage: ~2.14 GB
  • Average epoch time: ~28 seconds
  • Total training time: ~29 minutes 31 seconds
  • Final checkpoint size: ~1.06 GB (model.safetensors)

Evaluation

Task                   | Dev accuracy
-----------------------|-------------
Dependency (LAS)       | 0.6370
Lemma                  | 0.8618
Morphological features | 0.7689
UPOS                   | 0.9145
XPOS                   | 0.8984

Metrics

The model was evaluated on the UD_Ottoman_Turkish-DUDU dev set using task-appropriate metrics for each annotation type:

  • Labeled Attachment Score (LAS) for dependency parsing — measures the percentage of tokens that are assigned both the correct head and dependency relation label (see the sketch after this list).
  • Accuracy for UPOS (Universal POS) tagging — proportion of tokens with correctly predicted universal POS tags.
  • Accuracy for XPOS (Language-specific POS) tagging — proportion of tokens with correctly predicted language-specific POS tags.
  • Accuracy for Morphological Features — proportion of tokens with all morphological features predicted exactly.
  • Accuracy for Lemmatization — proportion of tokens with the correct base form predicted.
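
As a concrete illustration of LAS, a minimal computation over parallel gold and predicted analyses might look like the sketch below. This is schematic only; the official CoNLL 2018 shared task script (conll18_ud_eval.py) additionally handles token alignment and multiword tokens:

```python
def las(gold, pred):
    """gold and pred: parallel lists of (head, deprel) pairs, one per token."""
    assert len(gold) == len(pred)
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

# A token counts only if both its head index and its relation label match:
print(las([(2, "nsubj"), (0, "root")], [(2, "nsubj"), (2, "obj")]))  # 0.5
```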

Citation

Yılandiloğlu, E., & Siewert, J. (2025). DUDU: A Treebank for Ottoman Turkish in UD Style. In Š. A. Holdt, N. Ilinykh, B. Scalvini, M. Bruton, I. N. Debess, & C. M. Tudor (Eds.), Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL 2025) (pp. 74–79). University of Tartu Library.

van der Goot, R., Üstün, A., Ramponi, A., Sharaf, I., & Plank, B. (2021). Massive Choice, Ample Tasks (MaChAmp): A Toolkit for Multi-task Learning in NLP. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations (pp. 176–197). Association for Computational Linguistics.

Model Card Authors

Enes Yılandiloğlu

Model Card Contact

[email protected]
