---
license: cc-by-nc-4.0
language:
- ota
datasets:
- enesyila/UD_Ottoman_Turkish-DUDU
metrics:
- accuracy
base_model:
- enesyila/ota-roberta-base
pipeline_tag: token-classification
tags:
- ottoman-turkish
- UD
- lemma
- UPOS
- XPOS
- morphology
- syntax
---

# Model Card for enesyila/ota-ud-style

This model produces UD-style annotations for Ottoman Turkish sentences: lemmas, UPOS and XPOS tags, morphological features, and dependency relations. It was trained with the MaChAmp multi-task architecture (van der Goot et al., 2021).

## Model Details

### Model Description

- **Developed by:** Enes Yılandiloğlu
- **Shared by:** Enes Yılandiloğlu
- **Model type:** token classification
- **Language(s) (NLP):** Ottoman Turkish (1500–1928)
- **License:** cc-by-nc-4.0
- **Finetuned from model:** FacebookAI/xlm-roberta-base

## Uses

The model can be used to jointly annotate Ottoman Turkish sentences for lemmas, UPOS, XPOS, morphological features, and dependency relations.

## Bias, Risks, and Limitations

Because language use in Ottoman Turkish varies widely across periods and registers, the model may fail to annotate some sentences correctly, such as those consisting mostly of Arabic praise formulae.

### Recommendations

Users (both direct and downstream) should be made aware of the model's risks, biases, and limitations. Model output is not ground truth and should be checked manually.

## How to Get Started with the Model

Use the command below to get started with the model:

```shell
python3 predict.py enesyila/ota-ud-style/model.pt data.conll prediction.out --device 0
```

See https://github.com/machamp-nlp/machamp for further instructions. It is also possible to extract the model as a plain transformer model, as explained on the MaChAmp GitHub page.

## Training Details

### Training Data

The model was trained on the [UD_Ottoman_Turkish-DUDU](https://github.com/UniversalDependencies/UD_Ottoman_Turkish-DUDU/tree/dev) Universal Dependencies treebank.
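The treebank, like the model's input and output files, uses the CoNLL-U format. As a minimal sketch, sentences can be read from such a file with the Python standard library alone; the 10-column layout and field names follow the CoNLL-U specification, and the one-token sample sentence below is invented purely for illustration:

```python
# Minimal CoNLL-U reader sketch (assumes plain 10-column CoNLL-U, as in
# UD_Ottoman_Turkish-DUDU); field names follow the CoNLL-U specification.
from io import StringIO

FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
          "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

def read_conllu(stream):
    """Yield sentences as lists of token dicts, skipping comment lines."""
    sentence = []
    for line in stream:
        line = line.rstrip("\n")
        if not line:                    # blank line terminates a sentence
            if sentence:
                yield sentence
                sentence = []
        elif not line.startswith("#"):  # skip # sent_id / # text comments
            sentence.append(dict(zip(FIELDS, line.split("\t"))))
    if sentence:                        # tolerate a missing final blank line
        yield sentence

# Invented one-token example, not taken from the treebank:
sample = ("# text = kitab\n"
          "1\tkitab\tkitab\tNOUN\t_\t_\t0\troot\t_\t_\n\n")
sents = list(read_conllu(StringIO(sample)))
```

The same reader works on `prediction.out` produced by `predict.py` above, since MaChAmp writes its predictions back out in CoNLL-U.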
The dataset contains morphologically annotated Ottoman Turkish text in CoNLL-U format, including UPOS, XPOS, morphological features, lemmas, and syntactic dependencies.

### Training Procedure

#### Preprocessing

- Input format: CoNLL-U, tokenized and morphologically annotated.
- No additional normalization beyond dataset defaults.
- Tokenization handled by the underlying `enesyila/ota-roberta-base` SentencePiece tokenizer.
- Max input length: **128** subword tokens.
- Tokens were pre-split for sequence labeling tasks (`tok.pre_split=true` in the config).

#### Training Hyperparameters

- **Base transformer model:** `enesyila/ota-roberta-base`
- **Encoder dropout:** 0.2
- **Update encoder weights:** yes
- **Random seed:** 8446
- **Epochs:** 60 (best epoch: 59)
- **Batch size:** 8
- **Max tokens per batch:** 1024
- **Optimizer:** AdamW (`betas=(0.9, 0.99)`, `lr=2e-5`, `weight_decay=0.01`)
- **Learning rate schedule:**
  - Discriminative fine-tuning: enabled
  - Gradual unfreezing: enabled
  - `cut_frac=0.3`, `decay_factor=0.38`
- **Loss weights per task:**
  - Dependency parsing: 1.0 (LAS metric)
  - Lemmatization: 0.8 (accuracy metric)
  - Morphological features: 1.0 (accuracy metric)
  - UPOS: 0.5 (accuracy metric)
  - XPOS: 0.5 (accuracy metric)
- **Layers used for decoding:** last 3 layers (`[-1, -2, -3]`)

#### Speeds, Sizes, Times

- **Max GPU memory used:** 5.67 GB
- **CPU RAM usage:** ~2.14 GB
- **Average epoch time:** ~28 seconds
- **Total training time:** ~29 minutes 31 seconds
- **Final checkpoint size:** ~1.06 GB (`model.safetensors`)

## Evaluation

| Task             | Dev Accuracy |
|------------------|--------------|
| Dependency (LAS) | 0.6370       |
| Lemma            | 0.8618       |
| Morph            | 0.7689       |
| UPOS             | 0.9145       |
| XPOS             | 0.8984       |

#### Metrics

The model was evaluated on the UD_Ottoman_Turkish-DUDU dev set using task-appropriate metrics for each annotation type:

- **Labeled Attachment Score (LAS)** for dependency parsing: the percentage of tokens that are assigned both the correct head and
dependency relation label.
- **Accuracy** for UPOS (universal POS) tagging: the proportion of tokens with correctly predicted universal POS tags.
- **Accuracy** for XPOS (language-specific POS) tagging: the proportion of tokens with correctly predicted language-specific POS tags.
- **Accuracy** for morphological features: the proportion of tokens whose full set of morphological features is predicted exactly.
- **Accuracy** for lemmatization: the proportion of tokens with the correct base form predicted.

## Citation

Yılandiloğlu, E., & Siewert, J. (2025). DUDU: A Treebank for Ottoman Turkish in UD Style. In Š. A. Holdt, N. Ilinykh, B. Scalvini, M. Bruton, I. N. Debess, & C. M. Tudor (Eds.), *The Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL 2025)* (pp. 74–79). University of Tartu Library.

## Model Card Authors

Enes Yılandiloğlu

## Model Card Contact

enes.yilandiloglu@helsinki.fi