---
license: cc-by-nc-4.0
language:
- ota
datasets:
- enesyila/UD_Ottoman_Turkish-DUDU
metrics:
- accuracy
base_model:
- enesyila/ota-roberta-base
pipeline_tag: token-classification
tags:
- ottoman-turkish
- UD
- lemma
- UPOS
- XPOS
- morphology
- syntax
---

# Model Card for enesyila/ota-ud-style

This model produces UD-style annotations for Ottoman Turkish sentences: lemmas, UPOS and XPOS tags, morphological features, and dependency relations. It was trained with the MaChAmp multi-task architecture (van der Goot et al., 2021).

## Model Details

### Model Description

- **Developed by:** Enes Yılandiloğlu
- **Shared by:** Enes Yılandiloğlu
- **Model type:** token classification
- **Language(s) (NLP):** Ottoman Turkish (1500–1928)
- **License:** cc-by-nc-4.0
- **Finetuned from model:** FacebookAI/xlm-roberta-base

## Uses

The model can be used to jointly annotate Ottoman Turkish sentences for lemmas, UPOS, XPOS, morphological features, and dependency relations.

## Bias, Risks, and Limitations

Because language use in Ottoman Turkish varies widely across periods and registers, the model may fail to annotate some sentences correctly, such as those consisting mostly of Arabic praise formulae.

### Recommendations

Users (both direct and downstream) should be made aware of the model's risks, biases, and limitations. Model output is not ground truth and should be checked manually.

## How to Get Started with the Model

Use the command below to get started with the model:

```shell
python3 predict.py enesyila/ota-ud-style/model.pt data.conll prediction.out --device 0
```

See https://github.com/machamp-nlp/machamp for further instructions. It is also possible to extract the model as a plain transformer model, as explained on the MaChAmp GitHub page.

## Training Details

### Training Data

The model was trained on the [UD_Ottoman_Turkish-DUDU](https://github.com/UniversalDependencies/UD_Ottoman_Turkish-DUDU/tree/dev) Universal Dependencies treebank.
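The treebank, like the model's input and output files, uses the CoNLL-U format. As a minimal sketch, sentences can be read from such a file with the Python standard library alone; the 10-column layout and field names follow the CoNLL-U specification, and the one-token sample sentence below is invented purely for illustration:

```python
# Minimal CoNLL-U reader sketch (assumes plain 10-column CoNLL-U, as in
# UD_Ottoman_Turkish-DUDU); field names follow the CoNLL-U specification.
from io import StringIO

FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
          "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

def read_conllu(stream):
    """Yield sentences as lists of token dicts, skipping comment lines."""
    sentence = []
    for line in stream:
        line = line.rstrip("\n")
        if not line:                    # blank line terminates a sentence
            if sentence:
                yield sentence
                sentence = []
        elif not line.startswith("#"):  # skip # sent_id / # text comments
            sentence.append(dict(zip(FIELDS, line.split("\t"))))
    if sentence:                        # tolerate a missing final blank line
        yield sentence

# Invented one-token example, not taken from the treebank:
sample = ("# text = kitab\n"
          "1\tkitab\tkitab\tNOUN\t_\t_\t0\troot\t_\t_\n\n")
sents = list(read_conllu(StringIO(sample)))
```

The same reader works on `prediction.out` produced by `predict.py` above, since MaChAmp writes its predictions back out in CoNLL-U.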
The dataset contains morphologically annotated Ottoman Turkish text in CoNLL-U format, including UPOS, XPOS, morphological features, lemmas, and syntactic dependencies.

### Training Procedure

#### Preprocessing

- Input format: CoNLL-U, tokenized and morphologically annotated.
- No additional normalization beyond dataset defaults.
- Tokenization handled by the underlying `enesyila/ota-roberta-base` SentencePiece tokenizer.
- Max input length: **128** subword tokens.
- Tokens were pre-split for sequence labeling tasks (`tok.pre_split=true` in the config).

#### Training Hyperparameters

- **Base transformer model:** `enesyila/ota-roberta-base`
- **Encoder dropout:** 0.2
- **Update encoder weights:** yes
- **Random seed:** 8446
- **Epochs:** 60 (best epoch: 59)
- **Batch size:** 8
- **Max tokens per batch:** 1024
- **Optimizer:** AdamW (`betas=(0.9, 0.99)`, `lr=2e-5`, `weight_decay=0.01`)
- **Learning rate schedule:**
  - Discriminative fine-tuning: enabled
  - Gradual unfreezing: enabled
  - `cut_frac=0.3`, `decay_factor=0.38`
- **Loss weights per task:**
  - Dependency parsing: 1.0 (LAS metric)
  - Lemmatization: 0.8 (accuracy metric)
  - Morphological features: 1.0 (accuracy metric)
  - UPOS: 0.5 (accuracy metric)
  - XPOS: 0.5 (accuracy metric)
- **Layers used for decoding:** last 3 layers (`[-1, -2, -3]`)

#### Speeds, Sizes, Times

- **Max GPU memory used:** 5.67 GB
- **CPU RAM usage:** ~2.14 GB
- **Average epoch time:** ~28 seconds
- **Total training time:** ~29 minutes 31 seconds
- **Final checkpoint size:** ~1.06 GB (`model.safetensors`)

## Evaluation

| Task             | Dev Accuracy |
|------------------|--------------|
| Dependency (LAS) | 0.6370       |
| Lemma            | 0.8618       |
| Morph            | 0.7689       |
| UPOS             | 0.9145       |
| XPOS             | 0.8984       |

#### Metrics

The model was evaluated on the UD_Ottoman_Turkish-DUDU dev set using task-appropriate metrics for each annotation type:

- **Labeled Attachment Score (LAS)** for dependency parsing: the percentage of tokens that are assigned both the correct head and
dependency relation label.
- **Accuracy** for UPOS (universal POS) tagging: the proportion of tokens with correctly predicted universal POS tags.
- **Accuracy** for XPOS (language-specific POS) tagging: the proportion of tokens with correctly predicted language-specific POS tags.
- **Accuracy** for morphological features: the proportion of tokens whose full set of morphological features is predicted exactly.
- **Accuracy** for lemmatization: the proportion of tokens with the correct base form predicted.

## Citation

Yılandiloğlu, E., & Siewert, J. (2025). DUDU: A Treebank for Ottoman Turkish in UD Style. In Š. A. Holdt, N. Ilinykh, B. Scalvini, M. Bruton, I. N. Debess, & C. M. Tudor (Eds.), *The Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL 2025)* (pp. 74–79). University of Tartu Library.

## Model Card Authors

Enes Yılandiloğlu

## Model Card Contact

enes.yilandiloglu@helsinki.fi