Fine-tuned XLM-RoBERTa for Multilingual Language Detection

This model is a fine-tuned version of FacebookAI/xlm-roberta-base for language detection, trained on a custom multilingual dataset.

Model Description

XLM-RoBERTa is a transformer-based model pre-trained on 100+ languages using a masked language modeling (MLM) objective. This fine-tuned version adapts the base model to classify short text snippets into one of 36 languages.
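As a quick-start sketch, the checkpoint can be loaded through the 🤗 Transformers `pipeline` API. The repo id below is assumed from this page, and the exact label names returned depend on the model's config:

```python
from transformers import pipeline

# Assumed repo id for this checkpoint (taken from this page).
MODEL_ID = "minhleduc/xlm-roberta-multilang-finetuned-00"

def detect_language(texts):
    """Classify each input text into one of the model's 36 language labels."""
    classifier = pipeline("text-classification", model=MODEL_ID)
    return classifier(texts)

# Usage (downloads the checkpoint on first call):
# detect_language(["Bonjour, comment allez-vous ?"])
```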

Intended Uses & Limitations

Use Cases:

  • Language detection in multilingual NLP pipelines
  • Preprocessing step for multilingual text classification or translation systems
  • Educational use for understanding fine-tuning workflows

Limitations:

  • Designed for sentence-level or short-paragraph inputs; may degrade on long-form text
  • Trained on a limited language set (36 classes); may not generalize beyond that
  • Not suitable for low-resource or unseen languages

Training and Evaluation Data

The model was trained on a balanced multilingual dataset containing short text snippets, each labeled with one of 36 language classes. The dataset was manually curated and tokenized using Hugging Face's tokenizer utilities.

A separate validation set was used to monitor F1, accuracy, precision, and recall during training.

Training Procedure

Hyperparameters

The following hyperparameters were used during fine-tuning:

  • Learning rate: 2e-5
  • Train batch size: 40
  • Eval batch size: 40
  • Number of epochs: 3
  • Weight decay: 0.01
  • Seed: 42
  • Optimizer: AdamW (betas=(0.9, 0.999), epsilon=1e-08)
  • Learning rate scheduler: Linear

Evaluation Metrics

Model performance was evaluated using:

  • Accuracy
  • F1 Score (macro average)
  • Precision (macro)
  • Recall (macro)
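Macro averaging weights each of the 36 classes equally regardless of how many examples it has, which is why it is a common choice for balanced multi-class datasets. A minimal, dependency-free sketch of how these metrics are computed:

```python
def macro_metrics(y_true, y_pred, labels):
    """Accuracy plus macro-averaged precision/recall/F1 over the given labels."""
    per_class = []
    for lab in labels:
        # Per-class counts: true positives, false positives, false negatives.
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_class.append((prec, rec, f1))
    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "precision_macro": sum(p for p, _, _ in per_class) / n,
        "recall_macro": sum(r for _, r, _ in per_class) / n,
        "f1_macro": sum(f for _, _, f in per_class) / n,
    }
```

In practice these numbers would typically come from `sklearn.metrics` or the `evaluate` library with `average="macro"`; the sketch just makes the definitions explicit.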

Framework Versions

  • 🤗 Transformers: v4.52.4
  • PyTorch: v2.6.0+cu124
  • Datasets: v3.6.0
  • Tokenizers: v0.21.2
Model Size

  • ~0.3B parameters (FP32, stored in Safetensors format)

Model Tree

  • minhleduc/xlm-roberta-multilang-finetuned-00, fine-tuned from FacebookAI/xlm-roberta-base