Fine-tuned XLM-RoBERTa for Multilingual Language Detection
This model is a fine-tuned version of FacebookAI/xlm-roberta-base on a custom multilingual dataset for the task of language detection.
Model Description
XLM-RoBERTa is a transformer-based model pre-trained on 100+ languages using a masked language modeling (MLM) objective. This fine-tuned version adapts the base model to classify short text snippets into one of 36 languages.
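For a quick illustration, the checkpoint can be loaded with the standard Transformers `text-classification` pipeline. This is a minimal sketch; the example sentence and the printed label format are illustrative, and the actual label names depend on the training configuration.

```python
from transformers import pipeline

# Load the fine-tuned checkpoint from the Hub
detector = pipeline(
    "text-classification",
    model="minhleduc/xlm-roberta-multilang-finetuned-00",
)

# Predict the language of a short text snippet
print(detector("Bonjour, comment allez-vous ?"))
# e.g. [{'label': 'fr', 'score': 0.99}] -- label names depend on the model's config
```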
Intended Uses & Limitations
Use Cases:
- Language detection in multilingual NLP pipelines
- Preprocessing step for multilingual text classification or translation systems
- Educational use for understanding fine-tuning workflows
Limitations:
- Designed for sentence-level or short-paragraph inputs; may degrade on long-form text
- Trained on a limited language set (36 classes); may not generalize beyond that
- Not suitable for low-resource or unseen languages
Training and Evaluation Data
The model was trained on a balanced multilingual dataset containing short text snippets, each labeled with one of 36 language classes. The dataset was manually curated and tokenized using Hugging Face's tokenizer utilities.
A separate validation set was used to monitor F1, accuracy, precision, and recall during training.
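The exact preprocessing script is not published here, but a minimal sketch of the tokenization step looks like the following. The toy data, column names, and label ids are illustrative stand-ins, not taken from the actual dataset.

```python
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-base")

# Toy stand-in for the curated dataset; the real data and label mapping are not published
dataset = Dataset.from_dict({
    "text": ["Hello world", "Hola mundo", "Bonjour le monde"],
    "label": [0, 1, 2],
})

def tokenize(batch):
    # Inputs are short snippets, so truncation to the model's max length is sufficient
    return tokenizer(batch["text"], truncation=True)

tokenized = dataset.map(tokenize, batched=True)
```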
Training Procedure
Hyperparameters
The following hyperparameters were used during fine-tuning:
- Learning rate:
2e-5 - Train batch size:
40 - Eval batch size:
40 - Number of epochs:
3 - Weight decay:
0.01 - Seed:
42 - Optimizer: AdamW (
betas=(0.9, 0.999),epsilon=1e-08) - Learning rate scheduler: Linear
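The training script itself is not included in this card, but assuming the 🤗 Trainer API was used, these settings map onto `TrainingArguments` roughly as follows. The output directory is illustrative; AdamW with the listed betas/epsilon and the linear schedule are the Trainer defaults.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlm-roberta-multilang-finetuned",  # illustrative path
    learning_rate=2e-5,
    per_device_train_batch_size=40,
    per_device_eval_batch_size=40,
    num_train_epochs=3,
    weight_decay=0.01,
    seed=42,
    # AdamW with betas=(0.9, 0.999), epsilon=1e-08 and a linear LR schedule
    # are the Trainer defaults, matching the values listed above.
    lr_scheduler_type="linear",
)
```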
Evaluation Metrics
Model performance was evaluated using:
- Accuracy
- F1 Score (macro average)
- Precision (macro)
- Recall (macro)
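A typical way to compute these macro-averaged metrics in a Trainer `compute_metrics` callback uses scikit-learn; this is a generic sketch, not the exact evaluation code used for this model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }
```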
Framework Versions
- 🤗 Transformers: v4.52.4
- PyTorch: v2.6.0+cu124
- Datasets: v3.6.0
- Tokenizers: v0.21.2