Fine-tuned XLM-RoBERTa for Multilingual Language Detection

This model is a fine-tuned version of FacebookAI/xlm-roberta-base for language detection, trained on a custom multilingual dataset.

Model Description

XLM-RoBERTa is a transformer-based model pre-trained on 100+ languages using a masked language modeling (MLM) objective. This fine-tuned version adapts the base model to classify short text snippets into one of 36 languages.
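As a quick-start sketch, the checkpoint can be loaded through the 🤗 Transformers `pipeline` API. The repo id below is assumed from this page, and the exact label names returned depend on the model's config:

```python
from transformers import pipeline

# Assumed repo id for this checkpoint (taken from this page).
MODEL_ID = "minhleduc/xlm-roberta-multilang-finetuned-00"

def detect_language(texts):
    """Classify each input text into one of the model's 36 language labels."""
    classifier = pipeline("text-classification", model=MODEL_ID)
    return classifier(texts)

# Usage (downloads the checkpoint on first call):
# detect_language(["Bonjour, comment allez-vous ?"])
```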

Intended Uses & Limitations

Use Cases:

  • Language detection in multilingual NLP pipelines
  • Preprocessing step for multilingual text classification or translation systems
  • Educational use for understanding fine-tuning workflows

Limitations:

  • Designed for sentence-level or short-paragraph inputs; may degrade on long-form text
  • Trained on a limited language set (36 classes); may not generalize beyond that
  • Not suitable for low-resource or unseen languages

Training and Evaluation Data

The model was trained on a balanced multilingual dataset containing short text snippets, each labeled with one of 36 language classes. The dataset was manually curated and tokenized using Hugging Face's tokenizer utilities.

A separate validation set was used to monitor F1, accuracy, precision, and recall during training.

Training Procedure

Hyperparameters

The following hyperparameters were used during fine-tuning:

  • Learning rate: 2e-5
  • Train batch size: 40
  • Eval batch size: 40
  • Number of epochs: 3
  • Weight decay: 0.01
  • Seed: 42
  • Optimizer: AdamW (betas=(0.9, 0.999), epsilon=1e-08)
  • Learning rate scheduler: Linear

Evaluation Metrics

Model performance was evaluated using:

  • Accuracy
  • F1 Score (macro average)
  • Precision (macro)
  • Recall (macro)
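Macro averaging weights each of the 36 classes equally regardless of how many examples it has, which is why it is a common choice for balanced multi-class datasets. A minimal, dependency-free sketch of how these metrics are computed:

```python
def macro_metrics(y_true, y_pred, labels):
    """Accuracy plus macro-averaged precision/recall/F1 over the given labels."""
    per_class = []
    for lab in labels:
        # Per-class counts: true positives, false positives, false negatives.
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_class.append((prec, rec, f1))
    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "precision_macro": sum(p for p, _, _ in per_class) / n,
        "recall_macro": sum(r for _, r, _ in per_class) / n,
        "f1_macro": sum(f for _, _, f in per_class) / n,
    }
```

In practice these numbers would typically come from `sklearn.metrics` or the `evaluate` library with `average="macro"`; the sketch just makes the definitions explicit.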

Framework Versions

  • 🤗 Transformers: v4.52.4
  • PyTorch: v2.6.0+cu124
  • Datasets: v3.6.0
  • Tokenizers: v0.21.2
Model Size

  • ~0.3B parameters (FP32, stored in Safetensors format)

Model Tree

  • minhleduc/xlm-roberta-multilang-finetuned-00, fine-tuned from FacebookAI/xlm-roberta-base