Meitei-RoBERTa-Base (Monolingual, Bengali Script)

The Meitei-RoBERTa-Base model is a high-performance, monolingual transformer encoder pre-trained from scratch on the entire Meitei Monolingual Corpus (MWirelabs/meitei-monolingual-corpus). It is developed using the RoBERTa training methodology, establishing a foundational language representation for Meitei (Manipuri) in Bengali script.

This model serves as a robust backbone for accelerating downstream NLP tasks such as Named Entity Recognition (NER), Sentiment Analysis, and Text Classification in Meitei.

Model Architecture & Details

The architecture follows the standard RoBERTa Base configuration, balancing representation quality against computational cost.

Configuration

| Component | Detail | Specification |
|---|---|---|
| Architecture | RoBERTa Base Encoder | 12 layers, 12 attention heads |
| Hidden Dimension | 768 | Standard Base size |
| Total Parameters | ~125,000,000 | 125 million |
| Max Context Length | 512 tokens | Maximum sequence length (optimized for GPU memory) |
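For reference, this specification maps onto a Hugging Face `RobertaConfig` roughly as sketched below. This is illustrative only: the `intermediate_size` and the two extra position embeddings RoBERTa reserves are standard Base defaults, not values stated in this card.

```python
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=52_000,            # matches the custom tokenizer described below
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,       # standard Base feed-forward size (assumed)
    max_position_embeddings=514,  # 512 usable positions + 2 reserved by RoBERTa
)

model = RobertaForMaskedLM(config)
print(f"Parameters: {model.num_parameters():,}")  # roughly 125M with a 52k vocabulary
```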

Tokenizer Details

| Component | Detail | Specification |
|---|---|---|
| Tokenizer Type | Byte-Level Byte Pair Encoding (BPE) | Robust handling of the complex morphology and unseen characters found in Indic scripts. |
| Vocabulary Size | 52,000 tokens | Custom-trained on the corpus for subword efficiency and a low Out-of-Vocabulary (OOV) rate. |
| Special Tokens | `<s>`, `</s>`, `<unk>`, `<pad>`, `<mask>` | RoBERTa standard. |
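A tokenizer with these specifications could be reproduced along the following lines. This is a hedged sketch using the `tokenizers` library; the corpus file path and the `min_frequency` cutoff are placeholders, not values from this card.

```python
from tokenizers import ByteLevelBPETokenizer

# Placeholder path: plain-text Meitei corpus, one document per line.
corpus_files = ["meitei_corpus.txt"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=corpus_files,
    vocab_size=52_000,
    min_frequency=2,  # assumed cutoff, not stated in this card
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("meitei-roberta-tokenizer")  # writes vocab.json and merges.txt
```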

Pre-training Details

The model was trained from a randomly initialized state on the full corpus using the Masked Language Modeling (MLM) objective.

Training Parameters

| Parameter | Value | Rationale |
|---|---|---|
| Training Corpus | MWirelabs/meitei-monolingual-corpus (train split) | High-quality corpus of an estimated 76M+ words. |
| Training Task | Masked Language Modeling (MLM) | RoBERTa's core objective (15% mask probability). |
| Data Size (Chunks) | 353,123 blocks of 512 tokens | Full utilization of the available corpus data. |
| Effective Batch Size | 256 | Optimized for high-throughput pre-training. |
| Learning Rate | 6e-4 | Standard aggressive rate with linear decay and warmup. |
| Total Epochs | 3 | Full training run until convergence. |
| Final Training Loss | 4.185500 | Confirms successful learning (significant reduction from the initial random-initialization loss). |
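The table translates into a `Trainer`-based MLM setup roughly as sketched below. This is illustrative only: it reuses the randomly initialized `model` from the configuration sketch above, assumes a pre-chunked `lm_dataset` of 512-token blocks, and the per-device batch size, accumulation steps, and warmup fraction are assumptions chosen to match the stated effective batch size of 256.

```python
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Wrap the BPE files produced by the tokenizer sketch as a transformers tokenizer.
tokenizer = RobertaTokenizerFast.from_pretrained(
    "meitei-roberta-tokenizer", model_max_length=512
)

# 15% dynamic masking, as in the MLM objective described above.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="meitei-roberta",
    per_device_train_batch_size=32,   # assumed split; 32 x 8 accumulation steps = 256 effective
    gradient_accumulation_steps=8,
    learning_rate=6e-4,
    lr_scheduler_type="linear",
    warmup_ratio=0.06,                # assumed warmup fraction
    num_train_epochs=3,
    logging_steps=500,
)

trainer = Trainer(
    model=model,                      # randomly initialized RobertaForMaskedLM
    args=training_args,
    train_dataset=lm_dataset,         # the 353,123 pre-chunked blocks of 512 tokens
    data_collator=data_collator,
)
trainer.train()
```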

Training Metrics & Loss Curve

The model demonstrates strong convergence, successfully learning the grammatical and semantic structure of Meitei.

The full log history, including training loss and learning rate evolution, is available in the repository for detailed analysis:

  • Log File: training_loss_history.csv

  • Metric: Training Loss (loss) against Training Step (step).
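Assuming the CSV exposes the `step` and `loss` columns named above, the loss curve can be reproduced with a few lines of pandas/matplotlib:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumes the CSV contains the `step` and `loss` columns named above.
history = pd.read_csv("training_loss_history.csv")

plt.plot(history["step"], history["loss"])
plt.xlabel("Training Step")
plt.ylabel("Training Loss")
plt.title("Meitei-RoBERTa pre-training loss")
plt.show()
```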


Evaluation Metrics

| Metric | Value | Description |
|---|---|---|
| Final Training Loss | 4.185500 | Loss recorded at the final training step. |
| Perplexity (PPL) | 65.89 | PPL on a held-out validation set, indicating strong language fluency. |
| PPL vs. Baselines | ≈5.4× lower | This model (PPL 65.89) achieves roughly 5.4× lower perplexity on Meitei text than MuRIL (PPL 355.65), demonstrating the value of custom pre-training. |

Comparative Performance

| Model | Evaluation Loss | Perplexity (PPL) |
|---|---|---|
| Meitei-RoBERTa (Custom) | 4.1880 | 65.89 |
| mBERT (Baseline) | 5.8335 | 341.56 |
| MuRIL (Baseline) | 5.8740 | 355.65 |
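The perplexity column follows directly from the evaluation loss via PPL = exp(loss), which makes the comparison easy to verify:

```python
import math

eval_losses = {
    "Meitei-RoBERTa (Custom)": 4.1880,
    "mBERT (Baseline)": 5.8335,
    "MuRIL (Baseline)": 5.8740,
}

for name, loss in eval_losses.items():
    # Perplexity is the exponential of the mean cross-entropy loss.
    print(f"{name}: PPL = {math.exp(loss):.2f}")
# Reproduces the table above to within rounding of the reported losses
# (~65.89, ~341.55, ~355.67).
```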


💡 How to Use (For Inference and Fine-tuning)

This model can be loaded directly into any Hugging Face pipeline or used as the encoder in a custom model for fine-tuning.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# Using the lowercase repository ID for robust loading
REPO_ID = "MWirelabs/meitei-roberta"

# 1. Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
model = AutoModelForMaskedLM.from_pretrained(REPO_ID)

# 2. Example: tokenize text
# NOTE: Input text MUST be in the Bengali script, as the model was trained only on this script.
meitei_text = "আমি গতকাল স্কুল থেকে ফিরেছি। এই বইটি পড়তে ভাল লাগে।"
inputs = tokenizer(meitei_text, return_tensors="pt")

# 3. Example: fill-mask pipeline test (testing fluency)
unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)
# Masking a word in Bengali-script Meitei
results = unmasker("আমাদের দেশে <mask> অনেক সমস্যা আছে।")
# ... process results
```
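For downstream fine-tuning, the same checkpoint can be loaded with a task-specific head. The snippet below is a sketch for a hypothetical three-class sentiment task; the task and `num_labels` are illustrative, not part of this release.

```python
from transformers import AutoModelForSequenceClassification

# Hypothetical 3-class sentiment task; num_labels depends on your dataset.
classifier = AutoModelForSequenceClassification.from_pretrained(
    "MWirelabs/meitei-roberta", num_labels=3
)
# The classification head is randomly initialized and must be fine-tuned
# on labelled Meitei data before it produces meaningful predictions.
```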

Intended Use

This model is primarily intended for research purposes and as a foundational encoder for Meitei NLP downstream tasks, including:

  • Fine-tuning on sequence classification, token classification (NER), and summarization.

  • Feature extraction to generate high-quality Meitei text embeddings (see the sketch below).

The model is not intended for deployment in applications that require safety-critical decision-making without further domain-specific fine-tuning and validation.
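A minimal sketch of feature extraction via mean pooling over the encoder's last hidden state (mean pooling is one common choice, not a method prescribed by this card):

```python
import torch
from transformers import AutoModel, AutoTokenizer

REPO_ID = "MWirelabs/meitei-roberta"
tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
encoder = AutoModel.from_pretrained(REPO_ID)

sentences = ["আমাদের দেশে অনেক সমস্যা আছে।"]  # Bengali-script input, as required
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state     # (batch, seq_len, 768)

# Mean-pool over non-padding tokens to obtain one 768-d vector per sentence.
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # torch.Size([1, 768])
```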

Limitations and Bias

  • Script Dependence: This model was trained exclusively on the Bengali-script version of Meitei and will perform poorly on Meitei text written in the Meitei Mayek (Meetei Mayek) script.

  • Monolingual Focus: The model is not suitable for cross-lingual tasks without further fine-tuning.

📚 Citation

If you use this model or the Meitei corpus in your work, please cite it as:

@misc{mwirelabs_meitei_roberta_2025,
  title        = {Meitei-RoBERTa-Base (Bengali Script) Model},
  author       = {MWire Labs},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/MWirelabs/meitei-roberta}},
  note         = {RoBERTa Base pre-trained from scratch on the Meitei Monolingual Corpus}
}

About MWire Labs

MWire Labs builds ethical, region-first AI infrastructure for Northeast India—focusing on low-resource languages and public accessibility.

Learn more at www.mwirelabs.com


Contributions & Feedback

We welcome feedback, contributions, and civic collaborations.
Reach out via Hugging Face.
