Meitei-RoBERTa-Base (Monolingual, Bengali Script)
The Meitei-RoBERTa-Base model is a high-performance, monolingual transformer encoder pre-trained from scratch on the entire Meitei Monolingual Corpus (MWirelabs/meitei-monolingual-corpus). It is developed using the RoBERTa training methodology, establishing a foundational language representation for Meitei (Manipuri) in Bengali script.
This model serves as a robust backbone for accelerating downstream NLP tasks such as Named Entity Recognition (NER), Sentiment Analysis, and Text Classification in Meitei.
Model Architecture & Details
The architecture follows the standard RoBERTa Base configuration, balancing representational capacity against computational cost.
Configuration
| Component | Detail | Specification |
|---|---|---|
| Architecture | RoBERTa Base Encoder | 12 Layers, 12 Attention Heads |
| Hidden Dimension | 768 | Standard Base Size |
| Total Parameters | 125,000,000 | (125 Million) |
| Max Context Length | 512 Tokens | Maximum sequence length (optimized for GPU memory) |
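For reference, the sketch below shows how a matching configuration could be declared with Hugging Face `transformers`. The `intermediate_size` and the position-embedding offset follow standard RoBERTa Base defaults and are assumptions, not values read from the released `config.json`; the vocabulary size comes from the Tokenizer Details section below.

```python
# A minimal sketch of a RoBERTa Base configuration matching the table above.
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=52_000,            # custom BPE vocabulary (see Tokenizer Details)
    num_hidden_layers=12,         # 12 transformer layers
    num_attention_heads=12,       # 12 attention heads
    hidden_size=768,              # standard base hidden dimension
    intermediate_size=3072,       # assumed RoBERTa Base feed-forward size
    max_position_embeddings=514,  # 512 tokens + 2 offset positions (RoBERTa convention)
)

model = RobertaForMaskedLM(config)  # randomly initialized, ready for pre-training
print(f"{model.num_parameters():,} parameters")  # roughly 125M with this setup
```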
Tokenizer Details
| Component | Detail | Specification |
|---|---|---|
| Tokenizer Type | Byte-Level Byte Pair Encoding (BPE) | Robust to handle complex morphology and unseen characters inherent in Indic scripts. |
| Vocabulary Size | 52,000 Tokens | Custom-trained on the corpus for optimal subword efficiency and low Out-of-Vocabulary (OOV) rate. |
| Special Tokens | <s>, </s>, <unk>, <pad>, <mask> | RoBERTa standard. |
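For illustration, a minimal sketch of how a byte-level BPE tokenizer with these settings could be trained using the `tokenizers` library; the corpus file path and `min_frequency` are hypothetical placeholders, not the exact values used for this model.

```python
# A minimal sketch, assuming plain-text corpus files.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["meitei_corpus.txt"],  # hypothetical path to the raw Meitei text
    vocab_size=52_000,            # matches the vocabulary size in the table above
    min_frequency=2,              # assumed cutoff for rare subwords
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("meitei-roberta-tokenizer")  # writes vocab.json and merges.txt
```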
Pre-training Details
The model was trained from a randomly initialized state on the full corpus using the Masked Language Modeling (MLM) objective.
Training Parameters
| Parameter | Value | Rationale |
|---|---|---|
| Training Corpus | MWirelabs/meitei-monolingual-corpus (Train Split) | High-quality, estimated 76M+ word corpus. |
| Training Task | Masked Language Modeling (MLM) | RoBERTa's core objective (15% mask probability). |
| Data Size (Chunks) | 353,123 blocks of 512 tokens | Full utilization of available corpus data. |
| Effective Batch Size | 256 | Optimized for high-throughput pre-training. |
| Learning Rate | 6e-4 | Standard aggressive rate with linear decay and warmup. |
| Total Epochs | 3 | Full training run until convergence. |
| Final Training Loss | 4.185500 | Confirms successful learning (significant reduction from initial random loss). |
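For illustration, a minimal sketch of the MLM pre-training setup implied by this table, using the Hugging Face `Trainer`. It assumes the `model` and `tokenizer` objects from the sketches above and a pre-tokenized `train_dataset` of 512-token blocks; the per-device batch size / gradient-accumulation split and the warmup length are illustrative assumptions, not the exact training-script values.

```python
# A minimal sketch of the MLM pre-training setup described in the table above.
from transformers import (
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,              # assumed: the custom BPE tokenizer from above
    mlm=True,
    mlm_probability=0.15,             # 15% mask probability, as in the table
)

training_args = TrainingArguments(
    output_dir="meitei-roberta",
    num_train_epochs=3,
    learning_rate=6e-4,
    per_device_train_batch_size=32,   # assumed split; 32 * 8 = effective batch of 256
    gradient_accumulation_steps=8,
    lr_scheduler_type="linear",
    warmup_steps=1_000,               # assumed warmup length
    logging_steps=500,
)

trainer = Trainer(
    model=model,                      # assumed: randomly initialized RoBERTa Base
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,      # assumed: tokenized 512-token blocks
)
trainer.train()
```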
Training Metrics & Loss Curve
The model demonstrates strong convergence, successfully learning the grammatical and semantic structure of Meitei.
The full log history, including training loss and learning rate evolution, is available in the repository for detailed analysis:
- Log File: training_loss_history.csv
- Metric: Training Loss (loss) against Training Step (step)
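A minimal sketch for plotting the loss curve from that log, assuming the CSV exposes the `step` and `loss` columns described above:

```python
# Plot the training loss curve from the repository's log file.
import pandas as pd
import matplotlib.pyplot as plt

history = pd.read_csv("training_loss_history.csv")
plt.plot(history["step"], history["loss"])
plt.xlabel("Training step")
plt.ylabel("Training loss")
plt.title("Meitei-RoBERTa pre-training loss")
plt.show()
```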
Evaluation Metrics
| Metric | Value | Description |
|---|---|---|
| Final Training Loss | 4.185500 | Recorded loss at the final training step. |
| Perplexity (PPL) | 65.89 | Calculated PPL on a held-out validation set, indicating strong language fluency. |
| PPL vs. Baselines | 5.4x lower | This model's perplexity (65.89) is 5.4 times lower than MuRIL's (355.65), demonstrating the value of custom pre-training. |
Comparative Performance
| Model | Evaluation Loss | Perplexity (PPL) |
|---|---|---|
| Meitei-RoBERTa (Custom) | 4.1880 | 65.89 |
| mBERT (Baseline) | 5.8335 | 341.56 |
| MuRIL (Baseline) | 5.8740 | 355.65 |
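The reported perplexities are consistent with PPL = exp(evaluation loss); the quick check below reproduces the table values and the improvement factor over MuRIL.

```python
# Perplexity as the exponential of the mean evaluation (cross-entropy) loss;
# reproduces the table values above up to rounding.
import math

for name, eval_loss in [
    ("Meitei-RoBERTa", 4.1880),
    ("mBERT", 5.8335),
    ("MuRIL", 5.8740),
]:
    print(f"{name}: PPL ≈ {math.exp(eval_loss):.2f}")

print(f"Improvement over MuRIL: {355.65 / 65.89:.1f}x lower perplexity")
```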
💡 How to Use (For Inference and Fine-tuning)
This model can be loaded directly into any Hugging Face pipeline or used as the encoder in a custom model for fine-tuning.
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# Repository ID on the Hugging Face Hub
REPO_ID = "MWirelabs/meitei-roberta"

# 1. Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
model = AutoModelForMaskedLM.from_pretrained(REPO_ID)

# 2. Example: tokenize text
# NOTE: input text must be in the Bengali script, as the model was trained only on this script.
meitei_text = "আমি গতকাল স্কুল থেকে ফিরেছি। এই বইটি পড়তে ভাল লাগে।"
inputs = tokenizer(meitei_text, return_tensors="pt")

# 3. Example: fill-mask pipeline test (testing fluency)
unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Masking a word in Bengali-script Meitei
results = unmasker("আমাদের দেশে <mask> অনেক সমস্যা আছে।")
# ... process results
```
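The `fill-mask` pipeline returns a list of dictionaries with `score`, `token_str`, and the completed `sequence`; one way to inspect the predictions:

```python
# Print the top predicted tokens for the masked position.
for prediction in results:
    print(f"{prediction['token_str']!r}  (score: {prediction['score']:.4f})")
```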
Intended Use
This model is primarily intended for research purposes and as a foundational encoder for Meitei NLP downstream tasks, including:
- Fine-tuning on sequence classification, token classification (NER), and summarization.
- Feature extraction to generate high-quality Meitei text embeddings (see the sketch below).

The model is not intended for deployment in applications that require safety-critical decision-making without further domain-specific fine-tuning and validation.
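A minimal feature-extraction sketch, using mean pooling over the last hidden state; the pooling strategy is an illustrative choice, not one prescribed by the model.

```python
# Mean-pooled sentence embeddings from the encoder's last hidden state.
import torch
from transformers import AutoModel, AutoTokenizer

REPO_ID = "MWirelabs/meitei-roberta"
tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
encoder = AutoModel.from_pretrained(REPO_ID)

sentences = ["আমি গতকাল স্কুল থেকে ফিরেছি।"]  # Bengali-script input, as required
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state          # (batch, seq_len, 768)

mask = inputs["attention_mask"].unsqueeze(-1)              # (batch, seq_len, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean over real tokens
print(embeddings.shape)                                    # torch.Size([1, 768])
```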
Limitations and Bias
- Script Dependence: This model was exclusively trained on the Bengali script version of Meitei and will perform poorly on Meitei text written in the Meitei Mayek (Meetei Mayek) script.
- Monolingual Focus: The model is not suitable for cross-lingual tasks without further fine-tuning.
📚 Citation
If you use this model or the Meitei corpus in your work, please cite it as:
```bibtex
@misc{mwirelabs_meitei_roberta_2025,
  title        = {Meitei-RoBERTa-Base (Bengali Script) Model},
  author       = {MWire Labs},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/MWirelabs/meitei-roberta}},
  note         = {RoBERTa Base pre-trained from scratch on the Meitei Monolingual Corpus}
}
```
About MWire Labs
MWire Labs builds ethical, region-first AI infrastructure for Northeast India—focusing on low-resource languages and public accessibility.
Learn more at www.mwirelabs.com
Contributions & Feedback
We welcome feedback, contributions, and civic collaborations.
Reach out via Hugging Face.