Meitei-RoBERTa-Base (Monolingual, Bengali Script)

The Meitei-RoBERTa-Base model is a high-performance, monolingual transformer encoder pre-trained from scratch on the entire Meitei Monolingual Corpus (MWirelabs/meitei-monolingual-corpus). It is developed using the RoBERTa training methodology, establishing a foundational language representation for Meitei (Manipuri) in Bengali script.

This model serves as a robust backbone for accelerating downstream NLP tasks such as Named Entity Recognition (NER), Sentiment Analysis, and Text Classification in Meitei.

Model Architecture & Details

The architecture follows the standard RoBERTa Base configuration, balancing representation quality against computational cost.

Configuration

| Component | Detail | Specification |
|---|---|---|
| Architecture | RoBERTa Base Encoder | 12 layers, 12 attention heads |
| Hidden Dimension | 768 | Standard Base size |
| Total Parameters | ~125,000,000 | 125 million |
| Max Context Length | 512 tokens | Maximum sequence length (optimized for GPU memory) |
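For reference, this specification maps onto a Hugging Face `RobertaConfig` roughly as sketched below. This is illustrative only: the `intermediate_size` and the two extra position embeddings RoBERTa reserves are standard Base defaults, not values stated in this card.

```python
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=52_000,            # matches the custom tokenizer described below
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,       # standard Base feed-forward size (assumed)
    max_position_embeddings=514,  # 512 usable positions + 2 reserved by RoBERTa
)

model = RobertaForMaskedLM(config)
print(f"Parameters: {model.num_parameters():,}")  # roughly 125M with a 52k vocabulary
```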

Tokenizer Details

| Component | Detail | Specification |
|---|---|---|
| Tokenizer Type | Byte-Level Byte Pair Encoding (BPE) | Robust handling of the complex morphology and unseen characters found in Indic scripts. |
| Vocabulary Size | 52,000 tokens | Custom-trained on the corpus for subword efficiency and a low Out-of-Vocabulary (OOV) rate. |
| Special Tokens | `<s>`, `</s>`, `<unk>`, `<pad>`, `<mask>` | RoBERTa standard. |
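A tokenizer with these specifications could be reproduced along the following lines. This is a hedged sketch using the `tokenizers` library; the corpus file path and the `min_frequency` cutoff are placeholders, not values from this card.

```python
from tokenizers import ByteLevelBPETokenizer

# Placeholder path: plain-text Meitei corpus, one document per line.
corpus_files = ["meitei_corpus.txt"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=corpus_files,
    vocab_size=52_000,
    min_frequency=2,  # assumed cutoff, not stated in this card
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("meitei-roberta-tokenizer")  # writes vocab.json and merges.txt
```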

Pre-training Details

The model was trained from a randomly initialized state on the full corpus using the Masked Language Modeling (MLM) objective.

Training Parameters

| Parameter | Value | Rationale |
|---|---|---|
| Training Corpus | MWirelabs/meitei-monolingual-corpus (train split) | High-quality corpus of an estimated 76M+ words. |
| Training Task | Masked Language Modeling (MLM) | RoBERTa's core objective (15% mask probability). |
| Data Size (Chunks) | 353,123 blocks of 512 tokens | Full utilization of the available corpus data. |
| Effective Batch Size | 256 | Optimized for high-throughput pre-training. |
| Learning Rate | 6e-4 | Standard aggressive rate with linear decay and warmup. |
| Total Epochs | 3 | Full training run until convergence. |
| Final Training Loss | 4.185500 | Confirms successful learning (significant reduction from the initial random-initialization loss). |
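The table translates into a `Trainer`-based MLM setup roughly as sketched below. This is illustrative only: it reuses the randomly initialized `model` from the configuration sketch above, assumes a pre-chunked `lm_dataset` of 512-token blocks, and the per-device batch size, accumulation steps, and warmup fraction are assumptions chosen to match the stated effective batch size of 256.

```python
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Wrap the BPE files produced by the tokenizer sketch as a transformers tokenizer.
tokenizer = RobertaTokenizerFast.from_pretrained(
    "meitei-roberta-tokenizer", model_max_length=512
)

# 15% dynamic masking, as in the MLM objective described above.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="meitei-roberta",
    per_device_train_batch_size=32,   # assumed split; 32 x 8 accumulation steps = 256 effective
    gradient_accumulation_steps=8,
    learning_rate=6e-4,
    lr_scheduler_type="linear",
    warmup_ratio=0.06,                # assumed warmup fraction
    num_train_epochs=3,
    logging_steps=500,
)

trainer = Trainer(
    model=model,                      # randomly initialized RobertaForMaskedLM
    args=training_args,
    train_dataset=lm_dataset,         # the 353,123 pre-chunked blocks of 512 tokens
    data_collator=data_collator,
)
trainer.train()
```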

Training Metrics & Loss Curve

The model demonstrates strong convergence, successfully learning the grammatical and semantic structure of Meitei.

The full log history, including training loss and learning rate evolution, is available in the repository for detailed analysis:

  • Log File: training_loss_history.csv

  • Metric: Training Loss (loss) against Training Step (step).
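Assuming the CSV exposes the `step` and `loss` columns named above, the loss curve can be reproduced with a few lines of pandas/matplotlib:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumes the CSV contains the `step` and `loss` columns named above.
history = pd.read_csv("training_loss_history.csv")

plt.plot(history["step"], history["loss"])
plt.xlabel("Training Step")
plt.ylabel("Training Loss")
plt.title("Meitei-RoBERTa pre-training loss")
plt.show()
```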


Evaluation Metrics

| Metric | Value | Description |
|---|---|---|
| Final Training Loss | 4.185500 | Loss recorded at the final training step. |
| Perplexity (PPL) | 65.89 | PPL on a held-out validation set, indicating strong language fluency. |
| PPL vs. Baselines | ≈5.4× lower | This model (PPL 65.89) achieves roughly 5.4× lower perplexity on Meitei text than MuRIL (PPL 355.65), demonstrating the value of custom pre-training. |

Comparative Performance

| Model | Evaluation Loss | Perplexity (PPL) |
|---|---|---|
| Meitei-RoBERTa (Custom) | 4.1880 | 65.89 |
| mBERT (Baseline) | 5.8335 | 341.56 |
| MuRIL (Baseline) | 5.8740 | 355.65 |
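The perplexity column follows directly from the evaluation loss via PPL = exp(loss), which makes the comparison easy to verify:

```python
import math

eval_losses = {
    "Meitei-RoBERTa (Custom)": 4.1880,
    "mBERT (Baseline)": 5.8335,
    "MuRIL (Baseline)": 5.8740,
}

for name, loss in eval_losses.items():
    # Perplexity is the exponential of the mean cross-entropy loss.
    print(f"{name}: PPL = {math.exp(loss):.2f}")
# Reproduces the table above to within rounding of the reported losses
# (~65.89, ~341.55, ~355.67).
```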


💡 How to Use (For Inference and Fine-tuning)

This model can be loaded directly into any Hugging Face pipeline or used as the encoder in a custom model for fine-tuning.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# Using the lowercase repository ID for robust loading
REPO_ID = "MWirelabs/meitei-roberta"

# 1. Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
model = AutoModelForMaskedLM.from_pretrained(REPO_ID)

# 2. Example: tokenize text
# NOTE: Input text MUST be in the Bengali script, as the model was trained only on this script.
meitei_text = "আমি গতকাল স্কুল থেকে ফিরেছি। এই বইটি পড়তে ভাল লাগে।"
inputs = tokenizer(meitei_text, return_tensors="pt")

# 3. Example: fill-mask pipeline test (testing fluency)
unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)
# Masking a word in Bengali-script Meitei
results = unmasker("আমাদের দেশে <mask> অনেক সমস্যা আছে।")
# ... process results
```
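For downstream fine-tuning, the same checkpoint can be loaded with a task-specific head. The snippet below is a sketch for a hypothetical three-class sentiment task; the task and `num_labels` are illustrative, not part of this release.

```python
from transformers import AutoModelForSequenceClassification

# Hypothetical 3-class sentiment task; num_labels depends on your dataset.
classifier = AutoModelForSequenceClassification.from_pretrained(
    "MWirelabs/meitei-roberta", num_labels=3
)
# The classification head is randomly initialized and must be fine-tuned
# on labelled Meitei data before it produces meaningful predictions.
```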

Intended Use

This model is primarily intended for research purposes and as a foundational encoder for Meitei NLP downstream tasks, including:

  • Fine-tuning on sequence classification, token classification (NER), and summarization.

  • Feature extraction to generate high-quality Meitei text embeddings (see the sketch below).

The model is not intended for deployment in applications that require safety-critical decision-making without further domain-specific fine-tuning and validation.
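A minimal sketch of feature extraction via mean pooling over the encoder's last hidden state (mean pooling is one common choice, not a method prescribed by this card):

```python
import torch
from transformers import AutoModel, AutoTokenizer

REPO_ID = "MWirelabs/meitei-roberta"
tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
encoder = AutoModel.from_pretrained(REPO_ID)

sentences = ["আমাদের দেশে অনেক সমস্যা আছে।"]  # Bengali-script input, as required
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state     # (batch, seq_len, 768)

# Mean-pool over non-padding tokens to obtain one 768-d vector per sentence.
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # torch.Size([1, 768])
```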

Limitations and Bias

  • Script Dependence: This model was trained exclusively on the Bengali-script version of Meitei and will perform poorly on Meitei text written in the Meitei Mayek (Meetei Mayek) script.

  • Monolingual Focus: The model is not suitable for cross-lingual tasks without further fine-tuning.

📚 Citation

If you use this model or the Meitei corpus in your work, please cite it as:

@misc{mwirelabs_meitei_roberta_2025,
  title        = {Meitei-RoBERTa-Base (Bengali Script) Model},
  author       = {MWire Labs},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/MWirelabs/meitei-roberta}},
  note         = {RoBERTa Base pre-trained from scratch on the Meitei Monolingual Corpus}
}

About MWire Labs

MWire Labs builds ethical, region-first AI infrastructure for Northeast India—focusing on low-resource languages and public accessibility.

Learn more at www.mwirelabs.com


Contributions & Feedback

We welcome feedback, contributions, and civic collaborations.
Reach out via Hugging Face.
