# Kren-M™: Khasi–English Bilingual Language Model

Kren-M is a bilingual (Khasi–English) language model developed through extensive continued pre-training and supervised fine-tuning of Gemma 2 (2B). It is designed specifically for Khasi, a low-resource Austroasiatic language spoken in Meghalaya, Northeast India, while retaining the English fluency of its base model.
## Model Overview
- Base Model: google/gemma-2-2b
- Architecture: 2.6B parameters
- Languages: Khasi, English
- Context Length: 2048 tokens
- Precision: BFloat16
- License: Research Non-Commercial (inherits Gemma license)
## Key Highlights
- Bilingual understanding: Effective generation in Khasi and English
- Translation: Bidirectional English↔Khasi
- Conversation: Natural dialogue in Khasi with cultural tone
- Efficiency: 35.7% fewer tokens via custom tokenizer
## Training Summary

### Phase 1: Tokenizer Extension
- Base: Gemma-2-2B tokenizer (SentencePiece)
- Added Tokens: 2,135 Khasi-specific subwords
- Efficiency Gain: 35.7% fewer tokens (avg 101 vs 157)
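Two quick consistency checks on these figures: the 2,135 added subwords on top of Gemma-2's 256,000-token base vocabulary account for the 258,135-token vocabulary listed in the spec table, and the reported average token counts reproduce the 35.7% saving:

```python
# Sanity-check the reported tokenizer efficiency numbers.
base_vocab = 256_000      # Gemma-2 base vocabulary
added_tokens = 2_135      # Khasi-specific subwords added in Phase 1
print(base_vocab + added_tokens)  # → 258135, matching the spec table

base_avg = 157      # avg tokens/sentence, original Gemma-2-2B tokenizer
extended_avg = 101  # avg tokens/sentence, extended tokenizer
saving = (base_avg - extended_avg) / base_avg
print(f"{saving:.1%}")  # → 35.7%
```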
### Phase 2: Continued Pre-Training (CPT)
- Corpus: 5.43M cleaned Khasi sentences (~521M tokens)
- Epochs: 2 | Duration: 4 days (NVIDIA A40)
- Loss: 6.77 → 2.99 | Perplexity: ~19.9
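The reported perplexity follows directly from the final CPT loss, since perplexity is the exponential of the mean cross-entropy loss:

```python
import math

final_loss = 2.99  # final CPT cross-entropy loss from the card
perplexity = math.exp(final_loss)
print(round(perplexity, 1))  # → 19.9
```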
### Phase 3: Supervised Fine-Tuning (SFT)
- Dataset: 42,977 instruction pairs
  - 20K translation (Khasi↔English)
  - 15K English chat (Databricks Dolly)
  - 7,977 Khasi chat (native corpus)
- Method: LoRA + Gemma chat template
- Loss: 2.38 → 1.08 (train)
Final Model: [MWirelabs/Kren-M](https://huggingface.co/MWirelabs/Kren-M)
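The card does not state the LoRA rank or target modules behind the ~41M (CPT) and ~52M (SFT) adapter counts. As a rough illustration, each adapted weight matrix of shape (d_out × d_in) contributes r·(d_in + d_out) trainable parameters; the dimensions and rank below are assumptions for demonstration, not the model's actual configuration:

```python
def lora_params(r: int, d_in: int, d_out: int) -> int:
    """Trainable parameters added by one LoRA adapter pair (A: r x d_in, B: d_out x r)."""
    return r * (d_in + d_out)

# Illustrative values only; the card does not disclose rank or module shapes.
print(lora_params(r=16, d_in=2048, d_out=2048))  # → 65536 for one square projection
```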
## Capabilities

- Translation – accurate English↔Khasi with explicit instructions
- Conversation – context-aware Khasi dialogue
- Language Switching – responds in the correct language automatically
- Cultural Context – aware of local references such as Shillong, Umïam, etc.
Example Prompts:
- Translate to Khasi: Hello → Ka jingpdiang sngewbha ia phi.
- Translate to English: Khublei shibun → Thank you.
- Respond in Khasi: Kumno phi long mynta ka sngi? → Khublei shibun, nga don ha ka bor bad nga don ki thong kiba thymmai ban poi!
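Before generation, prompts like these are wrapped in Gemma-2's chat turn markers (the same raw format shown in the usage example in this card). A minimal helper, assuming single-turn prompts:

```python
def format_gemma_prompt(user_message: str) -> str:
    """Wrap a user message in Gemma-2 chat turn markers, leaving the model turn open."""
    return f"<start_of_turn>user\n{user_message}<end_of_turn>\n<start_of_turn>model\n"

prompt = format_gemma_prompt("Translate to Khasi: Hello")
print(prompt)
```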
## Technical Specs
| Attribute | Value |
|---|---|
| Base Model | Gemma-2-2B |
| Parameters | ~2.6B |
| Vocabulary | 258,135 tokens |
| Precision | BFloat16 |
| Memory (Inference) | ~6GB |
| LoRA Params (CPT) | ~41M |
| LoRA Params (SFT) | ~52M |
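The ~6 GB inference figure is consistent with holding ~2.6B BFloat16 weights (2 bytes per parameter) plus activation and KV-cache overhead; a back-of-envelope check:

```python
params = 2.6e9        # parameter count from the table above
bytes_per_param = 2   # BFloat16
weights_gb = params * bytes_per_param / 1e9
print(f"{weights_gb:.1f} GB")  # → 5.2 GB for weights; runtime overhead brings it to ~6 GB
```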
## Validation Summary
- Correct EOS termination: 95%+
- Controlled bilingual behavior (no unwanted translation)
- Minor verbosity in long responses
- Some factual gaps inherited from Gemma base
## Usage Example

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer; device_map="auto" places weights on GPU if available
model = AutoModelForCausalLM.from_pretrained(
    "MWirelabs/Kren-M", torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("MWirelabs/Kren-M")

# Gemma chat format: a closed user turn followed by an open model turn
prompt = "<start_of_turn>user\nTranslate to Khasi: Hello, how are you?<end_of_turn>\n<start_of_turn>model\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding for deterministic translation output
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Intended Use
- Khasi language education & preservation
- English↔Khasi translation systems
- Conversational AI for Northeast India
- Research on low-resource & endangered languages
## Limitations & Ethics
- Limited colloquial coverage (trained mainly on written Khasi)
- May not capture all dialectal variations
- Knowledge cutoff inherited from Gemma-2-2B
- Released for research & non-commercial use only
Ethical Note: Kren-M supports language preservation and digital inclusion for Khasi — a language recognized as vulnerable by UNESCO.
## Citation

```bibtex
@misc{kren-m-2025,
  title={Kren-M: A Bilingual Language Model for Khasi},
  author={MWire Labs},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/MWirelabs/Kren-M}
}
```
Developed by [MWire Labs, Shillong](https://mwirelabs.com) | #KrenM. Part of Northeast India's initiative for AI-driven language preservation.