# Kren-M™: Khasi–English Bilingual Language Model

Kren-M is a bilingual (Khasi–English) language model developed through extensive continued pre-training and supervised fine-tuning of Gemma 2 (2B). It is designed specifically for Khasi, a low-resource Austroasiatic language spoken in Meghalaya, Northeast India, while retaining the English fluency of its base model.
## Model Overview
- Base Model: google/gemma-2-2b
- Architecture: 2.6B parameters
- Languages: Khasi, English
- Context Length: 2048 tokens
- Precision: BFloat16
- License: Research Non-Commercial (inherits Gemma license)
## Key Highlights
- Bilingual understanding: Effective generation in Khasi and English
- Translation: Bidirectional English↔Khasi
- Conversation: Natural dialogue in Khasi with cultural tone
- Efficiency: 35.7% fewer tokens via custom tokenizer
## Training Summary

### Phase 1: Tokenizer Extension
- Base: Gemma-2-2B tokenizer (SentencePiece)
- Added Tokens: 2,135 Khasi-specific subwords
- Efficiency Gain: 35.7% fewer tokens (avg 101 vs 157)
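Two quick consistency checks on these figures: the 2,135 added subwords on top of Gemma-2's 256,000-token base vocabulary account for the 258,135-token vocabulary listed in the spec table, and the reported average token counts reproduce the 35.7% saving:

```python
# Sanity-check the reported tokenizer efficiency numbers.
base_vocab = 256_000      # Gemma-2 base vocabulary
added_tokens = 2_135      # Khasi-specific subwords added in Phase 1
print(base_vocab + added_tokens)  # → 258135, matching the spec table

base_avg = 157      # avg tokens/sentence, original Gemma-2-2B tokenizer
extended_avg = 101  # avg tokens/sentence, extended tokenizer
saving = (base_avg - extended_avg) / base_avg
print(f"{saving:.1%}")  # → 35.7%
```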
### Phase 2: Continued Pre-Training (CPT)
- Corpus: 5.43M cleaned Khasi sentences (~521M tokens)
- Epochs: 2 | Duration: 4 days (NVIDIA A40)
- Loss: 6.77 → 2.99 | Perplexity: ~19.9
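The reported perplexity follows directly from the final CPT loss, since perplexity is the exponential of the mean cross-entropy loss:

```python
import math

final_loss = 2.99  # final CPT cross-entropy loss from the card
perplexity = math.exp(final_loss)
print(round(perplexity, 1))  # → 19.9
```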
### Phase 3: Supervised Fine-Tuning (SFT)
- Dataset: 42,977 instruction pairs
  - 20K translation (Khasi↔English)
  - 15K English chat (Databricks Dolly)
  - 7,977 Khasi chat (native corpus)
- Method: LoRA + Gemma chat template
- Loss: 2.38 → 1.08 (train)
Final Model: [MWirelabs/Kren-M](https://huggingface.co/MWirelabs/Kren-M)
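The card does not state the LoRA rank or target modules behind the ~41M (CPT) and ~52M (SFT) adapter counts. As a rough illustration, each adapted weight matrix of shape (d_out × d_in) contributes r·(d_in + d_out) trainable parameters; the dimensions and rank below are assumptions for demonstration, not the model's actual configuration:

```python
def lora_params(r: int, d_in: int, d_out: int) -> int:
    """Trainable parameters added by one LoRA adapter pair (A: r x d_in, B: d_out x r)."""
    return r * (d_in + d_out)

# Illustrative values only; the card does not disclose rank or module shapes.
print(lora_params(r=16, d_in=2048, d_out=2048))  # → 65536 for one square projection
```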
## Capabilities

- Translation – accurate English↔Khasi with explicit instructions
- Conversation – context-aware Khasi dialogue
- Language Switching – responds in the correct language automatically
- Cultural Context – aware of local references such as Shillong, Umïam, etc.
Example Prompts:
- Translate to Khasi: Hello → Ka jingpdiang sngewbha ia phi.
- Translate to English: Khublei shibun → Thank you.
- Respond in Khasi: Kumno phi long mynta ka sngi? → Khublei shibun, nga don ha ka bor bad nga don ki thong kiba thymmai ban poi!
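Before generation, prompts like these are wrapped in Gemma-2's chat turn markers (the same raw format shown in the usage example in this card). A minimal helper, assuming single-turn prompts:

```python
def format_gemma_prompt(user_message: str) -> str:
    """Wrap a user message in Gemma-2 chat turn markers, leaving the model turn open."""
    return f"<start_of_turn>user\n{user_message}<end_of_turn>\n<start_of_turn>model\n"

prompt = format_gemma_prompt("Translate to Khasi: Hello")
print(prompt)
```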
## Technical Specs
| Attribute | Value |
|---|---|
| Base Model | Gemma-2-2B |
| Parameters | ~2.6B |
| Vocabulary | 258,135 tokens |
| Precision | BFloat16 |
| Memory (Inference) | ~6GB |
| LoRA Params (CPT) | ~41M |
| LoRA Params (SFT) | ~52M |
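The ~6 GB inference figure is consistent with holding ~2.6B BFloat16 weights (2 bytes per parameter) plus activation and KV-cache overhead; a back-of-envelope check:

```python
params = 2.6e9        # parameter count from the table above
bytes_per_param = 2   # BFloat16
weights_gb = params * bytes_per_param / 1e9
print(f"{weights_gb:.1f} GB")  # → 5.2 GB for weights; runtime overhead brings it to ~6 GB
```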
## Validation Summary
- Correct EOS termination: 95%+
- Controlled bilingual behavior (no unwanted translation)
- Minor verbosity in long responses
- Some factual gaps inherited from Gemma base
## Usage Example

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer; device_map="auto" places weights on GPU if available
model = AutoModelForCausalLM.from_pretrained(
    "MWirelabs/Kren-M", torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("MWirelabs/Kren-M")

# Gemma chat format: a closed user turn followed by an open model turn
prompt = "<start_of_turn>user\nTranslate to Khasi: Hello, how are you?<end_of_turn>\n<start_of_turn>model\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding for deterministic translation output
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Intended Use
- Khasi language education & preservation
- English↔Khasi translation systems
- Conversational AI for Northeast India
- Research on low-resource & endangered languages
## Limitations & Ethics
- Limited colloquial coverage (trained mainly on written Khasi)
- May not capture all dialectal variations
- Knowledge cutoff inherited from Gemma-2-2B
- Released for research & non-commercial use only
Ethical Note: Kren-M supports language preservation and digital inclusion for Khasi — a language recognized as vulnerable by UNESCO.
## Citation

```bibtex
@misc{kren-m-2025,
  title={Kren-M: A Bilingual Language Model for Khasi},
  author={MWire Labs},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/MWirelabs/Kren-M}
}
```
Developed by [MWire Labs, Shillong](https://mwirelabs.com) | #KrenM. Part of Northeast India's initiative for AI-driven language preservation.