
Kren-M™: Khasi–English Bilingual Language Model

Kren-M is a bilingual (Khasi–English) language model developed through extensive continued pre-training and supervised fine-tuning of Gemma 2 (2B). It is designed specifically for the Khasi language — a low-resource Austroasiatic language spoken in Meghalaya, Northeast India — while retaining the English fluency of its base model.


Model Overview

  • Base Model: google/gemma-2-2b
  • Architecture: Gemma 2, ~2.6B parameters
  • Languages: Khasi, English
  • Context Length: 2048 tokens
  • Precision: BFloat16
  • License: Research Non-Commercial (inherits Gemma license)

Key Highlights

  • Bilingual understanding: Effective generation in Khasi and English
  • Translation: Bidirectional English↔Khasi
  • Conversation: Natural dialogue in Khasi with cultural tone
  • Efficiency: 35.7% fewer tokens via custom tokenizer

Training Summary

Phase 1: Tokenizer Extension

  • Base: Gemma-2-2B tokenizer (SentencePiece)
  • Added Tokens: 2,135 Khasi-specific subwords
  • Efficiency Gain: 35.7% fewer tokens (avg 101 vs 157)
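The quoted 35.7% figure follows directly from the average per-sentence token counts reported above (157 with the stock tokenizer vs. 101 with the extended one):

```python
# Relative token savings from the extended tokenizer, using the
# average per-sentence counts reported in the card (157 vs. 101).
base_avg, extended_avg = 157, 101
savings = (base_avg - extended_avg) / base_avg
print(f"{savings:.1%}")  # → 35.7%
```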

Phase 2: Continued Pre-Training (CPT)

  • Corpus: 5.43M cleaned Khasi sentences (~521M tokens)
  • Epochs: 2 | Duration: 4 days (NVIDIA A40)
  • Loss: 6.77 → 2.99 | Perplexity: ~19.9
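The reported perplexity is consistent with the final CPT loss, since perplexity is the exponential of the cross-entropy loss:

```python
import math

# Perplexity is exp(cross-entropy loss); the final CPT loss was 2.99.
final_cpt_loss = 2.99
print(round(math.exp(final_cpt_loss), 1))  # → 19.9
```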

Phase 3: Supervised Fine-Tuning (SFT)

  • Dataset: 42,977 instruction pairs

    • 20K Translation (Khasi↔English)
    • 15K English Chat (Databricks Dolly)
    • 7,977 Khasi Chat (Native corpus)
  • Method: LoRA + Gemma chat template

  • Loss: 2.38 → 1.08 (train)

  • Final Model: MWirelabs/Kren-M
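The card states that SFT used LoRA with the Gemma chat template but does not publish the adapter hyperparameters. The sketch below shows what such a setup might look like with the peft library; the rank, alpha, dropout, and target modules are illustrative assumptions, not the values actually used:

```python
from peft import LoraConfig

# Illustrative adapter config -- r, lora_alpha, lora_dropout, and
# target_modules are assumptions; the card does not publish them.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
# The adapter would then be attached with peft.get_peft_model(base_model,
# lora_config) and trained on the instruction pairs formatted with the
# Gemma chat template.
```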


Capabilities

  • Translation – Accurate English↔Khasi with explicit instructions
  • Conversation – Context-aware Khasi dialogue
  • Language Switching – Responds in the correct language automatically
  • Cultural Context – Aware of local references such as Shillong, Umïam, etc.

Example Prompts:

Translate to Khasi: Hello → Ka jingpdiang sngewbha ia phi.
Translate to English: Khublei shibun → Thank you.
Respond in Khasi: Kumno phi long mynta ka sngi? → Khublei shibun, nga don ha ka bor bad nga don ki thong kiba thymmai ban poi!

Technical Specs

| Attribute | Value |
| --- | --- |
| Base Model | Gemma-2-2B |
| Parameters | ~2.6B |
| Vocabulary | 258,135 tokens |
| Precision | BFloat16 |
| Memory (Inference) | ~6 GB |
| LoRA Params (CPT) | ~41M |
| LoRA Params (SFT) | ~52M |
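As a sanity check on the ~6 GB inference figure: BF16 stores each parameter in 2 bytes, so the weights alone occupy roughly 4.8 GiB, with the remainder going to activations and the KV cache:

```python
# BF16 = 2 bytes per parameter; the model has ~2.6B parameters.
params = 2.6e9
weights_gib = params * 2 / 1024**3
print(f"{weights_gib:.1f} GiB")  # → 4.8 GiB for weights alone
```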

Validation Summary

  • Correct EOS termination: 95%+
  • Controlled bilingual behavior (no unwanted translation)
  • Minor verbosity in long responses
  • Some factual gaps inherited from Gemma base

Usage Example

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained("MWirelabs/Kren-M", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("MWirelabs/Kren-M")

# Gemma chat template: a user turn followed by the start of the model turn
prompt = "<start_of_turn>user\nTranslate to Khasi: Hello, how are you?<end_of_turn>\n<start_of_turn>model\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding for deterministic translation output
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Intended Use

  • Khasi language education & preservation
  • English↔Khasi translation systems
  • Conversational AI for Northeast India
  • Research on low-resource & endangered languages

Limitations & Ethics

  • Limited colloquial coverage (trained mainly on written Khasi)
  • May not capture all dialectal variations
  • Knowledge cutoff inherited from Gemma-2-2B
  • Released for research & non-commercial use only

Ethical Note: Kren-M supports language preservation and digital inclusion for Khasi — a language recognized as vulnerable by UNESCO.


Citation

```bibtex
@misc{kren-m-2025,
  title={Kren-M: A Bilingual Language Model for Khasi},
  author={MWire Labs},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/MWirelabs/Kren-M}
}
```

Developed by MWire Labs, Shillong (https://mwirelabs.com) | #KrenM
Part of Northeast India's initiative for AI-driven language preservation.
