> **Note:** This project represents a series of experimental attempts to create specialized Sanskrit language models. While significant effort was invested, the results did not meet the quality standards required for practical use. This repository is maintained for research documentation and learning purposes.
# Sanskrit Qwen2.5-7B Chat Model
This model is a fine-tuned version of diabolic6045/Sanskrit-Qwen2.5-7B-base specialized for Sanskrit language tasks including bidirectional translation (Sanskrit ↔ English) and transliteration (Devanagari to IAST). The model has been trained in two phases: initial Sanskrit language modeling followed by chat template enhancement for translation and transliteration capabilities.
## Model Description
This is a specialized version of Qwen2.5-7B that has been enhanced for Sanskrit language processing through a two-phase training approach:
**Phase 1: Sanskrit Language Model Training**
- Trained on 664,104 cleaned Sanskrit texts from 9 different sources
- Enhanced with Sanskrit-specific punctuation tokens (। ॥)
- Custom tokenizer with proper Devanagari script support
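The tokenizer-extension code itself is not part of this repository; the following is a minimal sketch of how the danda tokens could be added with standard `transformers` APIs (the multiple-of-32 padding mirrors the `resize_token_embeddings_to_32x` option in the Phase 1 config further down):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Sketch only: add the Sanskrit danda and double danda as dedicated tokens.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B", trust_remote_code=True)
num_added = tokenizer.add_tokens(["।", "॥"])

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B", trust_remote_code=True)
if num_added > 0:
    # Pad the embedding matrix to a multiple of 32, as in the axolotl
    # option resize_token_embeddings_to_32x used for Phase 1.
    model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=32)
```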
**Phase 2: Chat Template Enhancement**
- Post-trained on 94,363 bidirectional translation and transliteration examples
- Chat template format for conversation-style interactions
- Specialized system prompts for different task types
**Key Features:**
- Base Model: Qwen/Qwen2.5-7B (7 billion parameters)
- Training Method: Two-phase LoRA fine-tuning
- Languages: Sanskrit (Devanagari script) and English
- Tasks: Translation, Transliteration, Text Generation
- Architecture: Transformer-based with LoRA adapters
**Capabilities:**
- Bidirectional Translation: Sanskrit ↔ English translation
- Transliteration: Devanagari script to IAST format
- Sanskrit Text Generation: Coherent Sanskrit verse and prose generation
- Chat Interface: Conversational responses with task understanding
- Context Preservation: Maintains meaning during translation tasks
## Training and Evaluation Data

### Training Datasets

**Phase 1 - Sanskrit Language Model:**
- Dataset: diabolic6045/Sanskrit-shlok-collection
- Size: 664,104 unique Sanskrit texts
- Sources: 9 different Sanskrit literature sources
- Format: Clean text completion format
- Content: Classical Sanskrit texts, verses, and prose
**Phase 2 - Chat Enhancement:**
- Dataset: diabolic6045/Sanskrit-transliteration-chat-dataset
- Size: 94,363 conversation examples
- Tasks:
  - Sanskrit to English translation: 20,205 examples
  - English to Sanskrit translation: 20,205 examples
  - Transliteration (Devanagari to IAST): 53,953 examples
- Format: Chat template with system prompts
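For reference, given the `field_messages: messages` mapping with `role`/`content` keys in the Phase 2 config below, a single training record in the JSONL file would look roughly like this (the texts shown are illustrative, not taken from the dataset):

```json
{
  "messages": [
    {"role": "system", "content": "You are a Sanskrit transliteration expert. Convert the given Sanskrit text from Devanagari script to IAST (International Alphabet of Sanskrit Transliteration) format."},
    {"role": "user", "content": "Transliterate this Sanskrit text to IAST: नमस्ते संस्कृत भाषा ।"},
    {"role": "assistant", "content": "namaste saṃskṛta bhāṣā ।"}
  ]
}
```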
### Evaluation Results

The model was evaluated on a held-out test set of 500 samples covering three task types:
| Task Type | Samples | Exact Match | Jaccard Similarity | BLEU Score | Character Accuracy |
|---|---|---|---|---|---|
| Sanskrit → English | 114 | 0.0% | 30.9% | 10.9% | - |
| English → Sanskrit | 101 | 0.0% | 10.3% | 4.1% | - |
| Transliteration | 285 | 29.1% | - | - | 96.7% |
| Overall | 500 | - | - | - | - |
**Key Insights:**
- 100% Success Rate: All test samples processed without failures
- High Transliteration Accuracy: 96.7% character-level accuracy for Devanagari to IAST conversion
- Weak Translation Performance: exact-match accuracy is 0% in both translation directions, and Jaccard/BLEU scores are low
- Strong Character Recognition: Excellent handling of Sanskrit script and punctuation
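The evaluation scripts are not published here; the following is a minimal sketch of how the reported overlap metrics are commonly computed (the actual evaluation may define them differently):

```python
import difflib

def jaccard_similarity(reference: str, hypothesis: str) -> float:
    """Word-level Jaccard similarity: |intersection| / |union| of token sets."""
    ref, hyp = set(reference.split()), set(hypothesis.split())
    if not ref and not hyp:
        return 1.0
    return len(ref & hyp) / len(ref | hyp)

def character_accuracy(reference: str, hypothesis: str) -> float:
    """Character-level similarity from matching blocks: 2*M / (len(a) + len(b))."""
    return difflib.SequenceMatcher(None, reference, hypothesis).ratio()
```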
## Usage

### Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "diabolic6045/Sanskrit-Qwen2.5-7B-chat"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Sanskrit to English Translation
def translate_sanskrit_to_english(sanskrit_text):
    messages = [
        {"role": "system", "content": "You are a Sanskrit to English translation expert. Translate the given Sanskrit text accurately while preserving the meaning and context."},
        {"role": "user", "content": f"Translate this Sanskrit text to English: {sanskrit_text}"},
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
    response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return response.strip()

# English to Sanskrit Translation
def translate_english_to_sanskrit(english_text):
    messages = [
        {"role": "system", "content": "You are an English to Sanskrit translation expert. Translate the given English text accurately into Sanskrit while preserving the meaning and context."},
        {"role": "user", "content": f"Translate this English text to Sanskrit: {english_text}"},
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
    response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return response.strip()

# Transliteration (Devanagari to IAST)
def transliterate_to_iast(devanagari_text):
    messages = [
        {"role": "system", "content": "You are a Sanskrit transliteration expert. Convert the given Sanskrit text from Devanagari script to IAST (International Alphabet of Sanskrit Transliteration) format."},
        {"role": "user", "content": f"Transliterate this Sanskrit text to IAST: {devanagari_text}"},
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
    response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return response.strip()

# Example usage
sanskrit_text = "नमस्ते संस्कृत भाषा ।"
english_translation = translate_sanskrit_to_english(sanskrit_text)
print(f"Sanskrit: {sanskrit_text}")
print(f"English: {english_translation}")

iast_transliteration = transliterate_to_iast(sanskrit_text)
print(f"IAST: {iast_transliteration}")
```
### Chat Interface
```python
def chat_with_sanskrit_model(user_input, task_type="translation"):
    if task_type == "translation":
        system_prompt = "You are a Sanskrit to English translation expert. Translate the given Sanskrit text accurately while preserving the meaning and context."
    elif task_type == "transliteration":
        system_prompt = "You are a Sanskrit transliteration expert. Convert the given Sanskrit text from Devanagari script to IAST format."
    else:
        system_prompt = "You are a Sanskrit language expert. Help with Sanskrit text processing tasks."

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_input},
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
    response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return response.strip()
```
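For example, reusing the model and tokenizer loaded above (output will vary because sampling is enabled):

```python
reply = chat_with_sanskrit_model(
    "Transliterate this Sanskrit text to IAST: नमस्ते संस्कृत भाषा ।",
    task_type="transliteration",
)
print(reply)
```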
## Training Procedure

### Phase 1: Sanskrit Language Model Training

**Hyperparameters:**
- learning_rate: 0.0003
- train_batch_size: 2
- gradient_accumulation_steps: 2
- num_epochs: 3
- sequence_len: 1024
- lora_r: 32
- lora_alpha: 64
- optimizer: adamw_bnb_8bit
- lr_scheduler: cosine
### Phase 2: Chat Template Enhancement

**Hyperparameters:**
- learning_rate: 0.0002
- train_batch_size: 2
- gradient_accumulation_steps: 4
- num_epochs: 3
- sequence_len: 1024
- lora_r: 32
- lora_alpha: 16
- optimizer: adamw_bnb_8bit
- lr_scheduler: cosine
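For reference, assuming standard data parallelism across the two RTX 4090s listed under Training Results, these settings give an effective global batch size of `train_batch_size × gradient_accumulation_steps × num_gpus`: 2 × 2 × 2 = 8 for Phase 1 and 2 × 4 × 2 = 16 for Phase 2.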
### Training Results

**Phase 1 Results:**
- Final Loss: ~0.15
- Training Time: ~8 hours (2x RTX 4090)
- Memory Usage: ~24GB per GPU
**Phase 2 Results:**
- Final Loss: ~0.12
- Training Time: ~4 hours (2x RTX 4090)
- Memory Usage: ~20GB per GPU
<details><summary>See Phase 1 axolotl config</summary>

axolotl version: `0.12.2`

```yaml
# Base model configuration
base_model: Qwen/Qwen2.5-7B
trust_remote_code: true

# Custom Sanskrit tokenizer configuration
tokenizer_config: sanskrit_tokenizer
tokenizer_type: AutoTokenizer
tokenizer_use_fast: true
tokenizer_legacy: false
tokenizer_save_jinja_files: true

# Resize embeddings for new tokens if needed
resize_token_embeddings_to_32x: true
mean_resizing_embeddings: false
shrink_embeddings: false

# Dataset configuration for text completion
datasets:
  - path: sanskrit_clean_dataset.jsonl
    type: completion
    field: text

# Dataset preparation
dataset_prepared_path:
val_set_size: 0.1
output_dir: ./outputs/sanskrt-qwen-lora

# Sequence configuration
sequence_len: 1024
sample_packing: true
eval_sample_packing: true

# LoRA configuration
adapter: lora
lora_r: 32
lora_alpha: 64
lora_dropout: 0.05
lora_target_linear: true
lora_target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

gradient_accumulation_steps: 2
micro_batch_size: 2
num_epochs: 3
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0003

# Precision configuration
bf16: auto
tf32: true

# Memory optimization
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
flash_attention: true

# Training schedule
warmup_ratio: 0.1
evals_per_epoch: 4
saves_per_epoch: 1
weight_decay: 0.0

hub_model_id: diabolic6045/sanskrit-qwen2.5-7b-base-lora
```

</details>
<details><summary>See Phase 2 axolotl config</summary>

axolotl version: `0.12.2`

```yaml
# Post-Training Configuration for Sanskrit Translation & Transliteration Enhancement
base_model: diabolic6045/Sanskrit-Qwen2.5-7B-base

chat_template: qwen_25
datasets:
  - path: sanskrit_train_set.jsonl
    type: chat_template
    field_messages: messages
    message_property_mappings:
      role: role
      content: content
    roles:
      system:
        - system
      user:
        - user
      assistant:
        - assistant

val_set_size: 0.01
output_dir: ./outputs/sanskrit-chat

adapter: lora
sequence_len: 1024
pad_to_sequence_len: false
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

bf16: true
tf32: true
gradient_checkpointing: true
logging_steps: 1
flash_attention: true

warmup_ratio: 0.1
evals_per_epoch: 1
saves_per_epoch: 1
weight_decay: 0.0

hub_model_id: diabolic6045/Sanskrit-Qwen2.5-7B-chat-lora
```

</details>
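Both configs push LoRA adapters to the Hub (`hub_model_id` above) rather than merged weights. If you want to work with an adapter directly instead of the merged chat model, here is a sketch using PEFT (assuming the adapter repo loads this way):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the Phase 1 base model, then attach the Phase 2 chat adapter.
base = AutoModelForCausalLM.from_pretrained(
    "diabolic6045/Sanskrit-Qwen2.5-7B-base",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "diabolic6045/Sanskrit-Qwen2.5-7B-chat-lora")

# Optionally fold the adapter into the base weights for faster inference.
model = model.merge_and_unload()
```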
## Framework Versions
- PEFT 0.17.0
- Transformers 4.55.2
- PyTorch 2.7.1+cu128
- Datasets 4.0.0
- Tokenizers 0.21.2
- Axolotl 0.12.2
## Limitations and Future Work

### Current Limitations
- Translation Accuracy: Exact match accuracy is low, indicating room for improvement in precise translation
- Context Understanding: Limited ability to handle complex contextual nuances in translation
- Domain Specificity: Performance may vary across different Sanskrit literature genres
### Future Improvements
- Larger Training Dataset: Expand with more diverse Sanskrit texts and translations
- Advanced Training Techniques: Implement DPO (Direct Preference Optimization) for better alignment
- Multimodal Capabilities: Add vision capabilities for Sanskrit manuscript processing
- Specialized Fine-tuning: Domain-specific fine-tuning for different Sanskrit literature types
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{sanskrit-qwen2.5-7b-chat,
  title={Sanskrit Qwen2.5-7B Chat Model: A Specialized Language Model for Sanskrit Translation and Transliteration},
  author={[Your Name]},
  year={2024},
  url={https://huggingface.co/diabolic6045/Sanskrit-Qwen2.5-7B-chat}
}
```
## License
This model is released under the Apache 2.0 License. See the LICENSE file for more details.
## Acknowledgments
- Base Model: Qwen/Qwen2.5-7B by Alibaba Cloud
- Training Framework: Axolotl by OpenAccess AI Collective
- Sanskrit Datasets: Various Sanskrit literature sources and translation corpora
- Community: Sanskrit NLP research community for datasets and feedback