Note: This project represents a series of experimental attempts to create specialized Sanskrit language models. While significant effort was invested, the results did not meet the quality standards required for practical use. This repository is maintained for research documentation and learning purposes.

Sanskrit Qwen2.5-7B Chat Model

This model is a fine-tuned version of diabolic6045/Sanskrit-Qwen2.5-7B-base specialized for Sanskrit language tasks including bidirectional translation (Sanskrit ↔ English) and transliteration (Devanagari to IAST). The model has been trained in two phases: initial Sanskrit language modeling followed by chat template enhancement for translation and transliteration capabilities.

Model Description

This is a specialized version of Qwen2.5-7B that has been enhanced for Sanskrit language processing through a two-phase training approach:

Phase 1: Sanskrit Language Model Training

  • Trained on 664,104 unique, cleaned Sanskrit texts drawn from 9 different sources
  • Enhanced with Sanskrit-specific punctuation tokens (। ॥), as illustrated in the tokenizer sketch below
  • Custom tokenizer with proper Devanagari script support
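
The released tokenizer already ships with these additions; the sketch below only illustrates, using the standard Hugging Face API, how such punctuation tokens can be registered. It is a minimal illustration, not the exact training script.

from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative only: the released Sanskrit tokenizer already contains these tokens.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B", trust_remote_code=True)

# Register the danda (।) and double danda (॥) as dedicated tokens so they are
# not split into byte-level pieces.
num_added = tokenizer.add_tokens(["।", "॥"])

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B", trust_remote_code=True)
if num_added > 0:
    # The embedding matrix must grow to cover the newly added token ids.
    model.resize_token_embeddings(len(tokenizer))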

Phase 2: Chat Template Enhancement

  • Post-trained on 94,363 bidirectional translation and transliteration examples
  • Chat template format for conversation-style interactions
  • Specialized system prompts for different task types

Key Features:

  • Base Model: Qwen/Qwen2.5-7B (7 billion parameters)
  • Training Method: Two-phase LoRA fine-tuning
  • Languages: Sanskrit (Devanagari script) and English
  • Tasks: Translation, Transliteration, Text Generation
  • Architecture: Transformer-based with LoRA adapters

Capabilities:

  • Bidirectional Translation: Sanskrit ↔ English translation
  • Transliteration: Devanagari script to IAST format
  • Sanskrit Text Generation: Coherent Sanskrit verse and prose generation
  • Chat Interface: Conversational responses with task understanding
  • Context Preservation: Maintains meaning during translation tasks

Training and Evaluation Data

Training Datasets

Phase 1 - Sanskrit Language Model:

  • Dataset: diabolic6045/Sanskrit-shlok-collection
  • Size: 664,104 unique Sanskrit texts
  • Sources: 9 different Sanskrit literature sources
  • Format: Clean text completion format
  • Content: Classical Sanskrit texts, verses, and prose
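
To inspect the Phase 1 corpus, it can be loaded from the Hub with the datasets library. This is a minimal sketch; the split and the "text" field name are assumptions based on the completion-format configuration shown later.

from datasets import load_dataset

# Load the Phase 1 Sanskrit corpus (split and field name assumed).
shlok_dataset = load_dataset("diabolic6045/Sanskrit-shlok-collection", split="train")
print(len(shlok_dataset))
print(shlok_dataset[0]["text"])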

Phase 2 - Chat Enhancement:

  • Dataset: diabolic6045/Sanskrit-transliteration-chat-dataset
  • Size: 94,363 conversation examples
  • Tasks:
    • Sanskrit to English translation: 20,205 examples
    • English to Sanskrit translation: 20,205 examples
    • Transliteration (Devanagari to IAST): 53,953 examples
  • Format: Chat template with system prompts
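
For reference, a single training record in this chat-template format looks roughly like the following. This is an illustrative example, not a verbatim entry from the dataset.

{
  "messages": [
    {"role": "system", "content": "You are a Sanskrit transliteration expert. Convert the given Sanskrit text from Devanagari script to IAST (International Alphabet of Sanskrit Transliteration) format."},
    {"role": "user", "content": "Transliterate this Sanskrit text to IAST: नमस्ते"},
    {"role": "assistant", "content": "namaste"}
  ]
}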

Evaluation Results

The model was evaluated on a held-out test set of 500 samples spanning three task types:

Task Type             Samples   Exact Match   Jaccard Similarity   BLEU Score   Character Accuracy
Sanskrit → English    114       0.0%          30.9%                10.9%        -
English → Sanskrit    101       0.0%          10.3%                4.1%         -
Transliteration       285       29.1%         -                    -            96.7%
Overall               500       -             -                    -            -

Key Insights:

  • 100% Success Rate: All test samples processed without failures
  • High Transliteration Accuracy: 96.7% character-level accuracy for Devanagari to IAST conversion
  • Limited Translation Performance: 0% exact-match accuracy and low BLEU scores leave substantial room for improvement in precise translation
  • Strong Character Recognition: Excellent handling of Sanskrit script and punctuation
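
The exact evaluation script is not reproduced here; the similarity metrics in the table above can be approximated along these lines (a rough sketch under assumed definitions, not the implementation actually used):

def jaccard_similarity(prediction: str, reference: str) -> float:
    """Word-level Jaccard overlap between prediction and reference."""
    pred, ref = set(prediction.split()), set(reference.split())
    if not pred and not ref:
        return 1.0
    return len(pred & ref) / len(pred | ref)

def character_accuracy(prediction: str, reference: str) -> float:
    """Share of matching characters at aligned positions, penalising length mismatch."""
    if not prediction and not reference:
        return 1.0
    matches = sum(p == r for p, r in zip(prediction, reference))
    return matches / max(len(prediction), len(reference))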

Usage

Basic Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "diabolic6045/Sanskrit-Qwen2.5-7B-chat"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Sanskrit to English Translation
def translate_sanskrit_to_english(sanskrit_text):
    messages = [
        {"role": "system", "content": "You are a Sanskrit to English translation expert. Translate the given Sanskrit text accurately while preserving the meaning and context."},
        {"role": "user", "content": f"Translate this Sanskrit text to English: {sanskrit_text}"}
    ]
    
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
    
    response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    return response.strip()

# English to Sanskrit Translation
def translate_english_to_sanskrit(english_text):
    messages = [
        {"role": "system", "content": "You are an English to Sanskrit translation expert. Translate the given English text accurately into Sanskrit while preserving the meaning and context."},
        {"role": "user", "content": f"Translate this English text to Sanskrit: {english_text}"}
    ]
    
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
    
    response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    return response.strip()

# Transliteration (Devanagari to IAST)
def transliterate_to_iast(devanagari_text):
    messages = [
        {"role": "system", "content": "You are a Sanskrit transliteration expert. Convert the given Sanskrit text from Devanagari script to IAST (International Alphabet of Sanskrit Transliteration) format."},
        {"role": "user", "content": f"Transliterate this Sanskrit text to IAST: {devanagari_text}"}
    ]
    
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
    
    response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    return response.strip()

# Example usage
sanskrit_text = "नमस्ते संस्कृत भाषा ।"
english_translation = translate_sanskrit_to_english(sanskrit_text)
print(f"Sanskrit: {sanskrit_text}")
print(f"English: {english_translation}")

iast_transliteration = transliterate_to_iast(sanskrit_text)
print(f"IAST: {iast_transliteration}")

Chat Interface

def chat_with_sanskrit_model(user_input, task_type="translation"):
    if task_type == "translation":
        system_prompt = "You are a Sanskrit to English translation expert. Translate the given Sanskrit text accurately while preserving the meaning and context."
    elif task_type == "transliteration":
        system_prompt = "You are a Sanskrit transliteration expert. Convert the given Sanskrit text from Devanagari script to IAST format."
    else:
        system_prompt = "You are a Sanskrit language expert. Help with Sanskrit text processing tasks."
    
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_input}
    ]
    
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
    
    response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    return response.strip()
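
Example call to the helper above (output will vary because sampling is enabled):

reply = chat_with_sanskrit_model(
    "Transliterate this Sanskrit text to IAST: धर्मक्षेत्रे कुरुक्षेत्रे",
    task_type="transliteration",
)
print(reply)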

Training Procedure

Phase 1: Sanskrit Language Model Training

Hyperparameters:

  • learning_rate: 0.0003
  • train_batch_size: 2
  • gradient_accumulation_steps: 2
  • num_epochs: 3
  • sequence_len: 1024
  • lora_r: 32
  • lora_alpha: 64
  • optimizer: adamw_bnb_8bit
  • lr_scheduler: cosine
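
For readers more familiar with PEFT than Axolotl, the Phase 1 LoRA settings correspond roughly to the configuration below. This is an approximate sketch; training was actually launched through Axolotl with the configs shown later.

from peft import LoraConfig

# Approximate PEFT equivalent of the Phase 1 adapter settings.
phase1_lora = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
# Phase 2 reuses the same structure with lora_alpha=16 and
# gradient_accumulation_steps=4 (see the hyperparameters below).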

Phase 2: Chat Template Enhancement

Hyperparameters:

  • learning_rate: 0.0002
  • train_batch_size: 2
  • gradient_accumulation_steps: 4
  • num_epochs: 3
  • sequence_len: 1024
  • lora_r: 32
  • lora_alpha: 16
  • optimizer: adamw_bnb_8bit
  • lr_scheduler: cosine

Training Results

Phase 1 Results:

  • Final Loss: ~0.15
  • Training Time: ~8 hours (2x RTX 4090)
  • Memory Usage: ~24GB per GPU

Phase 2 Results:

  • Final Loss: ~0.12
  • Training Time: ~4 hours (2x RTX 4090)
  • Memory Usage: ~20GB per GPU

This model was trained with Axolotl.

Phase 1 axolotl config

axolotl version: 0.12.2

# Base model configuration
base_model: Qwen/Qwen2.5-7B
trust_remote_code: true

# Custom Sanskrit tokenizer configuration
tokenizer_config: sanskrit_tokenizer
tokenizer_type: AutoTokenizer
tokenizer_use_fast: true
tokenizer_legacy: false
tokenizer_save_jinja_files: true

# Resize embeddings for new tokens if needed
resize_token_embeddings_to_32x: true
mean_resizing_embeddings: false
shrink_embeddings: false

# Dataset configuration for text completion
datasets:
  - path: sanskrit_clean_dataset.jsonl
    type: completion
    field: text

# Dataset preparation
dataset_prepared_path: 
val_set_size: 0.1
output_dir: ./outputs/sanskrt-qwen-lora

# Sequence configuration
sequence_len: 1024
sample_packing: true
eval_sample_packing: true

# LoRA configuration
adapter: lora
lora_r: 32
lora_alpha: 64
lora_dropout: 0.05
lora_target_linear: true
lora_target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

gradient_accumulation_steps: 2
micro_batch_size: 2
num_epochs: 3
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0003

# Precision configuration
bf16: auto
tf32: true

# Memory optimization
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
flash_attention: true

# Training schedule
warmup_ratio: 0.1
evals_per_epoch: 4
saves_per_epoch: 1
weight_decay: 0.0

hub_model_id: diabolic6045/sanskrit-qwen2.5-7b-base-lora

Phase 2 axolotl config

axolotl version: 0.12.2

# Post-Training Configuration for Sanskrit Translation & Transliteration Enhancement
base_model: diabolic6045/Sanskrit-Qwen2.5-7B-base

chat_template: qwen_25
datasets:
  - path: sanskrit_train_set.jsonl
    type: chat_template
    field_messages: messages
    message_property_mappings:
      role: role
      content: content
    roles:
      system:
        - system
      user:
        - user
      assistant:
        - assistant

val_set_size: 0.01
output_dir: ./outputs/sanskrit-chat

adapter: lora
sequence_len: 1024
pad_to_sequence_len: false

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

bf16: true
tf32: true

gradient_checkpointing: true
logging_steps: 1
flash_attention: true

warmup_ratio: 0.1
evals_per_epoch: 1
saves_per_epoch: 1
weight_decay: 0.0

hub_model_id: diabolic6045/Sanskrit-Qwen2.5-7B-chat-lora

Framework Versions

  • PEFT 0.17.0
  • Transformers 4.55.2
  • PyTorch 2.7.1+cu128
  • Datasets 4.0.0
  • Tokenizers 0.21.2
  • Axolotl 0.12.2

Limitations and Future Work

Current Limitations

  • Translation Accuracy: Exact match accuracy is low, indicating room for improvement in precise translation
  • Context Understanding: Limited ability to handle complex contextual nuances in translation
  • Domain Specificity: Performance may vary across different Sanskrit literature genres

Future Improvements

  • Larger Training Dataset: Expand with more diverse Sanskrit texts and translations
  • Advanced Training Techniques: Implement DPO (Direct Preference Optimization) for better alignment
  • Multimodal Capabilities: Add vision capabilities for Sanskrit manuscript processing
  • Specialized Fine-tuning: Domain-specific fine-tuning for different Sanskrit literature types

Citation

If you use this model in your research, please cite:

@misc{sanskrit-qwen2.5-7b-chat,
  title={Sanskrit Qwen2.5-7B Chat Model: A Specialized Language Model for Sanskrit Translation and Transliteration},
  author={[Your Name]},
  year={2024},
  url={https://huggingface.co/diabolic6045/Sanskrit-Qwen2.5-7B-chat}
}

License

This model is released under the Apache 2.0 License. See the LICENSE file for more details.

Acknowledgments

  • Base Model: Qwen/Qwen2.5-7B by Alibaba Cloud
  • Training Framework: Axolotl by OpenAccess AI Collective
  • Sanskrit Datasets: Various Sanskrit literature sources and translation corpora
  • Community: Sanskrit NLP research community for datasets and feedback