
Bengali-Code LLM Training Pipeline

A comprehensive pipeline for training a Bengali language model specialized in code understanding and generation. The model is fine-tuned on Bengali programming tutorials, documentation, and code examples.

🌟 Features

  • Automated data collection from Bengali Wikipedia and Prothom Alo
  • Custom tokenizer training with SentencePiece for Bengali text and code
  • Model fine-tuning using TinyLlama base model
  • Comprehensive evaluation suite for Bengali code generation
  • GitHub Actions workflow for automated training
  • Weights & Biases integration for experiment tracking

📋 Requirements

  • Python 3.10 or higher
  • CUDA-capable GPU (recommended)
  • 16GB+ RAM
  • Internet connection for data collection

🚀 Quick Start

  1. Clone the repository:
git clone https://github.com/yourusername/bengali-code-llm.git
cd bengali-code-llm
  2. Install dependencies:
pip install -r requirements.txt
  3. Set up environment variables:
export HUGGINGFACE_TOKEN="your_token_here"
export WANDB_API_KEY="your_wandb_key_here"
  4. Run the complete pipeline:
# Collect data
python scripts/data_collector.py

# Train tokenizer
python scripts/tokenizer_trainer.py

# Train model
python scripts/model_trainer.py

# Evaluate model
python scripts/model_evaluator.py

πŸ—οΈ Pipeline Components

Data Collection (scripts/data_collector.py)

  • Scrapes Bengali text from Wikipedia and Prothom Alo
  • Implements rate limiting and error handling
  • Outputs processed data in JSON format
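
A minimal sketch of the rate-limited collection loop described above. The helper names, constants, and record layout are illustrative assumptions; see scripts/data_collector.py for the actual implementation.

import json
import time

import requests

RATE_LIMIT_SECONDS = 1.0  # pause between requests to stay polite to the sites

def fetch_page(url: str) -> str | None:
    """Fetch one page, returning None on network errors instead of crashing."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.RequestException as err:
        print(f"Skipping {url}: {err}")  # error handling: log and move on
        return None

def collect(urls: list[str], output_path: str) -> None:
    records = []
    for url in urls:
        html = fetch_page(url)
        if html is not None:
            records.append({"url": url, "text": html})
        time.sleep(RATE_LIMIT_SECONDS)  # simple rate limiting
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)  # keep Bengali text readable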

Tokenizer Training (scripts/tokenizer_trainer.py)

  • Uses SentencePiece for tokenizer training
  • Custom vocabulary with Bengali and code tokens
  • Generates HuggingFace-compatible tokenizer files
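
The training call looks roughly like this (vocabulary size, file paths, and the custom code symbols are illustrative assumptions, not the project's actual settings):

import sentencepiece as spm
from transformers import LlamaTokenizer

spm.SentencePieceTrainer.train(
    input="data/raw/corpus.txt",      # one sentence per line: Bengali text and code
    model_prefix="outputs/tokenizer/bengali_code",
    vocab_size=32000,
    model_type="bpe",
    character_coverage=1.0,           # full coverage for the Bengali script
    user_defined_symbols=["<code>", "</code>"],  # example custom code markers
)

# Wrap the SentencePiece model in a HuggingFace-compatible tokenizer
tokenizer = LlamaTokenizer(vocab_file="outputs/tokenizer/bengali_code.model")
tokenizer.save_pretrained("outputs/tokenizer")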

Model Training (scripts/model_trainer.py)

  • Fine-tunes TinyLlama model
  • Implements efficient training with gradient accumulation
  • Supports mixed precision training
  • Integrates with Weights & Biases for tracking
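
A sketch of the training setup. The hyperparameters and the exact TinyLlama checkpoint are illustrative assumptions, and train_dataset stands in for the tokenized corpus produced by the earlier steps:

from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer = AutoTokenizer.from_pretrained("outputs/tokenizer")

args = TrainingArguments(
    output_dir="outputs/model",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # effective batch size of 32 on one GPU
    fp16=True,                       # mixed precision (requires a CUDA GPU)
    num_train_epochs=3,
    logging_steps=50,
    report_to="wandb",               # stream metrics to Weights & Biases
)

train_dataset = ...  # the tokenized corpus from the data collection step

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
trainer.save_model("outputs/model")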

Model Evaluation (scripts/model_evaluator.py)

  • Comprehensive evaluation suite
  • Tests code generation capabilities
  • Measures BLEU and ROUGE scores
  • Generates detailed evaluation reports
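
For the metric side, the HuggingFace evaluate library computes BLEU and ROUGE directly; a toy example (the actual script's prompts and references will differ):

import evaluate

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

predictions = ["def add(a, b):\n    return a + b"]   # model output
references = ["def add(a, b):\n    return a + b"]    # gold solution

print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(rouge.compute(predictions=predictions, references=references))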

📊 Training Metrics

The training progress can be monitored through Weights & Biases:

  • Loss curves
  • Evaluation metrics
  • Generated samples
  • Resource utilization
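
With report_to="wandb" set in the training arguments, the trainer logs these automatically; additional values can also be logged by hand (the project name below is an assumption):

import wandb

run = wandb.init(project="bengali-code-llm")
wandb.log({"eval/bleu": 0.41, "eval/rouge_l": 0.55})  # example custom metrics
run.finish()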

🔄 GitHub Actions Workflow

The repository includes an automated training pipeline that:

  • Runs daily to incorporate new data
  • Executes the complete training pipeline
  • Uploads model artifacts
  • Can be triggered manually
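
The trigger section of .github/workflows/train_model.yml would look roughly like this (the cron schedule shown is an assumption, and the job steps are omitted):

on:
  schedule:
    - cron: "0 0 * * *"    # once a day at 00:00 UTC
  workflow_dispatch:       # allows manual runs from the Actions tab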

πŸ“ Directory Structure

bengali-code-llm/
├── .github/
│   └── workflows/
│       └── train_model.yml
├── scripts/
│   ├── data_collector.py
│   ├── tokenizer_trainer.py
│   ├── model_trainer.py
│   └── model_evaluator.py
├── data/
│   └── raw/
├── outputs/
│   ├── tokenizer/
│   ├── model/
│   └── evaluation/
├── requirements.txt
└── README.md

🎯 Model Performance

The model is evaluated on various tasks:

  • Code generation in Bengali
  • Code explanation and documentation
  • Error detection and correction
  • Algorithm explanation
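
A hedged example of exercising the first task, Bengali-prompted code generation, with the trained artifacts (paths, prompt, and generation settings are illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("outputs/model")
model = AutoModelForCausalLM.from_pretrained("outputs/model")

# "Write a Python function that checks whether a number is prime."
prompt = "একটি সংখ্যা মৌলিক কিনা পরীক্ষা করার পাইথন ফাংশন লিখুন।"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))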

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

Contributions are welcome! Please feel free to submit issues and pull requests.

📧 Contact

For questions and feedback, please open an issue in the repository.

πŸ™ Acknowledgments

  • TinyLlama team for the base model
  • HuggingFace for the Transformers library
  • Weights & Biases for experiment tracking