
Bengali-Code LLM Training Pipeline

A comprehensive pipeline for training a Bengali language model specialized in code understanding and generation. The model is fine-tuned on Bengali programming tutorials, documentation, and code examples.

🌟 Features

  • Automated data collection from Bengali Wikipedia and Prothom Alo
  • Custom tokenizer training with SentencePiece for Bengali text and code
  • Model fine-tuning using TinyLlama base model
  • Comprehensive evaluation suite for Bengali code generation
  • GitHub Actions workflow for automated training
  • Weights & Biases integration for experiment tracking

📋 Requirements

  • Python 3.10 or higher
  • CUDA-capable GPU (recommended)
  • 16GB+ RAM
  • Internet connection for data collection

🚀 Quick Start

  1. Clone the repository:
git clone https://github.com/yourusername/bengali-code-llm.git
cd bengali-code-llm
  2. Install dependencies:
pip install -r requirements.txt
  3. Set up environment variables:
export HUGGINGFACE_TOKEN="your_token_here"
export WANDB_API_KEY="your_wandb_key_here"
  4. Run the complete pipeline:
# Collect data
python scripts/data_collector.py

# Train tokenizer
python scripts/tokenizer_trainer.py

# Train model
python scripts/model_trainer.py

# Evaluate model
python scripts/model_evaluator.py

πŸ—οΈ Pipeline Components

Data Collection (scripts/data_collector.py)

  • Scrapes Bengali text from Wikipedia and Prothom Alo
  • Implements rate limiting and error handling
  • Outputs processed data in JSON format
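
A minimal sketch of the rate-limited collection loop described above. The helper names, constants, and record layout are illustrative assumptions; see scripts/data_collector.py for the actual implementation.

import json
import time

import requests

RATE_LIMIT_SECONDS = 1.0  # pause between requests to stay polite to the sites

def fetch_page(url: str) -> str | None:
    """Fetch one page, returning None on network errors instead of crashing."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.RequestException as err:
        print(f"Skipping {url}: {err}")  # error handling: log and move on
        return None

def collect(urls: list[str], output_path: str) -> None:
    records = []
    for url in urls:
        html = fetch_page(url)
        if html is not None:
            records.append({"url": url, "text": html})
        time.sleep(RATE_LIMIT_SECONDS)  # simple rate limiting
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)  # keep Bengali text readable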

Tokenizer Training (scripts/tokenizer_trainer.py)

  • Uses SentencePiece for tokenizer training
  • Custom vocabulary with Bengali and code tokens
  • Generates HuggingFace-compatible tokenizer files
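
The training call looks roughly like this (vocabulary size, file paths, and the custom code symbols are illustrative assumptions, not the project's actual settings):

import sentencepiece as spm
from transformers import LlamaTokenizer

spm.SentencePieceTrainer.train(
    input="data/raw/corpus.txt",      # one sentence per line: Bengali text and code
    model_prefix="outputs/tokenizer/bengali_code",
    vocab_size=32000,
    model_type="bpe",
    character_coverage=1.0,           # full coverage for the Bengali script
    user_defined_symbols=["<code>", "</code>"],  # example custom code markers
)

# Wrap the SentencePiece model in a HuggingFace-compatible tokenizer
tokenizer = LlamaTokenizer(vocab_file="outputs/tokenizer/bengali_code.model")
tokenizer.save_pretrained("outputs/tokenizer")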

Model Training (scripts/model_trainer.py)

  • Fine-tunes TinyLlama model
  • Implements efficient training with gradient accumulation
  • Supports mixed precision training
  • Integrates with Weights & Biases for tracking
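
A sketch of the training setup. The hyperparameters and the exact TinyLlama checkpoint are illustrative assumptions, and train_dataset stands in for the tokenized corpus produced by the earlier steps:

from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer = AutoTokenizer.from_pretrained("outputs/tokenizer")

args = TrainingArguments(
    output_dir="outputs/model",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # effective batch size of 32 on one GPU
    fp16=True,                       # mixed precision (requires a CUDA GPU)
    num_train_epochs=3,
    logging_steps=50,
    report_to="wandb",               # stream metrics to Weights & Biases
)

train_dataset = ...  # the tokenized corpus from the data collection step

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
trainer.save_model("outputs/model")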

Model Evaluation (scripts/model_evaluator.py)

  • Comprehensive evaluation suite
  • Tests code generation capabilities
  • Measures BLEU and ROUGE scores
  • Generates detailed evaluation reports
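
For the metric side, the HuggingFace evaluate library computes BLEU and ROUGE directly; a toy example (the actual script's prompts and references will differ):

import evaluate

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

predictions = ["def add(a, b):\n    return a + b"]   # model output
references = ["def add(a, b):\n    return a + b"]    # gold solution

print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(rouge.compute(predictions=predictions, references=references))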

📊 Training Metrics

The training progress can be monitored through Weights & Biases:

  • Loss curves
  • Evaluation metrics
  • Generated samples
  • Resource utilization
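
With report_to="wandb" set in the training arguments, the trainer logs these automatically; additional values can also be logged by hand (the project name below is an assumption):

import wandb

run = wandb.init(project="bengali-code-llm")
wandb.log({"eval/bleu": 0.41, "eval/rouge_l": 0.55})  # example custom metrics
run.finish()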

🔄 GitHub Actions Workflow

The repository includes an automated training pipeline that:

  • Runs daily to incorporate new data
  • Executes the complete training pipeline
  • Uploads model artifacts
  • Can be triggered manually
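
The trigger section of .github/workflows/train_model.yml would look roughly like this (the cron schedule shown is an assumption, and the job steps are omitted):

on:
  schedule:
    - cron: "0 0 * * *"    # once a day at 00:00 UTC
  workflow_dispatch:       # allows manual runs from the Actions tab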

πŸ“ Directory Structure

bengali-code-llm/
├── .github/
│   └── workflows/
│       └── train_model.yml
├── scripts/
│   ├── data_collector.py
│   ├── tokenizer_trainer.py
│   ├── model_trainer.py
│   └── model_evaluator.py
├── data/
│   └── raw/
├── outputs/
│   ├── tokenizer/
│   ├── model/
│   └── evaluation/
├── requirements.txt
└── README.md

🎯 Model Performance

The model is evaluated on various tasks:

  • Code generation in Bengali
  • Code explanation and documentation
  • Error detection and correction
  • Algorithm explanation
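
A hedged example of exercising the first task, Bengali-prompted code generation, with the trained artifacts (paths, prompt, and generation settings are illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("outputs/model")
model = AutoModelForCausalLM.from_pretrained("outputs/model")

# "Write a Python function that checks whether a number is prime."
prompt = "একটি সংখ্যা মৌলিক কিনা পরীক্ষা করার পাইথন ফাংশন লিখুন।"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))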

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

Contributions are welcome! Please feel free to submit issues and pull requests.

📧 Contact

For questions and feedback, please open an issue in the repository.

πŸ™ Acknowledgments

  • TinyLlama team for the base model
  • HuggingFace for the Transformers library
  • Weights & Biases for experiment tracking