Bengali-Code LLM Training Pipeline
A comprehensive pipeline for training a Bengali language model specialized in code understanding and generation. The model is fine-tuned on Bengali programming tutorials, documentation, and code examples.
Features
- Automated data collection from Bengali Wikipedia and Prothom Alo
- Custom tokenizer training with SentencePiece for Bengali text and code
- Model fine-tuning from the TinyLlama base model
- Comprehensive evaluation suite for Bengali code generation
- GitHub Actions workflow for automated training
- Weights & Biases integration for experiment tracking
Requirements
- Python 3.10 or higher
- CUDA-capable GPU (recommended)
- 16GB+ RAM
- Internet connection for data collection
Quick Start
- Clone the repository:
    git clone https://github.com/yourusername/bengali-code-llm.git
    cd bengali-code-llm
- Install dependencies:
    pip install -r requirements.txt
- Set up environment variables:
    export HUGGINGFACE_TOKEN="your_token_here"
    export WANDB_API_KEY="your_wandb_key_here"
- Run the complete pipeline:
    # Collect data
    python scripts/data_collector.py
    # Train tokenizer
    python scripts/tokenizer_trainer.py
    # Train model
    python scripts/model_trainer.py
    # Evaluate model
    python scripts/model_evaluator.py
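The four pipeline commands can also be driven from a single script. The following is a minimal sketch, assuming each script takes no arguments and reports failure through a non-zero exit code. `missing_env_vars` and `run_pipeline` are hypothetical helper names and are not part of the repository.

```python
import os
import subprocess
import sys

# Script paths and variable names are taken from the Quick Start steps.
STAGES = [
    "scripts/data_collector.py",
    "scripts/tokenizer_trainer.py",
    "scripts/model_trainer.py",
    "scripts/model_evaluator.py",
]
REQUIRED_ENV = ["HUGGINGFACE_TOKEN", "WANDB_API_KEY"]


def missing_env_vars(env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV if not env.get(name)]


def run_pipeline(stages=STAGES):
    """Run each stage in order and stop at the first failure."""
    missing = missing_env_vars()
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    for script in stages:
        print(f"Running {script}")
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run([sys.executable, script], check=True)
```

Calling `run_pipeline()` from the repository root runs the stages in the same order as the commands listed above. Stopping at the first failure avoids training on incomplete data.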
Pipeline Components
Data Collection (scripts/data_collector.py)
- Scrapes Bengali text from Wikipedia and Prothom Alo
- Implements rate limiting and error handling
- Outputs processed data in JSON format
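Rate limiting, error handling, and JSON output could fit together as in the sketch below. It is not the code in `scripts/data_collector.py`. The `fetch` callable, the delay and retry values, and all function names are assumptions made for illustration. Passing `fetch` as a parameter keeps the sketch independent of any particular HTTP library.

```python
import json
import time


def fetch_with_retry(fetch, url, retries=3, backoff=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff on any exception."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: let the caller decide what to do
            sleep(backoff * 2 ** attempt)


def collect(urls, fetch, delay=1.0, sleep=time.sleep):
    """Fetch each URL with a fixed pause between requests; skip failures."""
    records = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # simple rate limit between consecutive URLs
        try:
            text = fetch_with_retry(fetch, url, sleep=sleep)
        except Exception as exc:
            print(f"Skipping {url}: {exc}")
            continue
        records.append({"url": url, "text": text})
    return records


def save_json(records, path):
    """Write records as UTF-8 JSON."""
    # ensure_ascii=False keeps Bengali text readable in the output file.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
```

A real collector would pass a function that performs the HTTP request and extracts the article text. A failed page is logged and skipped, so one bad URL does not abort the whole run.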
Tokenizer Training (scripts/tokenizer_trainer.py)
- Uses SentencePiece for tokenizer training
- Custom vocabulary with Bengali and code tokens
- Generates HuggingFace-compatible tokenizer files
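A SentencePiece training call for mixed Bengali text and code might look like the sketch below. The vocabulary size, model type, character coverage, file paths, and the `<code>` / `</code>` marker tokens are illustrative assumptions, not the settings used by `scripts/tokenizer_trainer.py`. The option names are SentencePiece trainer flags.

```python
def tokenizer_options(corpus_path, model_prefix, vocab_size=32000,
                      code_tokens=("<code>", "</code>")):
    """Build keyword arguments for sentencepiece.SentencePieceTrainer.train."""
    return {
        "input": corpus_path,            # plain-text corpus, one sentence per line
        "model_prefix": model_prefix,    # writes <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,
        "model_type": "bpe",
        "character_coverage": 0.9995,    # assumed value; tune for the corpus
        "user_defined_symbols": list(code_tokens),  # always kept as single tokens
        "byte_fallback": True,           # unseen characters fall back to bytes
    }


def train_tokenizer(options):
    """Train the tokenizer; requires the `sentencepiece` package."""
    import sentencepiece as spm  # imported lazily so the options can be inspected without it

    spm.SentencePieceTrainer.train(**options)
```

For example, `train_tokenizer(tokenizer_options("data/raw/corpus.txt", "outputs/tokenizer/bn_code"))` would write the model files under `outputs/tokenizer/`, assuming a corpus file exists at that hypothetical path.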
Model Training (scripts/model_trainer.py)
- Fine-tunes the TinyLlama base model
- Implements efficient training with gradient accumulation
- Supports mixed precision training
- Integrates with Weights & Biases for tracking
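Gradient accumulation, mixed precision, and Weights & Biases logging are all exposed as fields of `transformers.TrainingArguments`. The sketch below shows one possible configuration. The batch size and accumulation steps are assumed values, and the helper names are hypothetical. The actual settings in `scripts/model_trainer.py` may differ.

```python
def training_options(output_dir, per_device_batch=4, accumulation_steps=8):
    """Keyword arguments named after transformers.TrainingArguments fields."""
    return {
        "output_dir": output_dir,
        "per_device_train_batch_size": per_device_batch,
        "gradient_accumulation_steps": accumulation_steps,
        "fp16": True,          # mixed precision; needs a CUDA GPU
        "report_to": "wandb",  # send metrics to Weights & Biases
    }


def effective_batch_size(options, num_devices=1):
    """Number of sequences contributing to each optimizer step."""
    return (
        options["per_device_train_batch_size"]
        * options["gradient_accumulation_steps"]
        * num_devices
    )


def build_training_arguments(options):
    """Create the TrainingArguments object; requires `transformers`."""
    from transformers import TrainingArguments  # lazy import

    return TrainingArguments(**options)
```

With the assumed defaults, each optimizer step sees 4 × 8 = 32 sequences on a single GPU, while only 4 sequences are held in memory at a time.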
Model Evaluation (scripts/model_evaluator.py)
- Comprehensive evaluation suite
- Tests code generation capabilities
- Measures BLEU and ROUGE scores
- Generates detailed evaluation reports
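To make the ROUGE metric concrete, here is a simplified ROUGE-L F1 based on the longest common subsequence of whitespace-separated tokens. It is an illustrative sketch, not the evaluator's implementation. A real evaluation would more likely use an established metrics library and a tokenizer suited to Bengali.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 over whitespace-separated tokens."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens found in order
    recall = lcs / len(ref)      # share of reference tokens recovered in order
    return 2 * precision * recall / (precision + recall)
```

An exact match scores 1.0, and a candidate that shares no tokens with the reference scores 0.0. Overlap metrics such as this one reward surface similarity only, so they do not show whether generated code actually runs.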
Training Metrics
Training progress can be monitored in Weights & Biases, which tracks:
- Loss curves
- Evaluation metrics
- Generated samples
- Resource utilization
GitHub Actions Workflow
The repository includes an automated training pipeline that:
- Runs daily to incorporate new data
- Executes the complete training pipeline
- Uploads model artifacts
- Can be triggered manually
Directory Structure
bengali-code-llm/
├── .github/
│   └── workflows/
│       └── train_model.yml
├── scripts/
│   ├── data_collector.py
│   ├── tokenizer_trainer.py
│   ├── model_trainer.py
│   └── model_evaluator.py
├── data/
│   └── raw/
├── outputs/
│   ├── tokenizer/
│   ├── model/
│   └── evaluation/
├── requirements.txt
└── README.md
Model Performance
The model is evaluated on the following tasks:
- Code generation in Bengali
- Code explanation and documentation
- Error detection and correction
- Algorithm explanation
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing
Contributions are welcome! Please feel free to submit issues and pull requests.
Contact
For questions and feedback, please open an issue in the repository.
Acknowledgments
- TinyLlama team for the base model
- HuggingFace for the Transformers library
- Weights & Biases for experiment tracking