---
datasets:
- custom
library_name: fairseq
model-index:
- name: Malayalam to Hindi Translation (Fairseq)
  results:
  - task:
      name: Translation
      type: translation
    dataset:
      name: Custom Malayalam-Hindi Parallel Corpus
      type: translation
    metrics:
    - name: BLEU
      type: bleu
      value: 29.56
    - name: COMET
      type: comet
      value: 0.62
- name: Hindi to Malayalam Translation (Fairseq)
  results:
  - task:
      name: Translation
      type: translation
    dataset:
      name: Custom Malayalam-Hindi Parallel Corpus
      type: translation
    metrics:
    - name: BLEU
      type: bleu
      value: 11.08
    - name: COMET
      type: comet
      value: 0.76
---
# Malayalam ↔ Hindi Translation Model (Fairseq)

This is a **Neural Machine Translation (NMT)** model trained to translate between **Malayalam (ml)** and **Hindi (hi)** using the **Fairseq** framework. It was trained on a custom, curated low-resource parallel corpus.
## Model Architecture
- Framework: Fairseq (PyTorch)
- Architecture: Transformer
- Type: Sequence-to-sequence
- Layers: 6 encoder / 6 decoder
- Embedding size: 512
- FFN size: 2048
- Attention heads: 8
- Positional encoding: sinusoidal
- Tokenizer: SentencePiece (trained jointly on ml-hi)
- Vocabulary size: 32,000 (joint BPE)
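
The joint SentencePiece vocabulary described above can be reproduced with the `spm_train` CLI. A minimal sketch, assuming the Malayalam and Hindi training sides have been concatenated into `train.ml-hi.txt` (a placeholder filename, not from the original pipeline):

```bash
# Train a joint 32k BPE model over concatenated Malayalam + Hindi text.
# train.ml-hi.txt is a placeholder for your concatenated corpus.
spm_train \
  --input=train.ml-hi.txt \
  --model_prefix=spm \
  --model_type=bpe \
  --vocab_size=32000 \
  --character_coverage=1.0
```

This produces `spm.model` and `spm.vocab`; the same `spm.model` is needed again at inference time (see the Usage note below).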
## Training Details
| Setting | Value |
|----------------------|------------------------|
| Framework | Fairseq (0.12.2) |
| Training steps | 100k |
| Optimizer | Adam + inverse sqrt LR |
| Batch size | 4096 tokens |
| Max tokens | 4096 |
| Dropout | 0.3 |
| BLEU (test set) | 28.5 |
| Hardware | 1 x V100 32GB GPU |
| Training time | ~16 hours |
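
A training invocation consistent with the table above might look like the following sketch. The learning rate, warmup schedule, and directory paths are assumptions for illustration, not values recorded from the original run:

```bash
# Hypothetical fairseq-train call matching the reported hyperparameters:
# 6+6 transformer layers, 512-dim embeddings, 2048 FFN, 8 heads,
# Adam + inverse-sqrt schedule, dropout 0.3, 4096 max tokens, 100k updates.
fairseq-train data-bin \
  --arch transformer \
  --encoder-layers 6 --decoder-layers 6 \
  --encoder-embed-dim 512 --decoder-embed-dim 512 \
  --encoder-ffn-embed-dim 2048 --decoder-ffn-embed-dim 2048 \
  --encoder-attention-heads 8 --decoder-attention-heads 8 \
  --optimizer adam --lr-scheduler inverse_sqrt \
  --lr 5e-4 --warmup-updates 4000 \
  --dropout 0.3 \
  --max-tokens 4096 \
  --max-update 100000 \
  --save-dir checkpoints
```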
## Evaluation
The model was evaluated on a manually annotated Malayalam-Hindi test set consisting of 10,000 sentence pairs.
| Metric | hi-ml | ml-hi |
|--------|-------|-------|
| BLEU   | 11.08 | 29.56 |
| COMET  | 0.76  | 0.62  |
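
The card does not state which tooling produced these scores. Assuming detokenized hypothesis and reference files (`hyp.hi`, `ref.hi`, `src.ml` are placeholder names), comparable numbers can be obtained with the standard sacreBLEU and Unbabel COMET CLIs:

```bash
# BLEU with sacreBLEU (reference file as positional arg, hypotheses via -i)
sacrebleu ref.hi -i hyp.hi -m bleu

# COMET, reference-based: source, translation, and reference files
comet-score -s src.ml -t hyp.hi -r ref.hi
```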
## Usage

### In Fairseq (CLI)
```bash
fairseq-interactive /data-bin \
  --path checkpoint_best.pt \
  --task translation_multi_simple_epoch \
  --lang-pairs hi-ml,ml-hi \
  --source-lang <src_lang> \
  --target-lang <tgt_lang> \
  --batch-size 1 \
  --beam 10 \
  --remove-bpe \
  --lenpen 1.2 \
  --encoder-langtok src \
  --decoder-langtok \
  --skip-invalid-size-inputs-valid-test
```
### In Python (loading the checkpoint)

```python
from fairseq import checkpoint_utils

# Fairseq checkpoints store weights under the 'model' key, so a bare
# torch.load() + load_state_dict() call would need an already-built model.
# checkpoint_utils rebuilds the architecture from the checkpoint itself
# (it also needs the dict.*.txt files referenced by the checkpoint's task).
models, args = checkpoint_utils.load_model_ensemble(['checkpoint_best.pt'])
model = models[0]
model.eval()
```
**Note:** To use this model effectively, you need the SentencePiece model (`spm.model`) and the exact Fairseq dictionary files (`dict.ml.txt`, `dict.hi.txt`) used at training time.
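
Putting these pieces together, one plausible end-to-end inference pipeline (file names are illustrative) applies the SentencePiece model to raw input before decoding:

```bash
# Encode raw Malayalam input with the shared SentencePiece model, then
# decode ml->hi; fairseq-interactive prints hypotheses on "H-" lines.
spm_encode --model=spm.model < input.ml \
  | fairseq-interactive /data-bin \
      --path checkpoint_best.pt \
      --task translation_multi_simple_epoch \
      --lang-pairs hi-ml,ml-hi \
      --source-lang ml --target-lang hi \
      --encoder-langtok src --decoder-langtok \
      --beam 10 --remove-bpe sentencepiece
```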
## Dataset
This model was trained on a custom dataset compiled from:
- AI4Bharat OPUS Corpus
- Manually aligned Malayalam-Hindi sentences from news and educational data
- Crawled parallel content from Indian government websites (under open license)