---
datasets:
- custom
library_name: fairseq
model-index:
- name: Malayalam to Hindi Translation (Fairseq)
  results:
  - task:
      name: Translation
      type: translation
    dataset:
      name: Custom Malayalam-Hindi Parallel Corpus
      type: translation
    metrics:
    - name: BLEU
      type: bleu
      value: 29.56
    - name: COMET
      type: comet
      value: 0.62
- name: Hindi to Malayalam Translation (Fairseq)
  results:
  - task:
      name: Translation
      type: translation
    dataset:
      name: Custom Malayalam-Hindi Parallel Corpus
      type: translation
    metrics:
    - name: BLEU
      type: bleu
      value: 11.08
    - name: COMET
      type: comet
      value: 0.76
---
# Malayalam ↔ Hindi Translation Model (Fairseq)

This is a **Neural Machine Translation (NMT)** model trained to translate between **Malayalam (ml)** and **Hindi (hi)** using the **Fairseq** framework. It was trained on a custom, curated low-resource parallel corpus.
## Model Architecture
- Framework: Fairseq (PyTorch)
- Architecture: Transformer
- Type: Sequence-to-sequence
- Layers: 6 encoder / 6 decoder
- Embedding size: 512
- FFN size: 2048
- Attention heads: 8
- Positional encoding: sinusoidal
- Tokenizer: SentencePiece (trained jointly on ml-hi)
- Vocabulary size: 32,000 (joint BPE)
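
The joint SentencePiece vocabulary described above can be reproduced with the `spm_train` CLI. A minimal sketch, assuming the Malayalam and Hindi training sides have been concatenated into `train.ml-hi.txt` (a placeholder filename, not from the original pipeline):

```bash
# Train a joint 32k BPE model over concatenated Malayalam + Hindi text.
# train.ml-hi.txt is a placeholder for your concatenated corpus.
spm_train \
  --input=train.ml-hi.txt \
  --model_prefix=spm \
  --model_type=bpe \
  --vocab_size=32000 \
  --character_coverage=1.0
```

This produces `spm.model` and `spm.vocab`; the same `spm.model` is needed again at inference time (see the Usage note below).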
## Training Details
| Setting | Value |
|----------------------|------------------------|
| Framework | Fairseq (0.12.2) |
| Training steps | 100k |
| Optimizer | Adam + inverse sqrt LR |
| Batch size | 4096 tokens |
| Max tokens | 4096 |
| Dropout | 0.3 |
| BLEU (test set) | 28.5 |
| Hardware | 1 x V100 32GB GPU |
| Training time | ~16 hours |
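
A training invocation consistent with the table above might look like the following sketch. The learning rate, warmup schedule, and directory paths are assumptions for illustration, not values recorded from the original run:

```bash
# Hypothetical fairseq-train call matching the reported hyperparameters:
# 6+6 transformer layers, 512-dim embeddings, 2048 FFN, 8 heads,
# Adam + inverse-sqrt schedule, dropout 0.3, 4096 max tokens, 100k updates.
fairseq-train data-bin \
  --arch transformer \
  --encoder-layers 6 --decoder-layers 6 \
  --encoder-embed-dim 512 --decoder-embed-dim 512 \
  --encoder-ffn-embed-dim 2048 --decoder-ffn-embed-dim 2048 \
  --encoder-attention-heads 8 --decoder-attention-heads 8 \
  --optimizer adam --lr-scheduler inverse_sqrt \
  --lr 5e-4 --warmup-updates 4000 \
  --dropout 0.3 \
  --max-tokens 4096 \
  --max-update 100000 \
  --save-dir checkpoints
```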
## Evaluation
The model was evaluated on a manually annotated Malayalam-Hindi test set consisting of 10,000 sentence pairs.
| Metric | hi-ml | ml-hi |
|--------|-------|-------|
| BLEU   | 11.08 | 29.56 |
| COMET  | 0.76  | 0.62  |
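
The card does not state which tooling produced these scores. Assuming detokenized hypothesis and reference files (`hyp.hi`, `ref.hi`, `src.ml` are placeholder names), comparable numbers can be obtained with the standard sacreBLEU and Unbabel COMET CLIs:

```bash
# BLEU with sacreBLEU (reference file as positional arg, hypotheses via -i)
sacrebleu ref.hi -i hyp.hi -m bleu

# COMET, reference-based: source, translation, and reference files
comet-score -s src.ml -t hyp.hi -r ref.hi
```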
## Usage

### In Fairseq (CLI)
```bash
fairseq-interactive /data-bin \
  --path checkpoint_best.pt \
  --task translation_multi_simple_epoch \
  --lang-pairs hi-ml,ml-hi \
  --source-lang <src_lang> \
  --target-lang <tgt_lang> \
  --batch-size 1 \
  --beam 10 \
  --remove-bpe \
  --lenpen 1.2 \
  --encoder-langtok src \
  --decoder-langtok \
  --skip-invalid-size-inputs-valid-test
```
### In Python (loading the checkpoint)

```python
from fairseq import checkpoint_utils

# Fairseq checkpoints store weights under the 'model' key, so a bare
# torch.load() + load_state_dict() call would need an already-built model.
# checkpoint_utils rebuilds the architecture from the checkpoint itself
# (it also needs the dict.*.txt files referenced by the checkpoint's task).
models, args = checkpoint_utils.load_model_ensemble(['checkpoint_best.pt'])
model = models[0]
model.eval()
```
**Note:** To use this model effectively, you need the SentencePiece model (`spm.model`) and the exact Fairseq dictionary files (`dict.ml.txt`, `dict.hi.txt`) used at training time.
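
Putting these pieces together, one plausible end-to-end inference pipeline (file names are illustrative) applies the SentencePiece model to raw input before decoding:

```bash
# Encode raw Malayalam input with the shared SentencePiece model, then
# decode ml->hi; fairseq-interactive prints hypotheses on "H-" lines.
spm_encode --model=spm.model < input.ml \
  | fairseq-interactive /data-bin \
      --path checkpoint_best.pt \
      --task translation_multi_simple_epoch \
      --lang-pairs hi-ml,ml-hi \
      --source-lang ml --target-lang hi \
      --encoder-langtok src --decoder-langtok \
      --beam 10 --remove-bpe sentencepiece
```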
## Dataset
This model was trained on a custom dataset compiled from:
- AI4Bharat OPUS Corpus
- Manually aligned Malayalam-Hindi sentences from news and educational data
- Crawled parallel content from Indian government websites (under open license)