---
# README.md for Hugging Face Model Card
datasets:
- custom
library_name: onmt
model-index:
- name: Hindi to Malayalam Translation
  results:
  - task:
      name: Translation
      type: translation
    dataset:
      name: Custom Hindi-Malayalam Parallel Corpus
      type: translation
    metrics:
    - name: BLEU
      type: bleu
      value: 11.07
    - name: COMET
      type: comet
      value: 0.832
---
# Hindi to Malayalam Translation Model (OpenNMT)

This is a Neural Machine Translation (NMT) model that translates Hindi (hi) to Malayalam (ml), built with the OpenNMT framework. It was trained on a custom-curated, low-resource parallel corpus.

## Model Architecture
- Framework: OpenNMT (PyTorch)
- Architecture: Transformer
- Type: Sequence-to-sequence
- Layers: 6 encoder / 6 decoder
- Embedding size: 512
- FFN size: 2048
- Attention heads: 8
- Positional encoding: sinusoidal
- Tokenizer: SentencePiece (trained jointly on hi-ml)
- Vocabulary size: 32,000 (joint BPE)
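
To illustrate the sinusoidal positional encoding listed above, here is a minimal sketch in plain Python. It assumes the standard formulation from the original Transformer paper and the embedding size of 512 given above. The function name `sinusoidal_encoding` is illustrative and not part of OpenNMT, whose own implementation may differ in detail.

```python
import math


def sinusoidal_encoding(position: int, dim: int = 512) -> list[float]:
    """Return the sinusoidal encoding vector for one token position."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    encoding = [0.0] * dim
    for i in range(0, dim, 2):
        # Each sine/cosine pair uses a different wavelength; wavelengths grow
        # geometrically with the dimension index.
        angle = position / (10000 ** (i / dim))
        encoding[i] = math.sin(angle)      # even indices: sine
        encoding[i + 1] = math.cos(angle)  # odd indices: cosine
    return encoding


first = sinusoidal_encoding(0)
print(len(first), first[0], first[1])
```

Because the encoding depends only on the position and the dimension index, it adds no trainable parameters to the model.
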
## Training Details
| Setting | Value |
|----------------------|------------------------|
| Framework | OpenNMT (3.5.1) |
| Training steps | 800k |
| Optimizer | Adam + inverse sqrt LR |
| Batch size | 8192 tokens |
| Max tokens | 8192 |
| Dropout | 0.3 |
| BLEU (test set) | 11.07 |
| Hardware | 1 x V100 32GB GPU |
| Training time | ~15 hours |
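
The "inverse sqrt LR" entry in the table can be sketched as follows. This is a minimal illustration of a Noam-style schedule, assuming a linear warmup followed by inverse square-root decay. The values `warmup_steps=8000` and `scale=2.0` are hypothetical placeholders, since this card does not state them; only the model dimension of 512 comes from the architecture listed above.

```python
def inverse_sqrt_lr(step: int, warmup_steps: int = 8000,
                    model_dim: int = 512, scale: float = 2.0) -> float:
    """Learning rate at a 1-indexed training step for a Noam-style schedule."""
    if step < 1:
        raise ValueError("step must be >= 1")
    # Linear warmup up to warmup_steps, then decay proportional to 1/sqrt(step).
    # warmup_steps and scale are assumed values, not taken from this model card.
    return scale * model_dim ** -0.5 * min(step ** -0.5,
                                           step * warmup_steps ** -1.5)


for step in (1, 8000, 100_000, 800_000):
    print(step, f"{inverse_sqrt_lr(step):.2e}")
```

The rate peaks at the end of warmup and then halves each time the step count quadruples.
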
## Evaluation
The model was evaluated on a manually annotated Hindi-Malayalam test set consisting of 10,000 sentence pairs.
| Metric | Score |
|--------|---------|
| BLEU | 11.07 |
| COMET | 0.832 |
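
For readers unfamiliar with BLEU, the sketch below shows how a corpus-level score is computed from clipped n-gram precisions and a brevity penalty. It is a simplified illustration, assuming whitespace tokenization, one reference per sentence and no smoothing. It is not the tool used to produce the scores above (the card does not name one), so its output is not comparable with the reported 11.07. COMET relies on a trained neural model and is not sketched here.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU on a 0-100 scale (illustrative only)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Clipping: an n-gram is credited at most as often as it
            # occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


print(corpus_bleu(["the cat sat on the mat"], ["the cat sat on the mat"]))
```
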
## Usage
### In CLI
```bash
onmt_translate \
  -model hi_ml_onmt.pt \
  -src input.txt \
  -output output.txt \
  -replace_unk \
  -verbose \
  -gpu -1 \
  -min_length 1
```
### In Python (Torch-based loading)
```python
import torch
import onmt.model_builder

# Path to your model checkpoint
model_path = "model.tm_best_checkpoint.pt"

# Load the checkpoint, mapping all tensors to the CPU
checkpoint = torch.load(model_path, map_location=torch.device("cpu"))

# Extract the model options and the vocabulary
model_opt = checkpoint['opt']
fields = checkpoint['vocab']

# Build the model using OpenNMT-py utilities.
# Note: the signature of build_base_model differs between OpenNMT-py releases.
# The positional form below (gpu=False, then the checkpoint) follows the older
# releases, so check it against your installed version.
model = onmt.model_builder.build_base_model(model_opt, fields, False, checkpoint)

# Set the model to eval mode (disables dropout)
model.eval()
```
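
If you prefer to drive the CLI from a script, the command above can be wrapped with Python's standard `subprocess` module. This is a minimal sketch that reuses only the flags shown in the CLI example. The helper names `build_translate_command` and `translate_file` are hypothetical, and it assumes `onmt_translate` is available on your `PATH`.

```python
import subprocess


def build_translate_command(model_path, src_path, output_path, gpu=-1):
    """Assemble the onmt_translate argument list from the CLI example."""
    return [
        "onmt_translate",
        "-model", model_path,
        "-src", src_path,
        "-output", output_path,
        "-replace_unk",
        "-verbose",
        "-gpu", str(gpu),  # -1 as in the CLI example above
        "-min_length", "1",
    ]


def translate_file(model_path, src_path, output_path, gpu=-1):
    """Run onmt_translate; raises CalledProcessError if the command fails."""
    command = build_translate_command(model_path, src_path, output_path, gpu)
    subprocess.run(command, check=True)


# Print the command without running it, to inspect what would be executed.
print(" ".join(build_translate_command("hi_ml_onmt.pt", "input.txt", "output.txt")))
```

Passing the arguments as a list avoids shell quoting problems with file paths that contain spaces.
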
## Dataset
This model was trained on a custom dataset compiled from:
- AI4Bharat OPUS Corpus
- Manually aligned Malayalam-Hindi sentences from news and educational data
- Crawled parallel content from Indian government websites (under open license)