KinyCOMET — Translation Quality Estimation for Kinyarwanda ↔ English
Model Description
KinyCOMET is a neural translation quality estimation model designed specifically for Kinyarwanda ↔ English translation evaluation. It addresses a critical gap in the Rwandan NLP ecosystem, where BLEU scores correlate poorly with human judgment, by providing state-of-the-art automatic evaluation that aligns strongly with human assessments.
Why KinyCOMET Matters: Rwanda's thriving MT ecosystem includes companies like Digital Umuganda, KINLP, Awesomity, and Artemis AI, but the community faces significant evaluation challenges. Human evaluation is expensive and time-consuming, while BLEU scores don't capture translation quality effectively for morphologically rich languages like Kinyarwanda.
Key Features:
- Superior Correlation: Achieves 0.75 Pearson correlation with human judgments (vs. 0.30 for BLEU)
- Bidirectional Excellence: Optimized for both Kinyarwanda→English and English→Kinyarwanda
- Community-Driven: Trained on 4,323 human-annotated samples from 15 linguistics students
- Production-Ready: Direct Assessment scoring aligned with WMT evaluation standards
- Open Science: Fully open model, dataset, and training pipeline
Model Variants & Performance
| Variant | Base Model | Pearson | Spearman | Kendall's τ | MAE |
|---|---|---|---|---|---|
| KinyCOMET-Unbabel | Unbabel/wmt22-comet-da | 0.75 | 0.59 | 0.42 | 0.07 |
| KinyCOMET-XLM | XLM-RoBERTa-large | 0.73 | 0.50 | 0.35 | 0.07 |
| Unbabel (baseline) | wmt22-comet-da | 0.54 | 0.55 | 0.39 | 0.17 |
| AfriCOMET STL 1.1 | AfriCOMET base | 0.52 | 0.35 | 0.24 | 0.18 |
| BLEU | N/A | 0.30 | 0.34 | 0.23 | 0.62 |
| chrF | N/A | 0.38 | 0.30 | 0.21 | 0.34 |
State-of-the-Art Results: Both KinyCOMET variants significantly outperform existing baselines, with KinyCOMET-Unbabel achieving the highest correlation across all metrics.
Performance Highlights
Comprehensive Evaluation Results
Overall Performance (Both Directions)
- Pearson Correlation: 0.75 (KinyCOMET-Unbabel) vs 0.30 (BLEU) - 2.5x improvement
- Spearman Correlation: 0.59 vs 0.34 (BLEU) - 73% improvement
- Mean Absolute Error: 0.07 vs 0.62 (BLEU) - 89% reduction
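For reference, these correlation metrics can be computed from paired metric and human scores with scipy. A minimal sketch with illustrative arrays (not the actual evaluation data):

```python
# Sketch: computing the correlation metrics reported above from paired scores.
# The two arrays are illustrative stand-ins for real metric outputs and human DA scores.
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

metric_scores = np.array([0.92, 0.85, 0.71, 0.60, 0.88])  # metric outputs
human_scores = np.array([0.90, 0.80, 0.75, 0.55, 0.91])   # human DA scores

pearson_r, _ = pearsonr(metric_scores, human_scores)
spearman_rho, _ = spearmanr(metric_scores, human_scores)
kendall_tau, _ = kendalltau(metric_scores, human_scores)
mae = np.mean(np.abs(metric_scores - human_scores))

print(f"Pearson={pearson_r:.2f}  Spearman={spearman_rho:.2f}  "
      f"Kendall={kendall_tau:.2f}  MAE={mae:.2f}")
```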
Directional Analysis
| Direction | Model | Pearson | Spearman | Kendall's τ |
|---|---|---|---|---|
| English → Kinyarwanda | KinyCOMET-XLM | 0.76 | 0.52 | 0.37 |
| English → Kinyarwanda | KinyCOMET-Unbabel | 0.75 | 0.56 | 0.40 |
| Kinyarwanda → English | KinyCOMET-Unbabel | 0.63 | 0.47 | 0.33 |
| Kinyarwanda → English | KinyCOMET-XLM | 0.37 | 0.29 | 0.21 |
Key Insights:
- Correlations with human judgments are consistently higher for English→Kinyarwanda than for Kinyarwanda→English across all metrics
- Both KinyCOMET variants significantly outperform the AfriCOMET baselines, even though AfriCOMET's training data includes Kinyarwanda
- Surprising finding: the Unbabel baseline (not trained on Kinyarwanda) outperforms the AfriCOMET variants
Installation
Make sure you have Python ≥ 3.8 and install COMET via pip:
```bash
pip install unbabel-comet
```
You can verify the CLI tool is installed:
```bash
which comet-score
# should print something like: /usr/local/bin/comet-score
```
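You can also confirm that the Python package imports cleanly:

```python
# Quick sanity check: these are the two entry points used in the examples below.
from comet import download_model, load_from_checkpoint

print("COMET installed and importable")
```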
For more details on COMET, see the official documentation.
Usage
Load and Use the Model in Python
Here's a simple example to score translations directly in Python:
```python
from comet import download_model, load_from_checkpoint

# Download the public KinyCOMET checkpoint from the Hugging Face Hub,
# then load it (load_from_checkpoint expects a local checkpoint path)
model_path = download_model("chrismazii/kinycomet_unbabel")
model = load_from_checkpoint(model_path)

# Example translations: the first is correct, the second mistranslates the source
samples = [
    {
        "src": "Umugabo ararya.",
        "mt": "The man is eating.",
        "ref": "The man is eating."
    },
    {
        "src": "Umwana arasinzira.",
        "mt": "A dog sleeps.",
        "ref": "The child is sleeping."
    }
]

# Predict segment-level and system-level scores (gpus=0 runs on CPU)
pred = model.predict(samples, batch_size=8, gpus=0)
print(pred)
```
Output Example:
```
Prediction({
    'scores': [0.9899, 0.8813],
    'system_score': 0.9356
})
```
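If a GPU is available, pass `gpus=1` to `predict`; `batch_size` can be raised to speed up scoring of larger inputs. Segment-level scores are available as `pred.scores` and the corpus-level average as `pred.system_score`.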
Using the Command Line Interface (CLI)
You can also evaluate translations directly using the terminal.
Step 1: Create the text files
```bash
cat > source.txt <<'SRC'
Umugabo ararya.
Umwana arasinzira.
Uyu mwanya neza cyane.
SRC

cat > reference.txt <<'REF'
The man is eating.
The child is sleeping.
This place is very nice.
REF

cat > hypothesis.txt <<'HYP'
The man is eating.
A dog sleeps.
This place is very nice.
HYP
```
Step 2: Run KinyCOMET
```bash
comet-score -s source.txt -r reference.txt -t hypothesis.txt \
    --model chrismazii/kinycomet_unbabel --gpus 0 --to_json results.json
```
Step 3: View the results
```bash
cat results.json
```
Example Output:
```json
{
  "system_score": 0.9547,
  "segments": [
    {"src": "Umugabo ararya.", "mt": "The man is eating.", "ref": "The man is eating.", "score": 0.9899},
    {"src": "Umwana arasinzira.", "mt": "A dog sleeps.", "ref": "The child is sleeping.", "score": 0.8813},
    {"src": "Uyu mwanya neza cyane.", "mt": "This place is very nice.", "ref": "This place is very nice.", "score": 0.9927}
  ]
}
```
Score Interpretation
- Scores range from 0 to 1: Higher scores indicate better translation quality
- System score: Average quality across all translations
- Segment scores: Individual quality scores for each translation pair
- Threshold guidance: Scores above 0.8 typically indicate high-quality translations
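As a sketch of how this guidance might be applied in practice (the 0.8 cutoff follows the guidance above; the pairs and scores are illustrative, and in practice would come from `model.predict(...)` as in the Usage section):

```python
# Sketch: routing segments by score using the 0.8 guidance above.
# The pairs and scores below are illustrative placeholders.
HIGH_QUALITY_THRESHOLD = 0.8

pairs = ["The man is eating.", "A dog sleeps."]
scores = [0.9899, 0.7213]

accepted, flagged = [], []
for pair, score in zip(pairs, scores):
    (accepted if score >= HIGH_QUALITY_THRESHOLD else flagged).append((pair, score))

print(f"{len(accepted)} accepted, {len(flagged)} flagged for post-editing")
```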
Training Details
Model Architecture
- Base Models: XLM-RoBERTa-large and Unbabel/wmt22-comet-da
- Framework: COMET quality estimation framework
- Training Data: 4,323 human-annotated Kinyarwanda-English translation pairs
Training Configuration
- Methodology: COMET framework with Direct Assessment supervision
- Evaluation Metrics: Kendall's τ and Spearman ρ correlation with human DA scores
- Data Split: 80% train (3,497) / 10% validation (404) / 10% test (422)
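For illustration, a minimal sketch of an 80/10/10 split like the one described above (the file name and the pandas/scikit-learn tooling are assumptions, not the project's released pipeline):

```python
# Sketch: an 80/10/10 train/validation/test split over annotated pairs.
# The CSV file name is hypothetical; the published split sizes
# (3,497 / 404 / 422) come from the project's own partition.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("kinycomet_annotations.csv")  # hypothetical file

train_df, holdout_df = train_test_split(df, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(holdout_df, test_size=0.5, random_state=42)
print(len(train_df), len(val_df), len(test_df))
```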
MT System Benchmarking Results
Our evaluation of production MT systems with KinyCOMET yields the following system-level scores (mean ± standard deviation across segments):
| MT System | Kinyarwanda→English | English→Kinyarwanda | Overall |
|---|---|---|---|
| GPT-4o | 93.10% ± 7.77 | 87.83% ± 11.15 | 90.69% ± 9.82 |
| GPT-4.1 | 93.08% ± 6.62 | 87.92% ± 10.38 | 90.75% ± 8.90 |
| Gemini Flash 2.0 | 91.46% ± 11.39 | 90.02% ± 8.92 | 90.80% ± 10.35 |
| Claude 3.7 | 92.48% ± 8.32 | 85.75% ± 11.28 | 89.43% ± 10.33 |
| NLLB-1.3B | 89.42% ± 12.04 | 83.96% ± 16.31 | 86.78% ± 14.52 |
| NLLB-600M | 88.87% ± 12.11 | 75.46% ± 28.49 | 82.71% ± 22.27 |
Key Findings:
- LLM-based systems significantly outperform traditional neural MT
- All systems perform better on Kinyarwanda→English than English→Kinyarwanda
- Score differences between the top systems are small; KinyCOMET's low mean absolute error makes such fine-grained comparisons more trustworthy than BLEU-based ones
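A sketch of how such a benchmark might be assembled with KinyCOMET (the systems and their outputs below are illustrative; the table above was not produced by this snippet):

```python
# Sketch: scoring several MT systems with one KinyCOMET model and
# aggregating mean ± std per system, as in the table above.
import numpy as np
from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("chrismazii/kinycomet_unbabel"))

# Illustrative placeholder outputs for two hypothetical systems
systems = {
    "system_a": [
        {"src": "Umugabo ararya.", "mt": "The man is eating.", "ref": "The man is eating."},
    ],
    "system_b": [
        {"src": "Umugabo ararya.", "mt": "A man eat.", "ref": "The man is eating."},
    ],
}

for name, samples in systems.items():
    pred = model.predict(samples, batch_size=8, gpus=0)
    scores = np.array(pred.scores)
    print(f"{name}: {100 * scores.mean():.2f}% ± {100 * scores.std():.2f}")
```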
Real-World Impact & Applications
Addressing Rwanda's NLP Ecosystem Needs
KinyCOMET directly addresses pain points identified by the Rwandan MT community:
Before KinyCOMET:
- BLEU scores poorly correlate with human judgment for Kinyarwanda
- Expensive, time-consuming human evaluation required
- No reliable automatic metrics for morphologically rich Kinyarwanda
With KinyCOMET:
- 2.5x better correlation with human judgments than BLEU
- Instant evaluation for production MT systems
- Cost-effective alternative to human annotation
- Specialized for Kinyarwanda morphological complexity
Production Use Cases
For MT Companies (Digital Umuganda, KINLP, Awesomity, Artemis AI):
- Real-time translation quality monitoring
- A/B testing of model improvements
- Quality gates for production deployments (a sketch follows these lists)
For Researchers & Developers:
- Benchmark new Kinyarwanda MT models
- Dataset quality assessment
- Cross-lingual transfer learning evaluation
For Content & Localization:
- Prioritize post-editing efforts
- Quality assurance workflows
- User confidence scoring
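As one concrete pattern for the quality-gate use case above, a deployment check might look like the following sketch (the 0.85 threshold and file names are illustrative assumptions, not project defaults):

```python
# Sketch: a CI-style quality gate that fails when the system-level
# KinyCOMET score drops below a chosen threshold.
import sys
from comet import download_model, load_from_checkpoint

GATE_THRESHOLD = 0.85  # illustrative cutoff, tune per deployment

def main() -> int:
    model = load_from_checkpoint(download_model("chrismazii/kinycomet_unbabel"))
    with open("source.txt") as src, open("hypothesis.txt") as hyp, open("reference.txt") as ref:
        samples = [
            {"src": s.strip(), "mt": m.strip(), "ref": r.strip()}
            for s, m, r in zip(src, hyp, ref)
        ]
    pred = model.predict(samples, batch_size=8, gpus=0)
    print(f"system_score = {pred.system_score:.4f}")
    return 0 if pred.system_score >= GATE_THRESHOLD else 1  # nonzero exit fails the gate

if __name__ == "__main__":
    sys.exit(main())
```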
Limitations & Considerations
- Domain Specificity: Trained on education and tourism domains; may not generalize to all content types
- Language Variants: Optimized for standard Kinyarwanda; dialectal variations may affect performance
- Resource Requirements: Requires the COMET library and substantial computational resources
- Score Interpretation: Scores are relative to training data distribution
- Reference Dependency: Best performance achieved with reference translations
Dataset Access
The training dataset is available separately. See the KinyCOMET Dataset Card for details on accessing the human-annotated quality estimation data.
Citation & Research
If you use KinyCOMET in your research, please cite:
```bibtex
@misc{kinycomet2025,
  title={KinyCOMET: Translation Quality Estimation for Kinyarwanda-English},
  author={Prince Chris Mazimpaka and Jan Nehring},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/chrismazii/kinycomet_unbabel}}
}
```
Contributing to African NLP
KinyCOMET contributes to the growing ecosystem of African language NLP tools. We encourage:
- Community Feedback: Report issues and suggest improvements
- Extension Work: Adapt for other African languages
- Dataset Contributions: Share additional Kinyarwanda evaluation data
- Collaborative Research: Partner on African language translation quality research
License
This model is released under the Apache 2.0 License.
Acknowledgments
- COMET Framework: Built on the excellent COMET quality estimation framework
- Base Models: Leverages XLM-RoBERTa and Unbabel's WMT22 COMET-DA models
- African NLP Community: Inspired by ongoing efforts to advance African language technologies
- Contributors: Thanks to the 15 linguistics students and all researchers who made this work possible
