KinyCOMET — Translation Quality Estimation for Kinyarwanda ↔ English
Model Description
KinyCOMET is a neural translation quality estimation model designed specifically for Kinyarwanda ↔ English translation evaluation. It addresses a critical gap in the Rwandan NLP ecosystem, where BLEU scores correlate poorly with human judgment, by providing state-of-the-art automatic evaluation that aligns strongly with human assessments.
Why KinyCOMET Matters: Rwanda's thriving MT ecosystem includes companies like Digital Umuganda, KINLP, Awesomity, and Artemis AI, but the community faces significant evaluation challenges. Human evaluation is expensive and time-consuming, while BLEU scores don't capture translation quality effectively for morphologically rich languages like Kinyarwanda.
Key Features:
- Superior Correlation: Achieves 0.75 Pearson correlation with human judgments (vs. 0.30 for BLEU)
- Bidirectional Excellence: Optimized for both Kinyarwanda→English and English→Kinyarwanda
- Community-Driven: Trained on 4,323 human-annotated samples from 15 linguistics students
- Production-Ready: Direct Assessment scoring aligned with WMT evaluation standards
- Open Science: Fully open model, dataset, and training pipeline
Model Variants & Performance
| Variant | Base Model | Pearson | Spearman | Kendall's τ | MAE |
|---|---|---|---|---|---|
| KinyCOMET-Unbabel | Unbabel/wmt22-comet-da | 0.75 | 0.59 | 0.42 | 0.07 |
| KinyCOMET-XLM | XLM-RoBERTa-large | 0.73 | 0.50 | 0.35 | 0.07 |
| Unbabel (baseline) | wmt22-comet-da | 0.54 | 0.55 | 0.39 | 0.17 |
| AfriCOMET STL 1.1 | AfriCOMET base | 0.52 | 0.35 | 0.24 | 0.18 |
| BLEU | N/A | 0.30 | 0.34 | 0.23 | 0.62 |
| chrF | N/A | 0.38 | 0.30 | 0.21 | 0.34 |
State-of-the-Art Results: Both KinyCOMET variants significantly outperform existing baselines, with KinyCOMET-Unbabel achieving the highest correlation across all metrics.
Performance Highlights
Comprehensive Evaluation Results
Overall Performance (Both Directions)
- Pearson Correlation: 0.75 (KinyCOMET-Unbabel) vs 0.30 (BLEU) - 2.5x improvement
- Spearman Correlation: 0.59 vs 0.34 (BLEU) - 73% improvement
- Mean Absolute Error: 0.07 vs 0.62 (BLEU) - 89% reduction
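For reference, these correlation metrics can be computed from paired metric and human scores with scipy. A minimal sketch with illustrative arrays (not the actual evaluation data):

```python
# Sketch: computing the correlation metrics reported above from paired scores.
# The two arrays are illustrative stand-ins for real metric outputs and human DA scores.
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

metric_scores = np.array([0.92, 0.85, 0.71, 0.60, 0.88])  # metric outputs
human_scores = np.array([0.90, 0.80, 0.75, 0.55, 0.91])   # human DA scores

pearson_r, _ = pearsonr(metric_scores, human_scores)
spearman_rho, _ = spearmanr(metric_scores, human_scores)
kendall_tau, _ = kendalltau(metric_scores, human_scores)
mae = np.mean(np.abs(metric_scores - human_scores))

print(f"Pearson={pearson_r:.2f}  Spearman={spearman_rho:.2f}  "
      f"Kendall={kendall_tau:.2f}  MAE={mae:.2f}")
```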
Directional Analysis
| Direction | Model | Pearson | Spearman | Kendall's τ |
|---|---|---|---|---|
| English → Kinyarwanda | KinyCOMET-XLM | 0.76 | 0.52 | 0.37 |
| English → Kinyarwanda | KinyCOMET-Unbabel | 0.75 | 0.56 | 0.40 |
| Kinyarwanda → English | KinyCOMET-Unbabel | 0.63 | 0.47 | 0.33 |
| Kinyarwanda → English | KinyCOMET-XLM | 0.37 | 0.29 | 0.21 |
Key Insights:
- Correlations with human judgments are consistently higher for English→Kinyarwanda than for Kinyarwanda→English across all metrics
- Both KinyCOMET variants significantly outperform the AfriCOMET baselines, even though AfriCOMET's training data includes Kinyarwanda
- Surprising finding: the Unbabel baseline (not trained on Kinyarwanda) outperforms the AfriCOMET variants
Installation
Make sure you have Python ≥ 3.8 and install COMET via pip:
```bash
pip install unbabel-comet
```
You can verify the CLI tool is installed:
```bash
which comet-score
# should print something like: /usr/local/bin/comet-score
```
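You can also confirm that the Python package imports cleanly:

```python
# Quick sanity check: these are the two entry points used in the examples below.
from comet import download_model, load_from_checkpoint

print("COMET installed and importable")
```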
For more details on COMET, see the official documentation.
Usage
Load and Use the Model in Python
Here's a simple example to score translations directly in Python:
```python
from comet import download_model, load_from_checkpoint

# Download the public KinyCOMET checkpoint from the Hugging Face Hub,
# then load it (load_from_checkpoint expects a local checkpoint path)
model_path = download_model("chrismazii/kinycomet_unbabel")
model = load_from_checkpoint(model_path)

# Example translations: the first is correct, the second mistranslates the source
samples = [
    {
        "src": "Umugabo ararya.",
        "mt": "The man is eating.",
        "ref": "The man is eating."
    },
    {
        "src": "Umwana arasinzira.",
        "mt": "A dog sleeps.",
        "ref": "The child is sleeping."
    }
]

# Predict segment-level and system-level scores (gpus=0 runs on CPU)
pred = model.predict(samples, batch_size=8, gpus=0)
print(pred)
```
Output Example:
```
Prediction({
    'scores': [0.9899, 0.8813],
    'system_score': 0.9356
})
```
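If a GPU is available, pass `gpus=1` to `predict`; `batch_size` can be raised to speed up scoring of larger inputs. Segment-level scores are available as `pred.scores` and the corpus-level average as `pred.system_score`.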
Using the Command Line Interface (CLI)
You can also evaluate translations directly using the terminal.
Step 1: Create the text files
```bash
cat > source.txt <<'SRC'
Umugabo ararya.
Umwana arasinzira.
Uyu mwanya neza cyane.
SRC

cat > reference.txt <<'REF'
The man is eating.
The child is sleeping.
This place is very nice.
REF

cat > hypothesis.txt <<'HYP'
The man is eating.
A dog sleeps.
This place is very nice.
HYP
```
Step 2: Run KinyCOMET
```bash
comet-score -s source.txt -r reference.txt -t hypothesis.txt \
    --model chrismazii/kinycomet_unbabel --gpus 0 --to_json results.json
```
Step 3: View the results
```bash
cat results.json
```
Example Output:
```json
{
  "system_score": 0.9547,
  "segments": [
    {"src": "Umugabo ararya.", "mt": "The man is eating.", "ref": "The man is eating.", "score": 0.9899},
    {"src": "Umwana arasinzira.", "mt": "A dog sleeps.", "ref": "The child is sleeping.", "score": 0.8813},
    {"src": "Uyu mwanya neza cyane.", "mt": "This place is very nice.", "ref": "This place is very nice.", "score": 0.9927}
  ]
}
```
Score Interpretation
- Scores range from 0 to 1: Higher scores indicate better translation quality
- System score: Average quality across all translations
- Segment scores: Individual quality scores for each translation pair
- Threshold guidance: Scores above 0.8 typically indicate high-quality translations
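As a sketch of how this guidance might be applied in practice (the 0.8 cutoff follows the guidance above; the pairs and scores are illustrative, and in practice would come from `model.predict(...)` as in the Usage section):

```python
# Sketch: routing segments by score using the 0.8 guidance above.
# The pairs and scores below are illustrative placeholders.
HIGH_QUALITY_THRESHOLD = 0.8

pairs = ["The man is eating.", "A dog sleeps."]
scores = [0.9899, 0.7213]

accepted, flagged = [], []
for pair, score in zip(pairs, scores):
    (accepted if score >= HIGH_QUALITY_THRESHOLD else flagged).append((pair, score))

print(f"{len(accepted)} accepted, {len(flagged)} flagged for post-editing")
```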
Training Details
Model Architecture
- Base Models: XLM-RoBERTa-large and Unbabel/wmt22-comet-da
- Framework: COMET quality estimation framework
- Training Data: 4,323 human-annotated Kinyarwanda-English translation pairs
Training Configuration
- Methodology: COMET framework with Direct Assessment supervision
- Evaluation Metrics: Kendall's τ and Spearman ρ correlation with human DA scores
- Data Split: 80% train (3,497) / 10% validation (404) / 10% test (422)
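For illustration, a minimal sketch of an 80/10/10 split like the one described above (the file name and the pandas/scikit-learn tooling are assumptions, not the project's released pipeline):

```python
# Sketch: an 80/10/10 train/validation/test split over annotated pairs.
# The CSV file name is hypothetical; the published split sizes
# (3,497 / 404 / 422) come from the project's own partition.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("kinycomet_annotations.csv")  # hypothetical file

train_df, holdout_df = train_test_split(df, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(holdout_df, test_size=0.5, random_state=42)
print(len(train_df), len(val_df), len(test_df))
```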
MT System Benchmarking Results
Our evaluation of production MT systems with KinyCOMET yields the following system-level scores (mean ± standard deviation across segments):
| MT System | Kinyarwanda→English | English→Kinyarwanda | Overall |
|---|---|---|---|
| GPT-4o | 93.10% ± 7.77 | 87.83% ± 11.15 | 90.69% ± 9.82 |
| GPT-4.1 | 93.08% ± 6.62 | 87.92% ± 10.38 | 90.75% ± 8.90 |
| Gemini Flash 2.0 | 91.46% ± 11.39 | 90.02% ± 8.92 | 90.80% ± 10.35 |
| Claude 3.7 | 92.48% ± 8.32 | 85.75% ± 11.28 | 89.43% ± 10.33 |
| NLLB-1.3B | 89.42% ± 12.04 | 83.96% ± 16.31 | 86.78% ± 14.52 |
| NLLB-600M | 88.87% ± 12.11 | 75.46% ± 28.49 | 82.71% ± 22.27 |
Key Findings:
- LLM-based systems significantly outperform traditional neural MT
- All systems perform better on Kinyarwanda→English than English→Kinyarwanda
- Score differences between the top systems are small; KinyCOMET's low mean absolute error makes such fine-grained comparisons more trustworthy than BLEU-based ones
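A sketch of how such a benchmark might be assembled with KinyCOMET (the systems and their outputs below are illustrative; the table above was not produced by this snippet):

```python
# Sketch: scoring several MT systems with one KinyCOMET model and
# aggregating mean ± std per system, as in the table above.
import numpy as np
from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("chrismazii/kinycomet_unbabel"))

# Illustrative placeholder outputs for two hypothetical systems
systems = {
    "system_a": [
        {"src": "Umugabo ararya.", "mt": "The man is eating.", "ref": "The man is eating."},
    ],
    "system_b": [
        {"src": "Umugabo ararya.", "mt": "A man eat.", "ref": "The man is eating."},
    ],
}

for name, samples in systems.items():
    pred = model.predict(samples, batch_size=8, gpus=0)
    scores = np.array(pred.scores)
    print(f"{name}: {100 * scores.mean():.2f}% ± {100 * scores.std():.2f}")
```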
Real-World Impact & Applications
Addressing Rwanda's NLP Ecosystem Needs
KinyCOMET directly addresses pain points identified by the Rwandan MT community:
Before KinyCOMET:
- BLEU scores poorly correlate with human judgment for Kinyarwanda
- Expensive, time-consuming human evaluation required
- No reliable automatic metrics for morphologically rich Kinyarwanda
With KinyCOMET:
- 2.5x better correlation with human judgments than BLEU
- Instant evaluation for production MT systems
- Cost-effective alternative to human annotation
- Specialized for Kinyarwanda morphological complexity
Production Use Cases
For MT Companies (Digital Umuganda, KINLP, Awesomity, Artemis AI):
- Real-time translation quality monitoring
- A/B testing of model improvements
- Quality gates for production deployments (a sketch follows these lists)
For Researchers & Developers:
- Benchmark new Kinyarwanda MT models
- Dataset quality assessment
- Cross-lingual transfer learning evaluation
For Content & Localization:
- Prioritize post-editing efforts
- Quality assurance workflows
- User confidence scoring
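As one concrete pattern for the quality-gate use case above, a deployment check might look like the following sketch (the 0.85 threshold and file names are illustrative assumptions, not project defaults):

```python
# Sketch: a CI-style quality gate that fails when the system-level
# KinyCOMET score drops below a chosen threshold.
import sys
from comet import download_model, load_from_checkpoint

GATE_THRESHOLD = 0.85  # illustrative cutoff, tune per deployment

def main() -> int:
    model = load_from_checkpoint(download_model("chrismazii/kinycomet_unbabel"))
    with open("source.txt") as src, open("hypothesis.txt") as hyp, open("reference.txt") as ref:
        samples = [
            {"src": s.strip(), "mt": m.strip(), "ref": r.strip()}
            for s, m, r in zip(src, hyp, ref)
        ]
    pred = model.predict(samples, batch_size=8, gpus=0)
    print(f"system_score = {pred.system_score:.4f}")
    return 0 if pred.system_score >= GATE_THRESHOLD else 1  # nonzero exit fails the gate

if __name__ == "__main__":
    sys.exit(main())
```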
Limitations & Considerations
- Domain Specificity: Trained on education and tourism domains; may not generalize to all content types
- Language Variants: Optimized for standard Kinyarwanda; dialectal variations may affect performance
- Resource Requirements: Requires the COMET library and substantial computational resources
- Score Interpretation: Scores are relative to training data distribution
- Reference Dependency: Best performance achieved with reference translations
Dataset Access
The training dataset is available separately. See the KinyCOMET Dataset Card for details on accessing the human-annotated quality estimation data.
Citation & Research
If you use KinyCOMET in your research, please cite:
```bibtex
@misc{kinycomet2025,
  title={KinyCOMET: Translation Quality Estimation for Kinyarwanda-English},
  author={Prince Chris Mazimpaka and Jan Nehring},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/chrismazii/kinycomet_unbabel}}
}
```
Contributing to African NLP
KinyCOMET contributes to the growing ecosystem of African language NLP tools. We encourage:
- Community Feedback: Report issues and suggest improvements
- Extension Work: Adapt for other African languages
- Dataset Contributions: Share additional Kinyarwanda evaluation data
- Collaborative Research: Partner on African language translation quality research
License
This model is released under the Apache 2.0 License.
Acknowledgments
- COMET Framework: Built on the excellent COMET quality estimation framework
- Base Models: Leverages XLM-RoBERTa and Unbabel's WMT22 COMET-DA models
- African NLP Community: Inspired by ongoing efforts to advance African language technologies
- Contributors: Thanks to the 15 linguistics students and all researchers who made this work possible
