KinyCOMET — Translation Quality Estimation for Kinyarwanda ↔ English

Model Description

KinyCOMET is a neural translation quality estimation model designed specifically for Kinyarwanda ↔ English translation evaluation. It addresses a critical gap in the Rwandan NLP ecosystem, where BLEU scores correlate poorly with human judgment, by providing state-of-the-art automatic evaluation that aligns strongly with human assessments.

Why KinyCOMET Matters: Rwanda's thriving MT ecosystem includes companies like Digital Umuganda, KINLP, Awesomity, and Artemis AI, but the community faces significant evaluation challenges. Human evaluation is expensive and time-consuming, while BLEU scores don't capture translation quality effectively for morphologically rich languages like Kinyarwanda.

Key Features:

  • Superior Correlation: Achieves 0.75 Pearson correlation with human judgments (vs. 0.30 for BLEU)
  • Bidirectional Excellence: Optimized for both Kinyarwanda→English and English→Kinyarwanda
  • Community-Driven: Trained on 4,323 human-annotated samples from 15 linguistics students
  • Production-Ready: Direct Assessment scoring aligned with WMT evaluation standards
  • Open Science: Fully open model, dataset, and training pipeline

Model Variants & Performance

| Variant | Base Model | Pearson | Spearman | Kendall's τ | MAE |
|---|---|---|---|---|---|
| KinyCOMET-Unbabel | Unbabel/wmt22-comet-da | 0.75 | 0.59 | 0.42 | 0.07 |
| KinyCOMET-XLM | XLM-RoBERTa-large | 0.73 | 0.50 | 0.35 | 0.07 |
| Unbabel (baseline) | wmt22-comet-da | 0.54 | 0.55 | 0.39 | 0.17 |
| AfriCOMET STL 1.1 | AfriCOMET base | 0.52 | 0.35 | 0.24 | 0.18 |
| BLEU | N/A | 0.30 | 0.34 | 0.23 | 0.62 |
| chrF | N/A | 0.38 | 0.30 | 0.21 | 0.34 |

State-of-the-Art Results: Both KinyCOMET variants significantly outperform existing baselines, with KinyCOMET-Unbabel achieving the highest correlation across all metrics.

Performance Highlights

Comprehensive Evaluation Results

Overall Performance (Both Directions)

  • Pearson Correlation: 0.75 (KinyCOMET-Unbabel) vs 0.30 (BLEU) - 2.5x improvement
  • Spearman Correlation: 0.59 vs 0.34 (BLEU) - 73% improvement
  • Mean Absolute Error: 0.07 vs 0.62 (BLEU) - 89% reduction

Directional Analysis

| Direction | Model | Pearson | Spearman | Kendall's τ |
|---|---|---|---|---|
| English → Kinyarwanda | KinyCOMET-XLM | 0.76 | 0.52 | 0.37 |
| English → Kinyarwanda | KinyCOMET-Unbabel | 0.75 | 0.56 | 0.40 |
| Kinyarwanda → English | KinyCOMET-Unbabel | 0.63 | 0.47 | 0.33 |
| Kinyarwanda → English | KinyCOMET-XLM | 0.37 | 0.29 | 0.21 |

Key Insights:

  • English→Kinyarwanda consistently outperforms Kinyarwanda→English across all metrics
  • Both KinyCOMET variants significantly outperform the AfriCOMET baselines, even though AfriCOMET's training data includes Kinyarwanda
  • Surprising finding: the Unbabel baseline (not trained on Kinyarwanda) also outperforms the AfriCOMET variants

Installation

Make sure you have Python ≥ 3.8 and install COMET via pip:

pip install unbabel-comet

You can verify the CLI tool is installed:

which comet-score
# should print something like: /usr/local/bin/comet-score

For more details on COMET, see the official documentation.

Usage

Load and Use the Model in Python

Here's a simple example to score translations directly in Python:

from comet import download_model, load_from_checkpoint

# Download the public KinyCOMET checkpoint from the Hugging Face Hub,
# then load it (load_from_checkpoint expects a local checkpoint path)
model_path = download_model("chrismazii/kinycomet_unbabel")
model = load_from_checkpoint(model_path)

# Example translations
samples = [
    {
        "src": "Umugabo ararya.",
        "mt": "The man is eating.",
        "ref": "The man is eating."
    },
    {
        "src": "Umwana arasinzira.",
        "mt": "A dog sleeps.",
        "ref": "The child is sleeping."
    }
]

# Predict scores (gpus=0 runs on CPU; set gpus=1 to use a GPU)
pred = model.predict(samples, batch_size=8, gpus=0)
print(pred)

Output Example:

Prediction({
  'scores': [0.9899, 0.8813],
  'system_score': 0.9356
})

Using the Command Line Interface (CLI)

You can also evaluate translations directly using the terminal.

Step 1: Create the text files

cat > source.txt <<'SRC'
Umugabo ararya.
Umwana arasinzira.
Uyu mwanya neza cyane.
SRC

cat > reference.txt <<'REF'
The man is eating.
The child is sleeping.
This place is very nice.
REF

cat > hypothesis.txt <<'HYP'
The man is eating.
A dog sleeps.
This place is very nice.
HYP

Step 2: Run KinyCOMET

comet-score -s source.txt -r reference.txt -t hypothesis.txt \
  --model chrismazii/kinycomet_unbabel --gpus 0 --to_json results.json

Step 3: View the results

cat results.json

Example Output:

{
  "system_score": 0.9547,
  "segments": [
    {"src":"Umugabo ararya.","mt":"The man is eating.","ref":"The man is eating.","score":0.9899},
    {"src":"Umwana arasinzira.","mt":"A dog sleeps.","ref":"The child is sleeping.","score":0.8813},
    {"src":"Uyu mwanya neza cyane.","mt":"This place is very nice.","ref":"This place is very nice.","score":0.9927}
  ]
}
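
If you want to post-process the results programmatically, here is a minimal Python sketch, assuming the JSON layout shown above and using 0.8 as an illustrative review threshold:

import json

# Load the scores written by comet-score (layout as in the example above)
with open("results.json") as f:
    results = json.load(f)

print(f"System score: {results['system_score']:.4f}")

# Flag segments below an illustrative quality threshold of 0.8
for seg in results["segments"]:
    if seg["score"] < 0.8:
        print(f"Needs review ({seg['score']:.4f}): {seg['mt']}")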

Score Interpretation

  • Scores range from 0 to 1: Higher scores indicate better translation quality
  • System score: Average quality across all translations
  • Segment scores: Individual quality scores for each translation pair
  • Threshold guidance: Scores above 0.8 typically indicate high-quality translations
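
To make the relationship between segment and system scores concrete, here is a minimal sketch reusing the two segment scores from the Python example above; the system score is simply their mean:

from statistics import mean

# Segment scores from the Python usage example above
segment_scores = [0.9899, 0.8813]

# COMET's system score is the average of the segment scores
system_score = mean(segment_scores)
print(f"{system_score:.4f}")  # 0.9356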

Training Details

Model Architecture

  • Base Models: XLM-RoBERTa-large and Unbabel/wmt22-comet-da
  • Framework: COMET quality estimation framework
  • Training Data: 4,323 human-annotated Kinyarwanda-English translation pairs

Training Configuration

  • Methodology: COMET framework with Direct Assessment supervision
  • Evaluation Metrics: Pearson, Spearman ρ, and Kendall's τ correlations with human DA scores, plus mean absolute error (MAE)
  • Data Split: 80% train (3,497) / 10% validation (404) / 10% test (422)
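
For illustration, these correlation metrics can be computed with scipy.stats; in this minimal sketch the model scores and human DA scores are made-up placeholder values:

from scipy.stats import kendalltau, pearsonr, spearmanr

# Hypothetical model predictions and human Direct Assessment scores
model_scores = [0.91, 0.85, 0.42, 0.77, 0.63]
human_da = [0.88, 0.90, 0.35, 0.70, 0.55]

print(f"Pearson r:  {pearsonr(model_scores, human_da)[0]:.3f}")
print(f"Spearman ρ: {spearmanr(model_scores, human_da)[0]:.3f}")
print(f"Kendall τ:  {kendalltau(model_scores, human_da)[0]:.3f}")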

MT System Benchmarking Results

Our evaluation of production MT systems reveals interesting insights:

| MT System | Kinyarwanda→English | English→Kinyarwanda | Overall |
|---|---|---|---|
| GPT-4o | 93.10% ± 7.77 | 87.83% ± 11.15 | 90.69% ± 9.82 |
| GPT-4.1 | 93.08% ± 6.62 | 87.92% ± 10.38 | 90.75% ± 8.90 |
| Gemini Flash 2.0 | 91.46% ± 11.39 | 90.02% ± 8.92 | 90.80% ± 10.35 |
| Claude 3.7 | 92.48% ± 8.32 | 85.75% ± 11.28 | 89.43% ± 10.33 |
| NLLB-1.3B | 89.42% ± 12.04 | 83.96% ± 16.31 | 86.78% ± 14.52 |
| NLLB-600M | 88.87% ± 12.11 | 75.46% ± 28.49 | 82.71% ± 22.27 |

Key Findings:

  • LLM-based systems significantly outperform traditional neural MT
  • All systems perform better on Kinyarwanda→English than English→Kinyarwanda
  • Score differences between the top systems are small but still meaningful given KinyCOMET's low mean absolute error
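
A table like the one above can be reproduced by scoring each system's output with KinyCOMET and aggregating per direction. Below is a minimal sketch; the two sample pairs stand in for a real per-direction test set:

from statistics import mean, stdev

from comet import download_model, load_from_checkpoint

# Load KinyCOMET once and reuse it for every system and direction
model = load_from_checkpoint(download_model("chrismazii/kinycomet_unbabel"))

# Illustrative Kinyarwanda→English outputs from one MT system (placeholder data)
kin_en_samples = [
    {"src": "Umugabo ararya.", "mt": "The man is eating.", "ref": "The man is eating."},
    {"src": "Umwana arasinzira.", "mt": "A dog sleeps.", "ref": "The child is sleeping."},
]

pred = model.predict(kin_en_samples, batch_size=8, gpus=0)

# Report mean ± standard deviation of the segment scores, as in the table above
print(f"{mean(pred.scores) * 100:.2f}% ± {stdev(pred.scores) * 100:.2f}")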

Real-World Impact & Applications

Addressing Rwanda's NLP Ecosystem Needs

KinyCOMET directly addresses pain points identified by the Rwandan MT community:

Before KinyCOMET:

  • BLEU scores poorly correlate with human judgment for Kinyarwanda
  • Expensive, time-consuming human evaluation required
  • No reliable automatic metrics for morphologically rich Kinyarwanda

With KinyCOMET:

  • 2.5x better correlation with human judgments than BLEU
  • Instant evaluation for production MT systems
  • Cost-effective alternative to human annotation
  • Specialized for Kinyarwanda morphological complexity

Production Use Cases

For MT Companies (Digital Umuganda, KINLP, Awesomity, Artemis AI):

  • Real-time translation quality monitoring
  • A/B testing of model improvements
  • Quality gates for production deployments (see the sketch below)
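
As one illustration of a quality gate, a release script might refuse to promote an MT model whose system score falls below a chosen threshold. A minimal sketch, where the 0.85 threshold and the single sample are illustrative placeholders:

import sys

from comet import download_model, load_from_checkpoint

QUALITY_GATE = 0.85  # illustrative threshold; tune per domain and risk tolerance

model = load_from_checkpoint(download_model("chrismazii/kinycomet_unbabel"))

# Output of the candidate MT system on a held-out evaluation set (placeholder data)
candidate_samples = [
    {"src": "Umugabo ararya.", "mt": "The man is eating.", "ref": "The man is eating."},
]

pred = model.predict(candidate_samples, batch_size=8, gpus=0)
if pred.system_score < QUALITY_GATE:
    sys.exit(f"Quality gate FAILED: {pred.system_score:.4f} < {QUALITY_GATE}")
print(f"Quality gate passed: {pred.system_score:.4f}")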

For Researchers & Developers:

  • Benchmark new Kinyarwanda MT models
  • Dataset quality assessment
  • Cross-lingual transfer learning evaluation

For Content & Localization:

  • Prioritize post-editing efforts
  • Quality assurance workflows
  • User confidence scoring

Limitations & Considerations

  • Domain Specificity: Trained on education and tourism domains; may not generalize to all content types
  • Language Variants: Optimized for standard Kinyarwanda; dialectal variations may affect performance
  • Resource Requirements: Requires COMET library and substantial computational resources
  • Score Interpretation: Scores are relative to training data distribution
  • Reference Dependency: Best performance achieved with reference translations

Dataset Access

The training dataset is available separately. See the KinyCOMET Dataset Card for details on accessing the human-annotated quality estimation data.

Citation & Research

If you use KinyCOMET in your research, please cite:

@misc{kinycomet2025,
    title={KinyCOMET: Translation Quality Estimation for Kinyarwanda-English},
    author={Prince Chris Mazimpaka and Jan Nehring},
    year={2025},
    publisher={Hugging Face},
    howpublished={\url{https://huggingface.co/chrismazii/kinycomet_unbabel}}
}

Contributing to African NLP

KinyCOMET contributes to the growing ecosystem of African language NLP tools. We encourage:

  • Community Feedback: Report issues and suggest improvements
  • Extension Work: Adapt for other African languages
  • Dataset Contributions: Share additional Kinyarwanda evaluation data
  • Collaborative Research: Partner on African language translation quality research

License

This model is released under the Apache 2.0 License.

Acknowledgments

  • COMET Framework: Built on the excellent COMET quality estimation framework
  • Base Models: Leverages XLM-RoBERTa and Unbabel's WMT22 COMET-DA models
  • African NLP Community: Inspired by ongoing efforts to advance African language technologies
  • Contributors: Thanks to the 15 linguistics students and all researchers who made this work possible
