Tigre Low-Resource Language Resource Collection (Tigre-Data 1.0)

Overview

This repository introduces the first comprehensive public collection of resources for the Tigre language — an under-resourced South Semitic language within the Afro-Asiatic family. The release aggregates multiple modalities (text + speech) and provides baseline models for NLP tasks including language modeling, automatic speech recognition (ASR), and machine translation.

The goal of Tigre-Data 1.0 is to accelerate research in low-resource NLP, morphologically rich language modeling, speech technologies, and cross-lingual transfer.


tigre-sonar-encoder

A Tigre–English semantic similarity and quality-checking encoder, fine-tuned to align Tigre text with the SONAR universal embedding space.

Key Capabilities

  • Generates 1024-dimensional embeddings for Tigre and English text
  • Computes cosine similarity for translation validation and filtering
  • Supports retrieval, clustering, and cross-lingual semantic tasks

Model Description

Input Language: Tigre (tig, script: Ethiopic — tig_Ethi)
Base Model: facebook/nllb-200-distilled-1.3B
Model Type: Encoder-only (text embedding model)
Purpose: Align Tigre embeddings with the universal SONAR cross-lingual space


Training Method: Knowledge Distillation

The model was trained with a teacher–student distillation pipeline:

1. Model & Tokenizer Preparation

  • Initialized from the NLLB-200 distilled encoder
  • Extended tokenizer with Tigre-specific vocabulary
  • New token embeddings initialized by averaging sub-token embeddings (see the sketch after the step list below)

2. Teacher Embedding Generation

  • SONAR embedding model used as the Teacher
  • English translations of Tigre sentences encoded into 1024-dimensional vectors

3. Distillation Fine-Tuning

  • Minimized Mean Squared Error (MSE) loss between Student (Tigre encoder) and Teacher embeddings
  • Forced the Tigre model to align with the universal cross-lingual space
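
Below is a minimal sketch of the vocabulary extension and embedding initialization described in step 1, under stated assumptions: the file name `tigre_vocab.txt` is hypothetical, and the real pipeline may differ. Each new token's sub-word split is recorded before it is added, and its new embedding row is seeded with the mean of those sub-token embeddings.

```python
import torch
from transformers import AutoTokenizer, M2M100ForConditionalGeneration

base_id = "facebook/nllb-200-distilled-1.3B"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = M2M100ForConditionalGeneration.from_pretrained(base_id)

# Hypothetical list of Tigre-specific tokens to add (one per line).
new_tokens = [line.strip() for line in open("tigre_vocab.txt", encoding="utf-8") if line.strip()]

# Record how each new token splits into existing sub-tokens *before* adding it.
old_splits = {t: tokenizer(t, add_special_tokens=False)["input_ids"] for t in new_tokens}

tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

# Seed each new embedding row with the average of its old sub-token embeddings.
emb = model.get_input_embeddings().weight
with torch.no_grad():
    for tok, sub_ids in old_splits.items():
        if sub_ids:
            emb[tokenizer.convert_tokens_to_ids(tok)] = emb[torch.tensor(sub_ids)].mean(dim=0)
```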

Training Details

  • Dataset: train_tig_parallel_text.parquet
  • Contents: Tigre sentences paired with gold-standard SONAR embeddings
  • Objective: MSE loss between model output and SONAR target vectors
  • Tokenizer: Extended NLLB tokenizer with Tigre-specific vocabulary
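
To make the objective concrete, here is an illustrative training-step sketch, not the authors' exact code: the parquet column names `text` and `sonar_embedding` are assumptions, and batching, scheduling, and hyperparameters are placeholders. The student encoder mean-pools its hidden states into one vector per Tigre sentence and is pulled toward the pre-computed SONAR teacher vector with MSE loss.

```python
import numpy as np
import pandas as pd
import torch
from transformers import AutoTokenizer, M2M100ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# Student: the (vocabulary-extended) NLLB-200 encoder. The base checkpoint is shown here.
base_id = "facebook/nllb-200-distilled-1.3B"
tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.src_lang = "tig_Ethi"  # assumes the Tigre language code was added alongside the new vocabulary
student = M2M100ForConditionalGeneration.from_pretrained(base_id).get_encoder().to(device)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

# Tigre sentences paired with pre-computed SONAR teacher vectors (column names are guesses).
df = pd.read_parquet("train_tig_parallel_text.parquet")
sentences = df["text"].tolist()
teacher = torch.from_numpy(np.stack(df["sonar_embedding"].to_numpy())).float()

def distillation_step(batch_texts, batch_teacher):
    enc = tokenizer(batch_texts, return_tensors="pt", padding=True,
                    truncation=True, max_length=512).to(device)
    hidden = student(**enc).last_hidden_state
    mask = enc["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp_min(1.0)  # masked mean pooling
    loss = torch.nn.functional.mse_loss(pooled, batch_teacher.to(device))  # match the SONAR target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```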

Evaluation Results

| Metric | Result | Description |
|---|---|---|
| Accuracy (Source → Target) | 0.88 | Retrieval accuracy when querying with Tigre text |
| Accuracy (Target → Source) | 0.78 | Retrieval accuracy when querying with English text |
| BLEU | 30.74 | Relates to a separate MT evaluation, not this encoder |
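
For reference, retrieval accuracy of this kind is typically computed over a held-out parallel set: embed both sides, then check how often the most similar sentence on the other side is the true translation. A minimal sketch over two aligned, L2-normalized embedding matrices (not the authors' evaluation script):

```python
import torch

def retrieval_accuracy(src_emb: torch.Tensor, tgt_emb: torch.Tensor) -> float:
    """src_emb, tgt_emb: (N, 1024) L2-normalized embeddings, where row i of each
    matrix belongs to the same translation pair. Returns top-1 accuracy src -> tgt."""
    sims = src_emb @ tgt_emb.T                 # cosine similarities, shape (N, N)
    nearest = sims.argmax(dim=1)               # most similar target for each source
    gold = torch.arange(src_emb.size(0))       # the correct target is on the diagonal
    return (nearest == gold).float().mean().item()
```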

Usage Example (Python)

```bash
pip install transformers torch
```

```python

from transformers import AutoTokenizer, M2M100ForConditionalGeneration
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# load your Tigre-trained encoder
model_id = "BeitTigreAI/tigre-sonar-encoder"
seq2seq = M2M100ForConditionalGeneration.from_pretrained(model_id)
encoder = seq2seq.get_encoder().to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)

@torch.inference_mode()
def embed(texts, lang):
    """Mean-pool encoder hidden states into L2-normalized sentence embeddings."""
    tokenizer.src_lang = lang
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
    out = encoder(**batch, return_dict=True)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp_min(1.0)  # masked mean pooling
    return torch.nn.functional.normalize(pooled, p=2, dim=1)

def score_pair(tig, eng):
    """Cosine similarity between a Tigre/English pair, scaled to 0-100."""
    t = embed([tig], "tig_Ethi")
    e = embed([eng], "eng_Latn")
    sim = float((t * e).sum())  # dot product of unit vectors = cosine similarity
    return round(sim * 100, 1)

print(score_pair("እት እድንየ እግል ትርኤ ተሐዜዮ ተቅዪር ግበእ", "Be the change that you wish to see in the world"))
print(score_pair("ክል ዶል ኢገብእ መስል እስከ ይከለስ", "It always seems impossible until it's done"))
```
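
Building on the `embed` helper defined above, a small illustrative extension of the same idea for corpus filtering; the 0.7 threshold is an arbitrary placeholder, not a recommended value.

```python
# Reuses embed() from the example above.
def filter_pairs(pairs, threshold=0.7):
    """pairs: list of (tigre_sentence, english_sentence) tuples.
    Keeps pairs whose cosine similarity clears the (illustrative) threshold."""
    tig = embed([p[0] for p in pairs], "tig_Ethi")
    eng = embed([p[1] for p in pairs], "eng_Latn")
    sims = (tig * eng).sum(dim=1)  # row-wise cosine similarity of unit vectors
    return [p for p, s in zip(pairs, sims.tolist()) if s >= threshold]
```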

## License

**CC BY-SA 4.0**
