GaroEmbed: Cross-Lingual Sentence Embeddings for Garo

GaroEmbed is the first neural sentence embedding model for Garo, a Tibeto-Burman language with roughly 1.2 million speakers in Meghalaya, India. It aligns the Garo semantic space with English through contrastive learning, achieving 29.33% Top-1 and 65.33% Top-5 cross-lingual retrieval accuracy.

Model Description

  • Model Type: BiLSTM Sentence Encoder with Contrastive Learning
  • Language: Garo (grt) ↔ English (en)
  • Training Data: 3,000 Garo-English parallel sentence pairs
  • Base Embeddings: GaroVec (FastText 300d with char n-grams)
  • Output Dimension: 384d (aligned with MiniLM)
  • Parameters: 10.7M
  • Training Time: ~15 minutes on RTX A4500

Performance

Metric                  Score
Top-1 Accuracy          29.33%
Top-5 Accuracy          65.33%
Top-10 Accuracy         72.67%
Mean Reciprocal Rank    0.4512
Avg Cosine Similarity   0.3446

88x improvement over the mean-pooled GaroVec baseline (0.33% → 29.33% Top-1).
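
The exact evaluation script is not part of this card, but the retrieval metrics above can be reproduced from two embedding matrices with a few lines of NumPy. This is a minimal sketch, assuming row i of the Garo matrix and row i of the English matrix form a parallel pair and that both matrices are L2-normalized (as the model and the MiniLM encoder produce them); the function name is illustrative, not from the repository.

import numpy as np

def retrieval_metrics(garo_embeds, english_embeds, ks=(1, 5, 10)):
    # Cosine similarity reduces to a dot product for L2-normalized rows
    sims = garo_embeds @ english_embeds.T
    # 0-based rank of the correct English sentence for each Garo query
    order = np.argsort(-sims, axis=1)
    ranks = np.array([int(np.where(order[i] == i)[0][0]) for i in range(len(sims))])
    metrics = {f"top{k}_accuracy": float(np.mean(ranks < k)) for k in ks}
    metrics["mrr"] = float(np.mean(1.0 / (ranks + 1)))
    return metrics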

Usage

Requirements

pip install torch fasttext-wheel sentence-transformers scikit-learn huggingface-hub

Loading the Model

import torch
import torch.nn as nn
import fasttext
from huggingface_hub import hf_hub_download

# Download model checkpoint
checkpoint_path = hf_hub_download(
    repo_id="Badnyal/GaroEmbed",
    filename="garoembed_best.pt"
)

# Download GaroVec embeddings (required)
garovec_path = hf_hub_download(
    repo_id="MWirelabs/GaroVec",
    filename="garovec_garo.bin"
)

# Load GaroVec
garo_fasttext = fasttext.load_model(garovec_path)

# Define model architecture (see model_architecture.py in repo)
class GaroEmbed(nn.Module):
    def __init__(self, garo_fasttext_model, embedding_dim=300, hidden_dim=512, output_dim=384, dropout=0.3):
        super(GaroEmbed, self).__init__()
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.output_dim = output_dim
        # Embedding layer initialized from the pretrained GaroVec vectors
        vocab_size = len(garo_fasttext_model.words)
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        weights = []
        for word in garo_fasttext_model.words:
            weights.append(garo_fasttext_model.get_word_vector(word))
        weights_tensor = torch.FloatTensor(weights)
        self.embedding.weight.data.copy_(weights_tensor)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=2, bidirectional=True, dropout=dropout, batch_first=True)
        self.projection = nn.Linear(hidden_dim * 2, output_dim)
        self.word2idx = {word: idx for idx, word in enumerate(garo_fasttext_model.words)}
        self.fasttext_model = garo_fasttext_model

    def tokenize_and_encode(self, sentences):
        # Lowercase and whitespace-tokenize each sentence, mapping tokens to GaroVec indices (0 for OOV)
        batch_indices = []
        batch_lengths = []
        for sentence in sentences:
            tokens = sentence.lower().split()
            indices = []
            for token in tokens:
                if token in self.word2idx:
                    indices.append(self.word2idx[token])
                else:
                    indices.append(0)
            if len(indices) == 0:
                indices = [0]
            batch_indices.append(indices)
            batch_lengths.append(len(indices))
        return batch_indices, batch_lengths

    def forward(self, sentences):
        batch_indices, batch_lengths = self.tokenize_and_encode(sentences)
        max_len = max(batch_lengths)
        device = next(self.parameters()).device
        padded = torch.zeros(len(sentences), max_len, dtype=torch.long, device=device)
        for i, indices in enumerate(batch_indices):
            padded[i, :len(indices)] = torch.LongTensor(indices)
        embedded = self.embedding(padded)
        packed = nn.utils.rnn.pack_padded_sequence(embedded, batch_lengths, batch_first=True, enforce_sorted=False)
        lstm_out, (hidden, cell) = self.lstm(packed)
        # Concatenate the final forward and backward hidden states of the top BiLSTM layer
        forward_hidden = hidden[-2]
        backward_hidden = hidden[-1]
        combined = torch.cat([forward_hidden, backward_hidden], dim=1)
        # Project into the shared 384-d space and L2-normalize
        sentence_embedding = self.projection(combined)
        sentence_embedding = nn.functional.normalize(sentence_embedding, p=2, dim=1)
        return sentence_embedding

# Initialize and load weights
model = GaroEmbed(garo_fasttext, output_dim=384)
checkpoint = torch.load(checkpoint_path, map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Encode Garo sentences
garo_sentences = [
    "Anga namjanika",
    "Rikgiparang kamko suala"
]

with torch.no_grad():
    embeddings = model(garo_sentences)
    print(f"Embeddings shape: {embeddings.shape}")  # [2, 384]

Cross-Lingual Retrieval

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load English encoder (frozen anchor)
english_encoder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Encode Garo and English
garo_texts = ["Anga namjanika", "Garo biapni dokana"]
english_texts = ["I feel bad", "About Garo culture", "The weather is nice"]

garo_embeds = model(garo_texts).detach().numpy()
english_embeds = english_encoder.encode(english_texts, normalize_embeddings=True)

# Compute similarities
similarities = cosine_similarity(garo_embeds, english_embeds)
print("Garo-English similarities:")
print(similarities)
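
To turn the similarity matrix into retrieval output, take the row-wise argmax: each Garo sentence's best English candidate is the column with the highest cosine score. A short follow-up using the variables above:

import numpy as np

# For each Garo sentence, report the English candidate with the highest cosine similarity
for i, garo in enumerate(garo_texts):
    j = int(np.argmax(similarities[i]))
    print(f"{garo!r} -> {english_texts[j]!r} (cosine {similarities[i, j]:.3f})")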

Training Details

  • Architecture: 2-layer BiLSTM (512 hidden units) + Linear projection
  • Loss: InfoNCE contrastive loss (temperature=0.07; see the sketch after this list)
  • Optimizer: Adam (lr=2×10⁻⁴)
  • Batch Size: 32
  • Epochs: 20
  • Regularization: Dropout 0.3, frozen GaroVec embeddings
  • English Anchor: Frozen MiniLM (sentence-transformers/all-MiniLM-L6-v2)
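
The training code is not included in this card; the following is a minimal sketch of a symmetric (both-direction) InfoNCE loss over a batch of L2-normalized Garo and English sentence embeddings, using the temperature of 0.07 stated above. The function name and the symmetric formulation are assumptions for illustration, not the exact objective used in training.

import torch
import torch.nn.functional as F

def info_nce_loss(garo_embeds, english_embeds, temperature=0.07):
    # Both inputs are (batch, 384) and L2-normalized; row i of each matrix is
    # assumed to be a translation pair, with the other rows acting as in-batch negatives.
    logits = garo_embeds @ english_embeds.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy in both retrieval directions (Garo→English and English→Garo)
    loss_g2e = F.cross_entropy(logits, targets)
    loss_e2g = F.cross_entropy(logits.T, targets)
    return (loss_g2e + loss_e2g) / 2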

Limitations

  • Trained on only 3,000 parallel pairs (limited semantic coverage)
  • Domain: Daily conversation and cultural topics (lacks technical/literary language)
  • Orthography: Latin script only
  • Morphology: Does not explicitly model Garo's agglutinative structure
  • Evaluation: Limited to retrieval tasks

Acknowledgments

License

MIT License - Free for research and commercial use

Contact


First neural sentence embedding model for the Garo language • Enabling NLP for low-resource Tibeto-Burman languages
