AGS — Arabic Generality Score (`Sanadshabann/AGS`)

AGS predicts a continuous generality score for a target word in the context of an Arabic sentence. To score a word, wrap the target span in [TGT] and [/TGT]; the model's regression head then outputs a single float. AGS complements the Arabic Level of Dialectness (ALDi) by quantifying how general or specific a word's usage is within its context.

Repo: Sanadshabann/AGS
Base model: CAMeL-Lab/bert-base-arabic-camelbert-mix
Task: regression (num_labels=1, problem_type="regression")
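
Wrapping the target span can be done with plain string slicing. The helper below is a minimal sketch, not part of the released code: the name `mark_target` and its character-offset interface are assumptions made for illustration.

```python
def mark_target(sentence: str, start: int, end: int) -> str:
    """Wrap sentence[start:end] in [TGT]...[/TGT] (hypothetical helper)."""
    if not 0 <= start < end <= len(sentence):
        raise ValueError("start/end must delimit a non-empty span inside the sentence")
    return f"{sentence[:start]}[TGT]{sentence[start:end]}[/TGT]{sentence[end:]}"


sentence = "هذا مثال مع الكلمة الهدف داخل الجملة."
target = "الكلمة"
# str.index returns the first occurrence; pass explicit offsets for repeated words.
start = sentence.index(target)
marked = mark_target(sentence, start, start + len(target))
print(marked)
```

Working with offsets rather than a search-and-replace keeps the choice of occurrence explicit when the same word appears more than once.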

Usage

Word-level AGS

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tok = AutoTokenizer.from_pretrained("Sanadshabann/AGS", use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained("Sanadshabann/AGS").eval()

text = "هذا مثال مع [TGT]الكلمة[/TGT] الهدف داخل الجملة."
with torch.inference_mode():
    enc = tok(text, return_tensors="pt", truncation=True)
    score = model(**enc).logits.squeeze().item()  # single regression output as a Python float
print("generality:", score)

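Notes: the checkpoint is configured as a single-output regressor (problem_type="regression", num_labels=1), so logits holds the predicted score directly. The model was evaluated with RMSE.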
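
For reference, RMSE over a set of predictions can be computed as in the sketch below. The `predicted` and `gold` values are made up for illustration and are not results from the paper.

```python
import numpy as np


def rmse(predicted, gold) -> float:
    """Root mean squared error between two equally shaped score arrays."""
    predicted = np.asarray(predicted, dtype=np.float64)
    gold = np.asarray(gold, dtype=np.float64)
    if predicted.shape != gold.shape:
        raise ValueError("predicted and gold must have the same shape")
    return float(np.sqrt(np.mean((predicted - gold) ** 2)))


# Illustrative numbers only (not real model outputs or annotations).
predicted = [0.9, 0.4, 0.7]
gold = [1.0, 0.5, 0.5]
print("RMSE:", round(rmse(predicted, gold), 4))
```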

Sentence-level AGS

We score each word in context by wrapping it with [TGT]...[/TGT] (one at a time), then aggregate the word scores using the Generalized Harmonic Mean (GHM):

\[ \mathrm{GHM}_p(\{x_i\}) = \left( \frac{1}{n} \sum_{i=1}^{n} (x_i + \varepsilon)^{-p} \right)^{-1/p} \]

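Here x_i are the n word scores and ε is a small constant that guards against division by zero. With p > 1, low word scores carry more weight than they do in the standard harmonic mean, so the sentence score emphasizes specificity. Setting p = 1 gives the standard harmonic mean.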
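
The effect of p is easiest to see on a toy example. The sketch below re-implements the formula in a standalone function (named `ghm` here for illustration) and applies it to made-up scores. As p grows, the result moves away from the arithmetic mean and toward the lowest score.

```python
import numpy as np


def ghm(scores, p: float = 2.0, eps: float = 1e-8) -> float:
    """Generalized Harmonic Mean: (mean((x_i + eps)^(-p)))^(-1/p)."""
    x = np.asarray(scores, dtype=np.float64)
    return float(np.mean((x + eps) ** (-p)) ** (-1.0 / p))


# Made-up word scores: two fairly general words and one specific word.
toy_scores = [0.9, 0.8, 0.2]

print(f"arithmetic mean: {np.mean(toy_scores):.4f}")
for p in (1.0, 2.0, 4.0):
    print(f"p={p}: GHM={ghm(toy_scores, p):.4f}")
print(f"minimum score:   {min(toy_scores):.4f}")
```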

import re, numpy as np, torch
from typing import List, Tuple
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def simple_ar_word_split(text: str) -> List[str]:
    # Split by whitespace only
    return [t for t in text.split() if t.strip()]

@torch.inference_mode()
def predict_word_scores(sentence: str, tok, model, device: str = None, max_words: int = 64) -> Tuple[List[str], List[float]]:
    device = device or ("cuda" if torch.cuda.is_available() else "cpu")
    words = simple_ar_word_split(sentence)[:max_words]
    if not words:
        return [], []
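    # Mark each word by character offset so that repeated words and words with
    # attached punctuation are wrapped at their own position. (A \b-anchored regex
    # fails to match tokens such as "الجملة." that end in punctuation.)
    marked_variants, cursor = [], 0
    for w in words:
        start = sentence.index(w, cursor)
        cursor = start + len(w)
        marked_variants.append(f"{sentence[:start]}[TGT]{w}[/TGT]{sentence[cursor:]}")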
    enc = tok(marked_variants, return_tensors="pt", truncation=True, padding=True)
    enc = {k: v.to(device) for k, v in enc.items()}
    logits = model.to(device)(**enc).logits.squeeze(-1)  # [batch]
    scores = [float(x) for x in logits.detach().cpu()]
    return words, scores

def aggregate_sentence_score(word_scores, p: float = 2.0, eps: float = 1e-8) -> float:
    # Generalized Harmonic Mean (GHM):
    #   GHM_p = ( mean( (x_i + eps)^(-p) ) )^(-1/p)
    word_scores = np.array(word_scores, dtype=np.float64)
    ghm = (np.mean((word_scores + eps) ** (-p))) ** (-1.0 / p)
    return float(ghm)

# Example
repo = "Sanadshabann/AGS"
tok = AutoTokenizer.from_pretrained(repo, use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained(repo).eval()

sentence = "هذا مثال بسيط لقياس العمومية على مستوى الجملة."
words, scores = predict_word_scores(sentence, tok, model)
sent_score = aggregate_sentence_score(scores)

print("words:", words[:10], "...")
print("word_scores:", [round(s, 4) for s in scores[:10]])
print("sentence_score (GHM, p=2):", round(sent_score, 4))
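
A regression head is unbounded, so a word score can occasionally come out at or below zero. Raising a negative number to a non-integer negative power yields NaN in NumPy, and a zero score makes the ε-only term dominate the mean. One possible guard, sketched below, is to clip scores to a small positive floor before aggregating. The name `safe_ghm`, the `floor` parameter, and its default value are assumptions for illustration and are not part of the released code or the paper.

```python
import numpy as np


def safe_ghm(scores, p: float = 2.0, floor: float = 1e-3) -> float:
    """GHM with scores clipped to `floor` first (hypothetical guard, not the authors' method)."""
    x = np.asarray(scores, dtype=np.float64)
    if x.size == 0:
        raise ValueError("need at least one word score")
    x = np.clip(x, floor, None)  # assumption: valid scores are strictly positive
    return float(np.mean(x ** (-p)) ** (-1.0 / p))


# Illustrative scores, one slightly negative, as a regression head might produce.
noisy_scores = [0.9, -0.05, 0.7]
result = safe_ghm(noisy_scores, p=1.5)
print("clipped GHM:", round(result, 4))
```

The choice of floor matters: because low scores dominate the GHM, a very small floor lets a single clipped word pull the sentence score close to zero.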

Citation

If you use this model, please cite:

The Arabic Generality Score: Another Dimension of Modeling Arabic Dialectness
Sanad Shaban, Nizar Habash
MBZUAI; New York University Abu Dhabi
arXiv:2508.17347 (2025). Accepted to EMNLP 2025 Main Conference.

PDF: https://arxiv.org/pdf/2508.17347

BibTeX

@inproceedings{shaban-habash-2025-arabic,
    title = "The {A}rabic Generality Score: Another Dimension of Modeling {A}rabic Dialectness",
    author = "Sha{'}ban, Sanad  and
      Habash, Nizar",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.1524/",
    pages = "29990--30001",
    ISBN = "979-8-89176-332-6",
    abstract = "Arabic dialects form a diverse continuum, yet NLP models often treat them as discrete categories. Recent work addresses this issue by modeling dialectness as a continuous variable, notably through the Arabic Level of Dialectness (ALDi). However, ALDi reduces complex variation to a single dimension. We propose a complementary measure: the Arabic Generality Score (AGS), which quantifies how widely a word is used across dialects. We introduce a pipeline that combines word alignment, etymology-aware edit distance, and smoothing to annotate a parallel corpus with word-level AGS. A regression model is then trained to predict AGS in context. Our approach outperforms strong baselines, including state-of-the-art dialect ID systems, on a multi-dialect benchmark. AGS offers a scalable, linguistically grounded way to model lexical generality, enriching representations of Arabic dialectness. Code is publicly available at https://github.com/CAMeL-Lab/arabic-generality-score."
}

Model size: 0.1B params (Safetensors, tensor type F32)
