AGS: Arabic Generality Score (Sanadshabann/AGS)
AGS predicts a continuous Generality score for a target word in context in Arabic.
To score a word, wrap the target span in [TGT] and [/TGT] inside its sentence; the model returns a single float from its regression head.
AGS complements the Arabic Level of Dialectness (ALDi) by quantifying how general (widely used across dialects) versus specific a word's usage is in its context.
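As a minimal sketch of the input format described above, the helper below wraps one whitespace-delimited word in the [TGT] and [/TGT] markers. The name mark_target and the choice to address words by position are illustrative assumptions, not part of the released code.

```python
from typing import List


def mark_target(sentence: str, word_index: int) -> str:
    """Wrap the word at `word_index` (whitespace split) in [TGT]...[/TGT].

    Hypothetical helper for illustration; not part of the AGS repository.
    """
    words: List[str] = sentence.split()
    if not 0 <= word_index < len(words):
        raise IndexError(f"word_index {word_index} out of range for {len(words)} words")
    words[word_index] = f"[TGT]{words[word_index]}[/TGT]"
    return " ".join(words)


# "This is a simple example", with the second word marked as the target.
marked = mark_target("هذا مثال بسيط", 1)
print(marked)
```

Addressing the target by position rather than by string search keeps the marking unambiguous when the same word occurs more than once in the sentence.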
Repo: Sanadshabann/AGS
Base model: CAMeL-Lab/bert-base-arabic-camelbert-mix
Task: regression (num_labels=1, problem_type="regression")
Usage
Word-level AGS
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
tok = AutoTokenizer.from_pretrained("Sanadshabann/AGS", use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained("Sanadshabann/AGS").eval()
# "This is an example with the target word inside the sentence."
text = "هذا مثال مع [TGT]الكلمة[/TGT] الهدف داخل الجملة."
with torch.inference_mode():
    score = model(**tok(text, return_tensors="pt", truncation=True)).logits.squeeze().item()
print("generality:", score)
Notes: problem_type=regression, num_labels=1. Evaluated with RMSE.
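Since the model is evaluated with RMSE, the sketch below shows how that metric could be computed for a set of predictions. The function name rmse is an assumption, and the gold and predicted values are made-up numbers, not real AGS outputs.

```python
import math
from typing import Sequence


def rmse(predictions: Sequence[float], targets: Sequence[float]) -> float:
    """Root mean squared error between predicted and gold scores."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("at least one prediction is required")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values for illustration only.
gold = [0.0, 0.5, 1.0]
pred = [0.1, 0.5, 0.8]
print("RMSE:", round(rmse(pred, gold), 4))
```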
Sentence-level AGS
We score each word in context by wrapping it with [TGT]...[/TGT] (one at a time), then aggregate the word scores using the Generalized Harmonic Mean (GHM):
\[ \mathrm{GHM}_p(\{x_i\}) = \Big( \frac{1}{n}\sum_{i=1}^{n} (x_i + \varepsilon)^{-p} \Big)^{-1/p} \]
With p > 1, lower word scores carry more weight than under the standard harmonic mean, so specific words dominate the sentence score. Set p = 1 to recover the standard harmonic mean.
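To make the effect of p concrete, here is a small worked sketch in plain Python. The three word scores are invented for illustration, and the function ghm simply mirrors the formula above; it is not the packaged implementation.

```python
from typing import Sequence


def ghm(scores: Sequence[float], p: float = 2.0, eps: float = 1e-8) -> float:
    """Generalized harmonic mean: (mean((x_i + eps)^(-p)))^(-1/p)."""
    if not scores:
        raise ValueError("scores must be non-empty")
    mean_inv = sum((x + eps) ** (-p) for x in scores) / len(scores)
    return mean_inv ** (-1.0 / p)


# Invented word scores: two fairly general words and one specific word.
scores = [0.9, 0.8, 0.2]
print(f"arithmetic mean: {sum(scores) / len(scores):.3f}")
for p in (1.0, 2.0, 4.0):
    print(f"p={p}: GHM={ghm(scores, p):.3f}")
```

Running this shows the aggregate moving from the arithmetic mean toward the lowest score (0.2) as p grows, which is the sense in which larger p emphasizes specificity.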
import re, numpy as np, torch
from typing import List, Tuple
from transformers import AutoTokenizer, AutoModelForSequenceClassification
def simple_ar_word_split(text: str) -> List[str]:
    # Split on whitespace only; punctuation stays attached to its word.
    return text.split()
@torch.inference_mode()
def predict_word_scores(sentence: str, tok, model, device: str = None, max_words: int = 64) -> Tuple[List[str], List[float]]:
    device = device or ("cuda" if torch.cuda.is_available() else "cpu")
    all_words = simple_ar_word_split(sentence)
    words = all_words[:max_words]  # score at most max_words targets
    if not words:
        return [], []
    # Mark each word by position. A regex with \b would skip words that end in
    # punctuation and would always tag the first occurrence of a repeated word.
    marked_variants = [
        " ".join(all_words[:i] + [f"[TGT]{w}[/TGT]"] + all_words[i + 1:])
        for i, w in enumerate(words)
    ]
    enc = tok(marked_variants, return_tensors="pt", truncation=True, padding=True)
    enc = {k: v.to(device) for k, v in enc.items()}
    logits = model.to(device)(**enc).logits.squeeze(-1)  # shape: [batch]
    scores = logits.detach().cpu().tolist()
    return words, scores
def aggregate_sentence_score(word_scores, p: float = 2.0, eps: float = 1e-8) -> float:
    # Generalized Harmonic Mean (GHM):
    # GHM_p = ( mean( (x_i + eps)^(-p) ) )^(-1/p)
    # Expects non-negative scores: a negative value with non-integer p gives NaN.
    word_scores = np.array(word_scores, dtype=np.float64)
    ghm = (np.mean((word_scores + eps) ** (-p))) ** (-1.0 / p)
    return float(ghm)
# Example
repo = "Sanadshabann/AGS"
tok = AutoTokenizer.from_pretrained(repo, use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained(repo).eval()
# "This is a simple example of measuring generality at the sentence level."
sentence = "هذا مثال بسيط لقياس العمومية على مستوى الجملة."
words, scores = predict_word_scores(sentence, tok, model)
sent_score = aggregate_sentence_score(scores)
print("words:", words[:10], "...")
print("word_scores:", [round(s, 4) for s in scores[:10]])
print("sentence_score (GHM, p=2):", round(sent_score, 4))
Citation
If you use this model, please cite:
The Arabic Generality Score: Another Dimension of Modeling Arabic Dialectness
Sanad Shaban, Nizar Habash
MBZUAI; New York University Abu Dhabi
arXiv:2508.17347 (2025). Accepted to EMNLP 2025 Main Conference.
PDF: https://arxiv.org/pdf/2508.17347
BibTeX
@inproceedings{shaban-habash-2025-arabic,
title = "The {A}rabic Generality Score: Another Dimension of Modeling {A}rabic Dialectness",
author = "Sha{'}ban, Sanad and
Habash, Nizar",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.1524/",
pages = "29990--30001",
ISBN = "979-8-89176-332-6",
abstract = "Arabic dialects form a diverse continuum, yet NLP models often treat them as discrete categories. Recent work addresses this issue by modeling dialectness as a continuous variable, notably through the Arabic Level of Dialectness (ALDi). However, ALDi reduces complex variation to a single dimension. We propose a complementary measure: the Arabic Generality Score (AGS), which quantifies how widely a word is used across dialects. We introduce a pipeline that combines word alignment, etymology-aware edit distance, and smoothing to annotate a parallel corpus with word-level AGS. A regression model is then trained to predict AGS in context. Our approach outperforms strong baselines, including state-of-the-art dialect ID systems, on a multi-dialect benchmark. AGS offers a scalable, linguistically grounded way to model lexical generality, enriching representations of Arabic dialectness. Code is publicly available at https://github.com/CAMeL-Lab/arabic-generality-score."
}