MorphoArabia at BAREC 2025 Shared Task: A Hybrid Architecture with Morphological Analysis for Arabic Readability Assessment


This repository contains the official models and results for MorphoArabia, the submission to the BAREC 2025 Shared Task on Arabic Readability Assessment.

By: Fatimah Mohamed Emad Elden

Cairo University



Model Description

This project introduces a morphologically-aware approach for assessing the readability of Arabic text. The system is built around a fine-tuned regression model designed to process morphologically analyzed text. For the Constrained and Open tracks of the shared task, this core model is extended into a hybrid architecture that incorporates seven engineered lexical features.

A key element of this system is its deep morphological preprocessing pipeline, which uses the CAMEL Tools d3tok analyzer. This allows the model to capture linguistic complexities that are often missed by surface-level tokenization methods. This approach proved to be highly effective, achieving a peak Quadratic Weighted Kappa (QWK) score of 84.2 on the strict sentence-level test set.

The model predicts a readability score on a 19-level scale, from 1 (easiest) to 19 (hardest), for a given Arabic sentence or document.


Hybrid Arabic Readability Model (Constrained Track - Document Level)

This repository contains a fine-tuned hybrid model for document-level Arabic readability assessment. It was trained for the Constrained Track of the BAREC competition.

The model combines the textual understanding of CAMeL-Lab/readability-arabertv2-d3tok-reg with 7 additional lexical features to produce a regression-based readability score for full documents.

NOTE: This is a custom model architecture. You must use the trust_remote_code=True argument when loading it.

How to Use

The model requires both the document text and a tensor containing 7 numerical features.

Step 1: Installation

Install the necessary libraries:

pip install transformers torch pandas arabert
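
If you also want to reproduce the d3tok preprocessing described under Training Procedure, you will likely need CAMEL Tools as well (an assumed extra dependency; it is not required for the inference example below):

pip install camel-tools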

Step 2: Full Inference Example

This example shows how to preprocess a document, extract features, and get a readability score.

import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel
from arabert.preprocess import ArabertPreprocessor

# --- 1. Define the Feature Engineering Function ---
def get_lexical_features(text, lexicon):
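    # Seven features, in order: character count, word count, average word length,
    # mean and max word difficulty from the lexicon, count of "hard" words
    # (difficulty > 4), and the ratio of out-of-lexicon words.
    # Words missing from the lexicon default to a difficulty of 3.0.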
    words = text.split()
    if not words: return [0.0] * 7
    word_difficulties = [lexicon.get(word, 3.0) for word in words]
    features = [
        float(len(text)), float(len(words)),
        float(np.mean([len(w) for w in words]) if words else 0.0),
        float(np.mean(word_difficulties)), float(np.max(word_difficulties)),
        float(np.sum(np.array(word_difficulties) > 4)),
        float(len([w for w in words if w not in lexicon]) / len(words))
    ]
    return features

# --- 2. Initialize Models and Processors ---
repo_id = "FatimahEmadEldin/Constrained-Track-Document-Bassline-Readability-Arabertv2-d3tok-reg"
arabert_preprocessor = ArabertPreprocessor(model_name="aubmindlab/bert-large-arabertv2")
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)

# --- 3. Prepare Input Document and Lexicon ---
# For a real use case, load the full SAMER lexicon.
sample_lexicon = {'ุฌู…ู„ุฉ': 2.5, 'ุนุฑุจูŠุฉ': 3.1, 'ุจุณูŠุทุฉ': 1.8, 'ุงู„ู†ุต': 2.8, 'ุทูˆูŠู„': 3.5}
document_text = "ู‡ุฐุง ู…ุซุงู„ ู„ุฌู…ู„ุฉ ุนุฑุจูŠุฉ ุจุณูŠุทุฉ. ู‡ุฐุง ุงู„ู†ุต ุฃุทูˆู„ ู‚ู„ูŠู„ุงู‹ ู…ู† ุงู„ู…ุซุงู„ ุงู„ุณุงุจู‚."

# --- 4. Run the Full Pipeline ---
preprocessed_text = arabert_preprocessor.preprocess(document_text)
numerical_features_list = get_lexical_features(preprocessed_text, sample_lexicon)
numerical_features = torch.tensor([numerical_features_list], dtype=torch.float)

inputs = tokenizer(preprocessed_text, return_tensors="pt", padding=True, truncation=True, max_length=512)
inputs['extra_features'] = numerical_features # The model expects 'extra_features'

# --- 5. Perform Inference ---
model.eval()
with torch.no_grad():
    logits = model(**inputs)[1] # The model returns (loss, logits)

# --- 6. Process the Output ---
predicted_score = logits.item()
final_level = round(max(0, min(18, predicted_score))) + 1

print(f"Input Document: '{document_text}'")
print(f"Raw Regression Score: {predicted_score:.4f}")
print(f"Predicted Readability Level (1-19): {final_level}")
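
To score many documents at once, the same pipeline can be wrapped in a small batch helper. This is a minimal sketch built on the objects defined above; it assumes the custom model accepts batched extra_features, which has not been verified against the repository code.

def score_documents(texts, lexicon, batch_size=8):
    """Hypothetical helper: returns a 1-19 readability level for each input document."""
    levels = []
    model.eval()
    for start in range(0, len(texts), batch_size):
        batch = [arabert_preprocessor.preprocess(t) for t in texts[start:start + batch_size]]
        feats = torch.tensor([get_lexical_features(t, lexicon) for t in batch], dtype=torch.float)
        enc = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=512)
        enc["extra_features"] = feats
        with torch.no_grad():
            scores = model(**enc)[1].view(-1)  # assumes (loss, logits) output as above
        levels.extend(round(max(0, min(18, s.item()))) + 1 for s in scores)
    return levels

print(score_documents([document_text], sample_lexicon))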

โš™๏ธ Training Procedure

The system employs two distinct architectures based on the track's constraints:

  • Strict Track: This track uses a base regression model, CAMeL-Lab/readability-arabertv2-d3tok-reg, fine-tuned directly on the BAREC dataset.
  • Constrained and Open Tracks: These tracks utilize a hybrid model. This architecture combines the deep contextual understanding of the Transformer with explicit numerical features. The final representation for a sentence is created by concatenating the Transformer's [CLS] token embedding with a 7-dimensional vector of engineered lexical features derived from the SAMER lexicon (a minimal sketch of this head follows the list).
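
The sketch below illustrates the concatenation described above. The module and attribute names are assumptions for illustration only, not the exact classes shipped with the repository.

import torch
import torch.nn as nn
from transformers import AutoModel

class HybridReadabilityRegressor(nn.Module):
    # Illustrative only: fuses the [CLS] embedding with 7 lexical features
    # and regresses a single readability score.
    def __init__(self, encoder_name="aubmindlab/bert-large-arabertv2", num_extra_features=7):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden_size = self.encoder.config.hidden_size
        self.regressor = nn.Sequential(
            nn.Linear(hidden_size + num_extra_features, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, 1),
        )

    def forward(self, input_ids, attention_mask, extra_features):
        # [CLS] token embedding from the last hidden layer
        cls_embedding = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]
        fused = torch.cat([cls_embedding, extra_features], dim=-1)
        return self.regressor(fused)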

A critical component of the system is its preprocessing pipeline, which leverages the CAMEL Tools d3tok format. The d3tok analyzer performs a deep morphological analysis by disambiguating words in context and then segmenting them into their constituent morphemes.
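
For reference, the d3tok step itself can be reproduced with CAMEL Tools roughly as follows. This is a sketch, assuming the camel-tools package and its pretrained MLE disambiguator are installed; the exact tokenizer classes and scheme name are taken from the CAMEL Tools documentation, not from this repository.

from camel_tools.disambig.mle import MLEDisambiguator
from camel_tools.tokenizers.word import simple_word_tokenize
from camel_tools.tokenizers.morphological import MorphologicalTokenizer

# Disambiguate each word in context, then segment it into morphemes (d3tok scheme).
mle = MLEDisambiguator.pretrained()
d3tok_tokenizer = MorphologicalTokenizer(disambiguator=mle, scheme='d3tok', split=True)

sentence = "هذا مثال لجملة عربية بسيطة."
words = simple_word_tokenize(sentence)
print(d3tok_tokenizer.tokenize(words))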

Frameworks

  • PyTorch
  • Hugging Face Transformers

๐Ÿ“Š Evaluation Results

The models were evaluated on the blind test set provided by the BAREC organizers. The primary metric for evaluation is the Quadratic Weighted Kappa (QWK), which penalizes larger disagreements more severely.
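
As a quick sanity check outside the official scorer, QWK can be computed with scikit-learn (a minimal sketch with made-up labels; the official evaluation script may differ in label handling and rounding):

from sklearn.metrics import cohen_kappa_score

# Quadratic weights penalize large disagreements (e.g. level 3 vs. 15)
# much more heavily than off-by-one errors.
gold_levels = [1, 5, 12, 19, 7]   # illustrative gold readability levels
pred_levels = [2, 5, 10, 19, 8]   # illustrative system predictions
print(cohen_kappa_score(gold_levels, pred_levels, weights="quadratic"))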

Final Test Set Scores (QWK)

Track        Task      Dev (QWK)   Test (QWK)
Strict       Sentence  0.823       84.2
Strict       Document  0.823*      79.9
Constrained  Sentence  0.810       82.9
Constrained  Document  0.835*      75.5
Open         Sentence  0.827       83.6
Open         Document  0.827*      79.2

*Document-level dev scores are based on the performance of the sentence-level model on the validation set.


๐Ÿ“œ Citation

If you use this work, please cite the paper:

@inproceedings{eldin2025morphoarabia,
    title={{MorphoArabia at BAREC 2025 Shared Task: A Hybrid Architecture with Morphological Analysis for Arabic Readability Assessment}},
    author={Eldin, Fatimah Mohamed Emad},
    year={2025},
    booktitle={Proceedings of the BAREC 2025 Shared Task},
    eprint={25XX.XXXXX},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}