Arabic Turn Detector: Data Collection, Preprocessing, and Fine-Tuning

📋 Objective

The objective of this project is to fine-tune a model for Arabic end-of-utterance (EOU) detection. The model outputs the probability that a speaker has finished their turn, given the transcription received so far. This capability is crucial for the accuracy and responsiveness of real-time AI voice agents in Arabic-speaking contexts, enabling more natural and effective conversational interactions.

📁 Project Structure

Hams's_Task/
├── notebook/
│   ├── data_ingestion.ipynb          # Data collection and functions for labeling the data correctly
│   └── data_preprocessing.ipynb      # Data preprocessing and EDA on the collected data
├── fine_tuning/
│   ├── distilbert-base-multilingual-cased.py      # Fine-tuned DistilBERT model
│   ├── fine_tuning_SmolLM2.py                     # Fine-tuned SmolLM2 (chosen because its performance was good)
│   ├── SmolLM2_Artifacts                          # Artifacts of the SmolLM2 model
│   └── distillbert_artifacts                      # Confusion matrix image for DistilBERT
├── README.md
├── raw_data/
│   ├── dataset.csv                   # Dataset taken from GitHub
│   ├── labeled_data.csv              # GitHub dataset after initial processing (e.g. removing unnecessary columns)
│   ├── yt_saudi_turns.csv            # Collected YouTube transcript data (Saudi conversational)
│   ├── final_raw_data.csv            # Combined initial datasets (labeled_data.csv + yt_saudi_turns.csv)
│   ├── correct_labeled_data.csv      # final_raw_data.csv with labeling errors fixed
│   ├── preprocessed_data.csv         # correct_labeled_data.csv after basic preprocessing and EDA
│   ├── manual_data_collection/
│   │   └── manual_data.csv           # Manually collected data
│   └── clean_data/
│       └── processed_data.csv        # Final cleaned dataset (manual_data.csv + preprocessed_data.csv)
└── model_data/                       # processed_data.csv split into training and validation sets
    ├── training_data.csv             # Training split (80%)
    └── validation_data.csv           # Validation split (20%)

🔄 Data Processing Pipeline

Step 1: Initial Data Collection

Source: GitHub dataset (dataset.csv)

  • ✅ Downloaded conversational Arabic dataset from GitHub
  • ✅ Removed unnecessary columns
  • ✅ Applied custom labeling function to:
    • Extract utterances from conversation columns
    • Create segments from each utterance
    • Label segments as "turn-end" or "not-turn-end"
    • Treat each utterance independently
  • Output: raw_data/labeled_data.csv
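
The segmenting-and-labeling logic can be sketched as follows. This is a minimal reconstruction, not the exact function from data_ingestion.ipynb: it assumes each utterance is expanded into word prefixes, with only the full utterance labeled as a turn end.

```python
def label_utterance(utterance: str):
    """Split an utterance into word-prefix segments and label each one.

    Every partial prefix is labeled 0 (not-turn-end); only the complete
    utterance is labeled 1 (turn-end).
    """
    words = utterance.split()
    segments = []
    for i in range(1, len(words) + 1):
        segment = " ".join(words[:i])
        label = 1 if i == len(words) else 0
        segments.append((segment, label))
    return segments

# A three-word utterance yields three segments, only the last a turn end.
print(label_utterance("هل تقدر توصلني"))
```

Treating each utterance independently means no segment carries context from a previous turn, which matches the bullet list above.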

Step 2: YouTube Data Collection

Source: Saudi conversational video transcripts

  • ✅ Collected additional data from YouTube video transcripts
  • ✅ Applied custom function with specific rules for turn-end detection
  • ✅ Increased dataset size for better model performance
  • Output: raw_data/yt_saudi_turns.csv
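
The "specific rules" for transcripts are not listed in the repo; a plausible punctuation-based heuristic in that spirit (the rule set here is an assumption) looks like:

```python
# Hypothetical turn-end rule: a transcript segment counts as a turn end
# if it ends with sentence-final punctuation (Arabic or Latin).
FINAL_PUNCT = ("؟", "?", "!", ".")

def is_turn_end(segment: str) -> bool:
    """Return True if the segment ends with final punctuation."""
    return segment.strip().endswith(FINAL_PUNCT)

print(is_turn_end("هل تقدر توصلني بكرا؟"))   # True
print(is_turn_end("بس انت ما قلت لي متى نبدأ"))  # False
```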

Step 3: Data Combination

  • ✅ Combined YouTube data with initial GitHub data
  • ✅ Created larger unified dataset
  • Output: raw_data/final_raw_data.csv
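
A minimal sketch of the combination step with pandas; the "text"/"label" column names and the duplicate-dropping are assumptions, not confirmed by the repo:

```python
import pandas as pd

def combine_datasets(frames):
    """Concatenate labeled datasets and drop duplicate texts."""
    combined = pd.concat(frames, ignore_index=True)
    return combined.drop_duplicates(subset="text").reset_index(drop=True)

# Assumed usage with the two raw CSVs:
# combined = combine_datasets([pd.read_csv("raw_data/labeled_data.csv"),
#                              pd.read_csv("raw_data/yt_saudi_turns.csv")])
# combined.to_csv("raw_data/final_raw_data.csv", index=False)
```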

Step 4: Label Correction

  • ✅ Fixed incorrectly labeled text segments
  • ✅ Applied custom function with specified correction rules
  • ✅ Improved data quality and accuracy
  • Output: raw_data/correct_labeled_data.csv
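
A hypothetical correction function in the spirit of the step above (the notebook's exact rules and connective list are assumptions): segments ending in final punctuation are forced to turn-end, while segments ending in a dangling connective are forced to not-turn-end.

```python
# Assumed list of Arabic connectives that signal an unfinished turn.
CONNECTIVES = ("بس", "يعني", "لكن", "و")

def correct_label(text: str, label: int) -> int:
    text = text.strip()
    if text.endswith(("؟", "?", "!", ".")):
        return 1  # final punctuation implies the turn ended
    words = text.split()
    if words and words[-1] in CONNECTIVES:
        return 0  # a trailing connective implies the speaker will continue
    return label  # otherwise keep the existing label
```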

Step 5: Basic Preprocessing

  • ✅ Applied standard text preprocessing techniques
  • ✅ Cleaned and normalized text data
  • Output: raw_data/preprocessed_data.csv
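
A minimal sketch of Arabic-aware cleaning; the notebook's exact steps are not listed, so these particular normalizations (diacritic and tatweel removal, alef unification, whitespace collapsing) are assumptions:

```python
import re

DIACRITICS = re.compile(r"[\u064B-\u0652]")  # fathatan .. sukun

def preprocess(text: str) -> str:
    text = DIACRITICS.sub("", text)              # strip diacritics
    text = text.replace("\u0640", "")            # strip tatweel (ـ)
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # آ/أ/إ -> ا
    return " ".join(text.split())                # collapse whitespace

print(preprocess("شكراً  كثير"))  # -> "شكرا كثير"
```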

Step 6: Manual Data Integration

  • ✅ Added manually collected data from raw_data/manual_data_collection/manual_data.csv
  • ✅ Combined with preprocessed data for final dataset
  • Output: raw_data/clean_data/processed_data.csv

Step 7: Train/Validation Split

  • ✅ Split final dataset into training and validation sets
  • ✅ Prepared data for model training
  • Output:
    • model_data/training_data.csv
    • model_data/validation_data.csv
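
The split can be sketched with scikit-learn. The 80/20 ratio comes from the project structure above; the stratification, seed, and "label" column name are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_dataset(df: pd.DataFrame, seed: int = 42):
    """80/20 split, stratified on the label so both sets keep the
    same class balance."""
    train_df, val_df = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df["label"]
    )
    return train_df, val_df

# Assumed usage:
# train_df, val_df = split_dataset(pd.read_csv("raw_data/clean_data/processed_data.csv"))
# train_df.to_csv("model_data/training_data.csv", index=False)
# val_df.to_csv("model_data/validation_data.csv", index=False)
```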

📊 Data Statistics

| Dataset | Location | Description |
|---------|----------|-------------|
| Initial | raw_data/labeled_data.csv | GitHub conversational data |
| YouTube | raw_data/yt_saudi_turns.csv | Saudi video transcripts |
| Combined | raw_data/final_raw_data.csv | Merged initial datasets |
| Corrected | raw_data/correct_labeled_data.csv | Fixed labeling errors |
| Preprocessed | raw_data/preprocessed_data.csv | Basic text cleaning |
| Manual | raw_data/manual_data_collection/manual_data.csv | Hand-collected data |
| Final | raw_data/clean_data/processed_data.csv | Complete cleaned dataset |
| Training | model_data/training_data.csv | Model training data |
| Validation | model_data/validation_data.csv | Model validation data |

🛠️ Custom Functions Used

1. Initial Labeling Function

  • Extracts utterances from conversation columns
  • Creates text segments
  • Labels each segment for turn-end detection

2. YouTube Processing Function

  • Processes video transcript data
  • Applies turn-end detection rules
  • Formats data consistently

3. Label Correction Function

  • Identifies and fixes mislabeled segments
  • Applies correction rules
  • Improves dataset quality

4. Preprocessing Function

  • Cleans and normalizes text
  • Prepares data for model training
  • Handles Arabic text specifics

End-of-Utterance (EOU) Detection Model Comparison

This repository contains the results of fine-tuning two different models for End-of-Utterance (EOU) detection in Arabic text. The task involves binary classification to determine whether a given text represents the end of an utterance or not.

📊 Model Overview

We compared two models:

  1. DistilBERT-Base-Multilingual-Cased (10 epochs)
  2. SmolLM2-135M (5 epochs)

🎯 Task Description

End-of-Utterance Detection is a binary classification task where:

  • Class 0 (No EOU): The text does not represent the end of an utterance
  • Class 1 (EOU): The text represents the end of an utterance

🤖 Model 1: DistilBERT-Base-Multilingual-Cased

Training Progress

| Step | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 Score |
|------|---------------|-----------------|----------|-----------|--------|----------|
| 100 | 0.382900 | 0.301439 | 0.864865 | 0.864664 | 0.864865 | 0.863368 |
| 200 | 0.302400 | 0.269788 | 0.874266 | 0.882617 | 0.874266 | 0.875527 |
| 300 | 0.201800 | 0.231544 | 0.914219 | 0.915077 | 0.914219 | 0.913335 |
| 400 | 0.146300 | 0.202987 | 0.921269 | 0.922963 | 0.921269 | 0.921644 |
| 500 | 0.043100 | 0.224561 | 0.936545 | 0.936483 | 0.936545 | 0.936316 |
| 600 | 0.046800 | 0.230006 | 0.942421 | 0.942320 | 0.942421 | 0.942303 |
| 700 | 0.016800 | 0.218244 | 0.949471 | 0.949409 | 0.949471 | 0.949428 |
| 800 | 0.004300 | 0.245689 | 0.950646 | 0.951005 | 0.950646 | 0.950371 |
| 900 | 0.000400 | 0.236911 | 0.950646 | 0.950692 | 0.950646 | 0.950468 |
| 1000 | 0.001400 | 0.234034 | 0.950646 | 0.950631 | 0.950646 | 0.950500 |

Final Evaluation Results

| Metric | Score |
|--------|-------|
| Overall Accuracy | 0.9506 |
| Overall Precision | 0.9506 |
| Overall Recall | 0.9506 |
| Overall F1-Score | 0.9505 |

Per-Class Performance

| Class | Precision | Recall | F1-Score |
|-------|-----------|--------|----------|
| Class 0 (No EOU) | 0.9511 | 0.9693 | 0.9602 |
| Class 1 (EOU) | 0.9498 | 0.9210 | 0.9352 |
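
Per-class numbers like these can be recomputed from raw predictions with scikit-learn. The y_true/y_pred values below are toy placeholders, not the actual validation outputs:

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0]

# One precision/recall/F1 triple per class label.
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, labels=[0, 1])
print(f"Class 0 (No EOU): P={prec[0]:.2f} R={rec[0]:.2f} F1={f1[0]:.2f}")
print(f"Class 1 (EOU):    P={prec[1]:.2f} R={rec[1]:.2f} F1={f1[1]:.2f}")
```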

Confusion Matrix

DistilBERT Confusion Matrix

Model Analysis

⚠️ Issue Identified: The validation loss starts increasing after epoch 4 (around step 400 in the table above) while the training loss continues to decrease, indicating overfitting. Although accuracy and the other metrics keep improving, the model is not generalizing well to unseen data.

🚀 Model 2: SmolLM2-135M (Recommended)

Training Progress

| Step | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 Score |
|------|---------------|-----------------|----------|-----------|--------|----------|
| 200 | 0.392800 | 0.312217 | 0.870740 | 0.870495 | 0.870740 | 0.869458 |
| 400 | 0.187000 | 0.229390 | 0.914219 | 0.922040 | 0.914219 | 0.912031 |
| 600 | 0.085800 | 0.260901 | 0.930670 | 0.932843 | 0.930670 | 0.931060 |
| 800 | 0.034100 | 0.246878 | 0.942421 | 0.942771 | 0.942421 | 0.942529 |
| 1000 | 0.010300 | 0.222678 | 0.949471 | 0.949445 | 0.949471 | 0.949457 |

Final Evaluation Results

| Metric | Score |
|--------|-------|
| Overall Accuracy | 0.9495 |
| Overall Precision | 0.9494 |
| Overall Recall | 0.9495 |
| Overall F1-Score | 0.9495 |

Per-Class Performance

| Class | Precision | Recall | F1-Score |
|-------|-----------|--------|----------|
| Class 0 (No EOU) | 0.9579 | 0.9598 | 0.9589 |
| Class 1 (EOU) | 0.9360 | 0.9331 | 0.9346 |

Confusion Matrix

SmolLM2 Confusion Matrix

✅ Strong Performance: The validation loss trends downward overall (ending at its lowest value) while accuracy and the other metrics improve, indicating good generalization. This model shows no clear signs of overfitting and is recommended for production use.


🧪 Test Results Comparison

Sample Predictions

| Arabic Text | DistilBERT Prediction | SmolLM2 Prediction |
|-------------|----------------------|--------------------|
| طيب، بس لازم نتفق أول. | No EOU (0.9864) | No EOU (0.9999) |
| هل تقدر توصلني بكرا؟ | EOU (0.9996) | EOU (1.0000) |
| أنا حاولت، لكن ما فهمت الدرس. | EOU (0.9992) | EOU (0.9900) |
| بس انت ما قلت لي متى نبدأ | No EOU (0.8094) | No EOU (0.9968) |
| شكراً كثير على المساعدة. | EOU (0.9994) | EOU (0.9999) |
| طيب نكمل بعدين؟ | EOU (0.9996) | EOU (1.0000) |
| يعني أنا كنت أنتظر منك ترد علي | EOU (0.7838) | No EOU (0.9912) |
| أنا آسف إذا زعلتك. | EOU (0.9674) | No EOU (0.9537) |
| لا تنسى ترجع المفتاح بعدين | No EOU (0.9992) | No EOU (0.9998) |
| هذا الشيء ما توقعت يصير! | EOU (0.9996) | EOU (1.0000) |
| هو قال لي أنو لازم ننتبه | EOU (0.9993) | EOU (0.7345) |
| إيش رأيك نطلب بيتزا؟ | EOU (0.9996) | EOU (1.0000) |
| أصلاً ما كان المفروض نجي | No EOU (0.9970) | EOU (0.9608) |
| طيب، نكمل الحين ولا بعدين؟ | EOU (0.9996) | EOU (1.0000) |
| أنا ما أقدر أقرر لحالي | EOU (0.9991) | No EOU (0.9815) |
| والله ما كنت أقصد. | No EOU (0.9991) | No EOU (1.0000) |

📈 Performance Summary

| Model | Accuracy | Precision | Recall | F1-Score | Overfitting |
|-------|----------|-----------|--------|----------|-------------|
| SmolLM2-135M | 0.9495 | 0.9494 | 0.9495 | 0.9495 | ✅ No |
| DistilBERT-Multilingual | 0.9506 | 0.9506 | 0.9506 | 0.9505 | ⚠️ Yes |

๐Ÿ† Conclusion

SmolLM2-135M is the recommended model for this task because:

  1. Better Generalization: No clear signs of overfitting, with validation loss still decreasing at the end of training
  2. Comparable Performance: Within about 0.1% of DistilBERT on every metric
  3. More Reliable: Stable training behavior and consistent predictions
  4. Efficiency: Achieved comparable results with fewer epochs (5 vs 10)

🔧 Usage

To use the trained SmolLM2-135M model:

# Load the fine-tuned model and tokenizer (the path is a placeholder)
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("path/to/smollm2-eou-detector")
model = AutoModelForSequenceClassification.from_pretrained("path/to/smollm2-eou-detector")

# Make a prediction
text = "هل تقدر توصلني بكرا؟"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
print(f"EOU probability: {probs[1].item():.4f}")