Arabic Turn Detector: Data Collection, Preprocessing, and Fine-Tuning

📋 Objective

The objective of this project is to fine-tune a model for Arabic end-of-utterance (EOU) detection. The model outputs the probability that a speaker has finished their turn, given the transcription received so far. This capability is crucial for the accuracy and responsiveness of real-time AI voice agents in Arabic-speaking contexts, enabling more natural and effective conversational interactions.

📁 Project Structure

Hams's_Task/
├── notebook/
│   ├── data_ingestion.ipynb          # Data collection and functions for labeling the data correctly
│   └── data_preprocessing.ipynb      # Data preprocessing and EDA on the collected data
├── fine_tuning/
│   ├── distilbert-base-multilingual-cased.py      # Fine-tuned DistilBERT model
│   ├── fine_tuning_SmolLM2.py                     # Fine-tuned SmolLM2 (chosen because its performance was good)
│   ├── SmolLM2_Artifacts                          # Artifacts of the SmolLM2 model
│   └── distillbert_artifacts                      # Confusion matrix image for DistilBERT
├── README.md
├── raw_data/
│   ├── dataset.csv                   # Dataset taken from GitHub
│   ├── labeled_data.csv              # GitHub dataset after initial processing (e.g. removing unnecessary columns)
│   ├── yt_saudi_turns.csv            # Collected YouTube transcript data (Saudi conversational)
│   ├── final_raw_data.csv            # Combined initial datasets (labeled_data.csv + yt_saudi_turns.csv)
│   ├── correct_labeled_data.csv      # final_raw_data.csv with labeling errors fixed
│   ├── preprocessed_data.csv         # correct_labeled_data.csv after basic preprocessing and EDA
│   ├── manual_data_collection/
│   │   └── manual_data.csv           # Manually collected data
│   └── clean_data/
│       └── processed_data.csv        # Final cleaned dataset (manual_data.csv + preprocessed_data.csv)
└── model_data/                       # processed_data.csv split into training and validation sets
    ├── training_data.csv             # Training split (80%)
    └── validation_data.csv           # Validation split (20%)

🔄 Data Processing Pipeline

Step 1: Initial Data Collection

Source: GitHub dataset (dataset.csv)

  • ✅ Downloaded conversational Arabic dataset from GitHub
  • ✅ Removed unnecessary columns
  • ✅ Applied custom labeling function to:
    • Extract utterances from conversation columns
    • Create segments from each utterance
    • Label segments as "turn-end" or "not-turn-end"
    • Treat each utterance independently
  • Output: raw_data/labeled_data.csv
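
The segmenting-and-labeling logic can be sketched as follows. This is a minimal reconstruction, not the exact function from data_ingestion.ipynb: it assumes each utterance is expanded into word prefixes, with only the full utterance labeled as a turn end.

```python
def label_utterance(utterance: str):
    """Split an utterance into word-prefix segments and label each one.

    Every partial prefix is labeled 0 (not-turn-end); only the complete
    utterance is labeled 1 (turn-end).
    """
    words = utterance.split()
    segments = []
    for i in range(1, len(words) + 1):
        segment = " ".join(words[:i])
        label = 1 if i == len(words) else 0
        segments.append((segment, label))
    return segments

# A three-word utterance yields three segments, only the last a turn end.
print(label_utterance("هل تقدر توصلني"))
```

Treating each utterance independently means no segment carries context from a previous turn, which matches the bullet list above.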

Step 2: YouTube Data Collection

Source: Saudi conversational video transcripts

  • ✅ Collected additional data from YouTube video transcripts
  • ✅ Applied custom function with specific rules for turn-end detection
  • ✅ Increased dataset size for better model performance
  • Output: raw_data/yt_saudi_turns.csv
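
The "specific rules" for transcripts are not listed in the repo; a plausible punctuation-based heuristic in that spirit (the rule set here is an assumption) looks like:

```python
# Hypothetical turn-end rule: a transcript segment counts as a turn end
# if it ends with sentence-final punctuation (Arabic or Latin).
FINAL_PUNCT = ("؟", "?", "!", ".")

def is_turn_end(segment: str) -> bool:
    """Return True if the segment ends with final punctuation."""
    return segment.strip().endswith(FINAL_PUNCT)

print(is_turn_end("هل تقدر توصلني بكرا؟"))   # True
print(is_turn_end("بس انت ما قلت لي متى نبدأ"))  # False
```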

Step 3: Data Combination

  • ✅ Combined YouTube data with initial GitHub data
  • ✅ Created larger unified dataset
  • Output: raw_data/final_raw_data.csv
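
A minimal sketch of the combination step with pandas; the "text"/"label" column names and the duplicate-dropping are assumptions, not confirmed by the repo:

```python
import pandas as pd

def combine_datasets(frames):
    """Concatenate labeled datasets and drop duplicate texts."""
    combined = pd.concat(frames, ignore_index=True)
    return combined.drop_duplicates(subset="text").reset_index(drop=True)

# Assumed usage with the two raw CSVs:
# combined = combine_datasets([pd.read_csv("raw_data/labeled_data.csv"),
#                              pd.read_csv("raw_data/yt_saudi_turns.csv")])
# combined.to_csv("raw_data/final_raw_data.csv", index=False)
```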

Step 4: Label Correction

  • ✅ Fixed incorrectly labeled text segments
  • ✅ Applied custom function with specified correction rules
  • ✅ Improved data quality and accuracy
  • Output: raw_data/correct_labeled_data.csv
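
A hypothetical correction function in the spirit of the step above (the notebook's exact rules and connective list are assumptions): segments ending in final punctuation are forced to turn-end, while segments ending in a dangling connective are forced to not-turn-end.

```python
# Assumed list of Arabic connectives that signal an unfinished turn.
CONNECTIVES = ("بس", "يعني", "لكن", "و")

def correct_label(text: str, label: int) -> int:
    text = text.strip()
    if text.endswith(("؟", "?", "!", ".")):
        return 1  # final punctuation implies the turn ended
    words = text.split()
    if words and words[-1] in CONNECTIVES:
        return 0  # a trailing connective implies the speaker will continue
    return label  # otherwise keep the existing label
```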

Step 5: Basic Preprocessing

  • ✅ Applied standard text preprocessing techniques
  • ✅ Cleaned and normalized text data
  • Output: raw_data/preprocessed_data.csv
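
A minimal sketch of Arabic-aware cleaning; the notebook's exact steps are not listed, so these particular normalizations (diacritic and tatweel removal, alef unification, whitespace collapsing) are assumptions:

```python
import re

DIACRITICS = re.compile(r"[\u064B-\u0652]")  # fathatan .. sukun

def preprocess(text: str) -> str:
    text = DIACRITICS.sub("", text)              # strip diacritics
    text = text.replace("\u0640", "")            # strip tatweel (ـ)
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # آ/أ/إ -> ا
    return " ".join(text.split())                # collapse whitespace

print(preprocess("شكراً  كثير"))  # -> "شكرا كثير"
```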

Step 6: Manual Data Integration

  • ✅ Added manually collected data from raw_data/manual_data_collection/manual_data.csv
  • ✅ Combined with preprocessed data for final dataset
  • Output: raw_data/clean_data/processed_data.csv

Step 7: Train/Validation Split

  • ✅ Split final dataset into training and validation sets
  • ✅ Prepared data for model training
  • Output:
    • model_data/training_data.csv
    • model_data/validation_data.csv
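
The split can be sketched with scikit-learn. The 80/20 ratio comes from the project structure above; the stratification, seed, and "label" column name are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_dataset(df: pd.DataFrame, seed: int = 42):
    """80/20 split, stratified on the label so both sets keep the
    same class balance."""
    train_df, val_df = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df["label"]
    )
    return train_df, val_df

# Assumed usage:
# train_df, val_df = split_dataset(pd.read_csv("raw_data/clean_data/processed_data.csv"))
# train_df.to_csv("model_data/training_data.csv", index=False)
# val_df.to_csv("model_data/validation_data.csv", index=False)
```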

📊 Data Statistics

| Dataset | Location | Description |
|---------|----------|-------------|
| Initial | raw_data/labeled_data.csv | GitHub conversational data |
| YouTube | raw_data/yt_saudi_turns.csv | Saudi video transcripts |
| Combined | raw_data/final_raw_data.csv | Merged initial datasets |
| Corrected | raw_data/correct_labeled_data.csv | Fixed labeling errors |
| Preprocessed | raw_data/preprocessed_data.csv | Basic text cleaning |
| Manual | raw_data/manual_data_collection/manual_data.csv | Hand-collected data |
| Final | raw_data/clean_data/processed_data.csv | Complete cleaned dataset |
| Training | model_data/training_data.csv | Model training data |
| Validation | model_data/validation_data.csv | Model validation data |

🛠️ Custom Functions Used

1. Initial Labeling Function

  • Extracts utterances from conversation columns
  • Creates text segments
  • Labels each segment for turn-end detection

2. YouTube Processing Function

  • Processes video transcript data
  • Applies turn-end detection rules
  • Formats data consistently

3. Label Correction Function

  • Identifies and fixes mislabeled segments
  • Applies correction rules
  • Improves dataset quality

4. Preprocessing Function

  • Cleans and normalizes text
  • Prepares data for model training
  • Handles Arabic text specifics

End-of-Utterance (EOU) Detection Model Comparison

This repository contains the results of fine-tuning two different models for End-of-Utterance (EOU) detection in Arabic text. The task involves binary classification to determine whether a given text represents the end of an utterance or not.

📊 Model Overview

We compared two models:

  1. DistilBERT-Base-Multilingual-Cased (10 epochs)
  2. SmolLM2-135M (5 epochs)

🎯 Task Description

End-of-Utterance Detection is a binary classification task where:

  • Class 0 (No EOU): The text does not represent the end of an utterance
  • Class 1 (EOU): The text represents the end of an utterance

🤖 Model 1: DistilBERT-Base-Multilingual-Cased

Training Progress

| Step | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 Score |
|------|---------------|-----------------|----------|-----------|--------|----------|
| 100 | 0.382900 | 0.301439 | 0.864865 | 0.864664 | 0.864865 | 0.863368 |
| 200 | 0.302400 | 0.269788 | 0.874266 | 0.882617 | 0.874266 | 0.875527 |
| 300 | 0.201800 | 0.231544 | 0.914219 | 0.915077 | 0.914219 | 0.913335 |
| 400 | 0.146300 | 0.202987 | 0.921269 | 0.922963 | 0.921269 | 0.921644 |
| 500 | 0.043100 | 0.224561 | 0.936545 | 0.936483 | 0.936545 | 0.936316 |
| 600 | 0.046800 | 0.230006 | 0.942421 | 0.942320 | 0.942421 | 0.942303 |
| 700 | 0.016800 | 0.218244 | 0.949471 | 0.949409 | 0.949471 | 0.949428 |
| 800 | 0.004300 | 0.245689 | 0.950646 | 0.951005 | 0.950646 | 0.950371 |
| 900 | 0.000400 | 0.236911 | 0.950646 | 0.950692 | 0.950646 | 0.950468 |
| 1000 | 0.001400 | 0.234034 | 0.950646 | 0.950631 | 0.950646 | 0.950500 |

Final Evaluation Results

| Metric | Score |
|--------|-------|
| Overall Accuracy | 0.9506 |
| Overall Precision | 0.9506 |
| Overall Recall | 0.9506 |
| Overall F1-Score | 0.9505 |

Per-Class Performance

| Class | Precision | Recall | F1-Score |
|-------|-----------|--------|----------|
| Class 0 (No EOU) | 0.9511 | 0.9693 | 0.9602 |
| Class 1 (EOU) | 0.9498 | 0.9210 | 0.9352 |
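
Per-class numbers like these can be recomputed from raw predictions with scikit-learn. The y_true/y_pred values below are toy placeholders, not the actual validation outputs:

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0]

# One precision/recall/F1 triple per class label.
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, labels=[0, 1])
print(f"Class 0 (No EOU): P={prec[0]:.2f} R={rec[0]:.2f} F1={f1[0]:.2f}")
print(f"Class 1 (EOU):    P={prec[1]:.2f} R={rec[1]:.2f} F1={f1[1]:.2f}")
```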

Confusion Matrix

DistilBERT Confusion Matrix

Model Analysis

⚠️ Issue Identified: The validation loss starts increasing after epoch 4 (around step 400 in the table above) while the training loss continues to decrease, indicating overfitting. Although accuracy and the other metrics keep improving, the model is not generalizing well to unseen data.

🚀 Model 2: SmolLM2-135M (Recommended)

Training Progress

| Step | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 Score |
|------|---------------|-----------------|----------|-----------|--------|----------|
| 200 | 0.392800 | 0.312217 | 0.870740 | 0.870495 | 0.870740 | 0.869458 |
| 400 | 0.187000 | 0.229390 | 0.914219 | 0.922040 | 0.914219 | 0.912031 |
| 600 | 0.085800 | 0.260901 | 0.930670 | 0.932843 | 0.930670 | 0.931060 |
| 800 | 0.034100 | 0.246878 | 0.942421 | 0.942771 | 0.942421 | 0.942529 |
| 1000 | 0.010300 | 0.222678 | 0.949471 | 0.949445 | 0.949471 | 0.949457 |

Final Evaluation Results

| Metric | Score |
|--------|-------|
| Overall Accuracy | 0.9495 |
| Overall Precision | 0.9494 |
| Overall Recall | 0.9495 |
| Overall F1-Score | 0.9495 |

Per-Class Performance

| Class | Precision | Recall | F1-Score |
|-------|-----------|--------|----------|
| Class 0 (No EOU) | 0.9579 | 0.9598 | 0.9589 |
| Class 1 (EOU) | 0.9360 | 0.9331 | 0.9346 |

Confusion Matrix

SmolLM2 Confusion Matrix

✅ Strong Performance: The validation loss trends downward overall (ending at its lowest value) while accuracy and the other metrics improve, indicating good generalization. This model shows no clear signs of overfitting and is recommended for production use.


🧪 Test Results Comparison

Sample Predictions

| Arabic Text | DistilBERT Prediction | SmolLM2 Prediction |
|-------------|----------------------|--------------------|
| طيب، بس لازم نتفق أول. | No EOU (0.9864) | No EOU (0.9999) |
| هل تقدر توصلني بكرا؟ | EOU (0.9996) | EOU (1.0000) |
| أنا حاولت، لكن ما فهمت الدرس. | EOU (0.9992) | EOU (0.9900) |
| بس انت ما قلت لي متى نبدأ | No EOU (0.8094) | No EOU (0.9968) |
| شكراً كثير على المساعدة. | EOU (0.9994) | EOU (0.9999) |
| طيب نكمل بعدين؟ | EOU (0.9996) | EOU (1.0000) |
| يعني أنا كنت أنتظر منك ترد علي | EOU (0.7838) | No EOU (0.9912) |
| أنا آسف إذا زعلتك. | EOU (0.9674) | No EOU (0.9537) |
| لا تنسى ترجع المفتاح بعدين | No EOU (0.9992) | No EOU (0.9998) |
| هذا الشيء ما توقعت يصير! | EOU (0.9996) | EOU (1.0000) |
| هو قال لي أنو لازم ننتبه | EOU (0.9993) | EOU (0.7345) |
| إيش رأيك نطلب بيتزا؟ | EOU (0.9996) | EOU (1.0000) |
| أصلاً ما كان المفروض نجي | No EOU (0.9970) | EOU (0.9608) |
| طيب، نكمل الحين ولا بعدين؟ | EOU (0.9996) | EOU (1.0000) |
| أنا ما أقدر أقرر لحالي | EOU (0.9991) | No EOU (0.9815) |
| والله ما كنت أقصد. | No EOU (0.9991) | No EOU (1.0000) |

📈 Performance Summary

| Model | Accuracy | Precision | Recall | F1-Score | Overfitting |
|-------|----------|-----------|--------|----------|-------------|
| SmolLM2-135M | 0.9495 | 0.9494 | 0.9495 | 0.9495 | ✅ No |
| DistilBERT-Multilingual | 0.9506 | 0.9506 | 0.9506 | 0.9505 | ⚠️ Yes |

๐Ÿ† Conclusion

SmolLM2-135M is the recommended model for this task because:

  1. Better Generalization: No clear signs of overfitting, with validation loss still decreasing at the end of training
  2. Comparable Performance: Within about 0.1% of DistilBERT on every metric
  3. More Reliable: Stable training behavior and consistent predictions
  4. Efficiency: Achieved comparable results with fewer epochs (5 vs 10)

🔧 Usage

To use the trained SmolLM2-135M model:

# Load the fine-tuned model and tokenizer (the path is a placeholder)
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("path/to/smollm2-eou-detector")
model = AutoModelForSequenceClassification.from_pretrained("path/to/smollm2-eou-detector")

# Make a prediction
text = "هل تقدر توصلني بكرا؟"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
print(f"EOU probability: {probs[1].item():.4f}")