# Arabic Turn Detector: Data Collection, Preprocessing, and Fine-Tuning
## Objective

The objective of this project is to fine-tune a model for Arabic end-of-utterance (EOU) detection. Given the transcription received so far, the model outputs the probability that the speaker has finished their turn. This capability is crucial for the accuracy and responsiveness of real-time AI voice agents in Arabic-speaking contexts, enabling more natural and effective conversational interactions.
## Project Structure
```
Hams's_Task/
├── notebook/
│   ├── data_ingestion.ipynb        # Data collection and functions for labeling the data correctly
│   └── data_preprocessing.ipynb    # Data preprocessing and EDA on the collected data
├── fine_tuning/
│   ├── distilbert-base-multilingual-cased.py   # Fine-tuning script for DistilBERT
│   ├── fine_tuning_SmolLM2.py                  # Fine-tuning script for SmolLM2 (chosen, as its performance was good)
│   ├── SmolLM2_Artifacts/                      # Artifacts of the SmolLM2 model
│   └── distillbert_artifacts/                  # Confusion matrix image for DistilBERT
├── README.md
├── raw_data/
│   ├── dataset.csv                 # Dataset taken from GitHub
│   ├── labeled_data.csv            # GitHub dataset after initial processing (e.g. removing unnecessary columns)
│   ├── yt_saudi_turns.csv          # Collected YouTube transcript data (Saudi conversational)
│   ├── final_raw_data.csv          # Combined initial datasets (labeled_data.csv + yt_saudi_turns.csv)
│   ├── correct_labeled_data.csv    # final_raw_data.csv with labeling errors fixed
│   ├── preprocessed_data.csv       # correct_labeled_data.csv after basic preprocessing and EDA
│   ├── manual_data_collection/
│   │   └── manual_data.csv         # Manually collected data
│   └── clean_data/
│       └── processed_data.csv      # Final cleaned dataset built from manual_data.csv and preprocessed_data.csv
└── model_data/                     # processed_data.csv split into training and validation sets
    ├── training_data.csv           # Training split (80%)
    └── validation_data.csv         # Validation split (20%)
```
## Data Processing Pipeline

### Step 1: Initial Data Collection
Source: GitHub dataset (`dataset.csv`)
- ✅ Downloaded a conversational Arabic dataset from GitHub
- ✅ Removed unnecessary columns
- ✅ Applied a custom labeling function (sketched after this list) to:
  - Extract utterances from the conversation columns
  - Create segments from each utterance
  - Label each segment as "turn-end" or "not-turn-end"
  - Treat each utterance independently
- Output: `raw_data/labeled_data.csv`
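The exact labeling function lives in `notebook/data_ingestion.ipynb`. The sketch below only illustrates the idea, under the assumptions that each utterance is split into cumulative word-prefix segments, that only the complete utterance is labeled "turn-end", and that the utterances sit in a column named `utterance` (all assumptions, not the notebook's actual code):

```python
import pandas as pd

def label_utterance(utterance: str) -> pd.DataFrame:
    """Split one utterance into cumulative word segments and label them.

    Only the complete utterance counts as "turn-end" (label 1); every shorter
    prefix is "not-turn-end" (label 0). Illustrative only, not the exact code
    in notebook/data_ingestion.ipynb.
    """
    words = utterance.split()
    rows = [{"text": " ".join(words[:i]), "label": int(i == len(words))}
            for i in range(1, len(words) + 1)]
    return pd.DataFrame(rows)

# Build labeled_data.csv, assuming utterances sit in a column named "utterance".
conversations = pd.read_csv("raw_data/dataset.csv")
labeled = pd.concat([label_utterance(str(u)) for u in conversations["utterance"]], ignore_index=True)
labeled.to_csv("raw_data/labeled_data.csv", index=False)
```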
### Step 2: YouTube Data Collection
Source: Saudi conversational video transcripts
- ✅ Collected additional data from YouTube video transcripts
- ✅ Applied a custom function with specific turn-end detection rules (see the sketch below)
- ✅ Increased the dataset size for better model performance
- Output: `raw_data/yt_saudi_turns.csv`
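The transcript download itself is not shown here; the sketch below only illustrates rule-based turn-end labeling over transcript lines, with hypothetical punctuation rules and sample lines standing in for the notebook's actual rules and data:

```python
import pandas as pd

# Hypothetical turn-end punctuation rules; the real rules live in notebook/data_ingestion.ipynb.
TURN_END_MARKS = (".", "!", "?", "؟")

def label_transcript_lines(lines: list[str]) -> pd.DataFrame:
    """Label each transcript line as turn-end (1) or not (0) with simple punctuation rules."""
    rows = [{"text": line.strip(), "label": int(line.strip().endswith(TURN_END_MARKS))}
            for line in lines if line.strip()]
    return pd.DataFrame(rows)

# `transcript_lines` stands in for caption text pulled from the Saudi conversational videos.
transcript_lines = ["وش رايك نروح بكرة", "تمام، اتفقنا."]
label_transcript_lines(transcript_lines).to_csv("raw_data/yt_saudi_turns.csv", index=False)
```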
### Step 3: Data Combination
- ✅ Combined the YouTube data with the initial GitHub data (see the snippet below)
- ✅ Created a larger, unified dataset
- Output: `raw_data/final_raw_data.csv`
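Assuming the two CSVs share the same text and label columns, the combination step reduces to a row-wise concatenation, roughly:

```python
import pandas as pd

# Stack the GitHub-derived labels and the YouTube labels into one dataset.
labeled = pd.read_csv("raw_data/labeled_data.csv")
yt_turns = pd.read_csv("raw_data/yt_saudi_turns.csv")

pd.concat([labeled, yt_turns], ignore_index=True).to_csv("raw_data/final_raw_data.csv", index=False)
```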
### Step 4: Label Correction
- ✅ Fixed incorrectly labeled text segments
- ✅ Applied a custom function with specified correction rules (a sketch follows below)
- ✅ Improved data quality and accuracy
- Output: `raw_data/correct_labeled_data.csv`
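The actual correction rules are implemented in the notebook; the following is a minimal sketch of a rule-based correction pass, assuming `text`/`label` columns and using hypothetical punctuation and continuation-word heuristics:

```python
import pandas as pd

# Hypothetical continuation words that suggest the speaker has not finished yet.
CONTINUATION_WORDS = {"و", "بس", "لكن", "يعني"}

def correct_label(text: str, label: int) -> int:
    """Return a corrected turn-end label using simple heuristics (illustrative only)."""
    text = str(text).strip()
    if text.endswith((".", "!", "?", "؟")):
        return 1                                  # sentence-final punctuation: turn end
    if text and text.split()[-1] in CONTINUATION_WORDS:
        return 0                                  # trailing connective: not a turn end
    return label                                  # otherwise keep the existing label

df = pd.read_csv("raw_data/final_raw_data.csv")
df["label"] = [correct_label(t, l) for t, l in zip(df["text"], df["label"])]
df.to_csv("raw_data/correct_labeled_data.csv", index=False)
```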
### Step 5: Basic Preprocessing
- ✅ Applied standard text preprocessing techniques (see the sketch below)
- ✅ Cleaned and normalized the text data
- Output: `raw_data/preprocessed_data.csv`
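As an illustration of what basic preprocessing typically means for Arabic text (diacritic removal, alef normalization, whitespace cleanup), here is a minimal sketch; the column names and the exact cleaning steps are assumptions, not the notebook's code:

```python
import re
import pandas as pd

def normalize_arabic(text: str) -> str:
    """Basic Arabic cleaning: an illustrative stand-in for the notebook's preprocessing."""
    text = re.sub(r"[\u064B-\u0652]", "", text)   # strip diacritics (tashkeel)
    text = re.sub(r"[إأآ]", "ا", text)            # unify alef variants
    text = text.replace("ـ", "")                  # remove tatweel (kashida)
    return re.sub(r"\s+", " ", text).strip()      # collapse whitespace

df = pd.read_csv("raw_data/correct_labeled_data.csv")
df["text"] = df["text"].astype(str).apply(normalize_arabic)
df = df[df["text"].str.len() > 0]                 # drop rows emptied by cleaning
df.to_csv("raw_data/preprocessed_data.csv", index=False)
```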
### Step 6: Manual Data Integration
- ✅ Added manually collected data from `raw_data/manual_data_collection/manual_data.csv`
- ✅ Combined it with the preprocessed data to form the final dataset
- Output: `raw_data/clean_data/processed_data.csv`
### Step 7: Train/Validation Split
- ✅ Split the final dataset into training (80%) and validation (20%) sets (see the sketch below)
- ✅ Prepared the data for model training
- Output: `model_data/training_data.csv`, `model_data/validation_data.csv`
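A minimal sketch of the 80/20 split, assuming a `label` column and using scikit-learn; stratifying on the label is an added assumption, not something stated in this README:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("raw_data/clean_data/processed_data.csv")

# 80/20 split; stratifying on the label keeps the class balance in both files.
train_df, val_df = train_test_split(df, test_size=0.2, stratify=df["label"], random_state=42)

train_df.to_csv("model_data/training_data.csv", index=False)
val_df.to_csv("model_data/validation_data.csv", index=False)
```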
## Data Statistics

| Dataset | Location | Description |
|---|---|---|
| Initial | `raw_data/labeled_data.csv` | GitHub conversational data |
| YouTube | `raw_data/yt_saudi_turns.csv` | Saudi video transcripts |
| Combined | `raw_data/final_raw_data.csv` | Merged initial datasets |
| Corrected | `raw_data/correct_labeled_data.csv` | Fixed labeling errors |
| Preprocessed | `raw_data/preprocessed_data.csv` | Basic text cleaning |
| Manual | `raw_data/manual_data_collection/manual_data.csv` | Hand-collected data |
| Final | `raw_data/clean_data/processed_data.csv` | Complete cleaned dataset |
| Training | `model_data/training_data.csv` | Model training data |
| Validation | `model_data/validation_data.csv` | Model validation data |
## Custom Functions Used
1. Initial Labeling Function
- Extracts utterances from conversation columns
- Creates text segments
- Labels each segment for turn-end detection
2. YouTube Processing Function
- Processes video transcript data
- Applies turn-end detection rules
- Formats data consistently
3. Label Correction Function
- Identifies and fixes mislabeled segments
- Applies correction rules
- Improves dataset quality
4. Preprocessing Function
- Cleans and normalizes the Arabic text
- Applies the standard preprocessing described in Step 5
- Produces `raw_data/preprocessed_data.csv`
# End-of-Utterance (EOU) Detection Model Comparison
This repository contains the results of fine-tuning two different models for End-of-Utterance (EOU) detection in Arabic text. The task involves binary classification to determine whether a given text represents the end of an utterance or not.
## Model Overview
We compared two models:
- DistilBERT-Base-Multilingual-Cased (10 epochs)
- SmolLM2-135M (5 epochs)
## Task Description
End-of-Utterance Detection is a binary classification task where:
- Class 0 (No EOU): The text does not represent the end of an utterance
- Class 1 (EOU): The text represents the end of an utterance
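In code, this is a standard two-label sequence-classification setup; a label mapping along the following lines (the exact strings are an assumption) keeps predictions readable:

```python
# Hypothetical label names for the two classes used throughout this README.
id2label = {0: "No EOU", 1: "EOU"}
label2id = {name: idx for idx, name in id2label.items()}
```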
## Model 1: DistilBERT-Base-Multilingual-Cased

### Training Progress
| Step | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|---|
| 100 | 0.382900 | 0.301439 | 0.864865 | 0.864664 | 0.864865 | 0.863368 |
| 200 | 0.302400 | 0.269788 | 0.874266 | 0.882617 | 0.874266 | 0.875527 |
| 300 | 0.201800 | 0.231544 | 0.914219 | 0.915077 | 0.914219 | 0.913335 |
| 400 | 0.146300 | 0.202987 | 0.921269 | 0.922963 | 0.921269 | 0.921644 |
| 500 | 0.043100 | 0.224561 | 0.936545 | 0.936483 | 0.936545 | 0.936316 |
| 600 | 0.046800 | 0.230006 | 0.942421 | 0.942320 | 0.942421 | 0.942303 |
| 700 | 0.016800 | 0.218244 | 0.949471 | 0.949409 | 0.949471 | 0.949428 |
| 800 | 0.004300 | 0.245689 | 0.950646 | 0.951005 | 0.950646 | 0.950371 |
| 900 | 0.000400 | 0.236911 | 0.950646 | 0.950692 | 0.950646 | 0.950468 |
| 1000 | 0.001400 | 0.234034 | 0.950646 | 0.950631 | 0.950646 | 0.950500 |
### Final Evaluation Results
| Metric | Score |
|---|---|
| Overall Accuracy | 0.9506 |
| Overall Precision | 0.9506 |
| Overall Recall | 0.9506 |
| Overall F1-Score | 0.9505 |
### Per-Class Performance
| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| Class 0 (No EOU) | 0.9511 | 0.9693 | 0.9602 |
| Class 1 (EOU) | 0.9498 | 0.9210 | 0.9352 |
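The overall scores reported above are consistent with weighted averages over the two classes, which is what a scikit-learn-based `compute_metrics` hook for the Hugging Face `Trainer` produces. The hook below is a sketch of that setup, not necessarily the training script's exact code:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Weighted-average metrics for the binary EOU task (illustrative sketch)."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="weighted")
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```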
### Confusion Matrix
The confusion matrix image for this model is stored in `fine_tuning/distillbert_artifacts/`.

### Model Analysis
⚠️ Issue identified: the validation loss starts increasing after roughly epoch 4 while the training loss keeps decreasing, which indicates overfitting. Although accuracy and the other metrics improve, the model is not generalizing well to unseen data.
## Model 2: SmolLM2-135M (Recommended)

### Training Progress
| Step | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|---|
| 200 | 0.392800 | 0.312217 | 0.870740 | 0.870495 | 0.870740 | 0.869458 |
| 400 | 0.187000 | 0.229390 | 0.914219 | 0.922040 | 0.914219 | 0.912031 |
| 600 | 0.085800 | 0.260901 | 0.930670 | 0.932843 | 0.930670 | 0.931060 |
| 800 | 0.034100 | 0.246878 | 0.942421 | 0.942771 | 0.942421 | 0.942529 |
| 1000 | 0.010300 | 0.222678 | 0.949471 | 0.949445 | 0.949471 | 0.949457 |
### Final Evaluation Results
| Metric | Score |
|---|---|
| Overall Accuracy | 0.9495 |
| Overall Precision | 0.9494 |
| Overall Recall | 0.9495 |
| Overall F1-Score | 0.9495 |
### Per-Class Performance
| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| Class 0 (No EOU) | 0.9579 | 0.9598 | 0.9589 |
| Class 1 (EOU) | 0.9360 | 0.9331 | 0.9346 |
### Confusion Matrix
The confusion matrix image for this model is stored in `fine_tuning/SmolLM2_Artifacts/`.

### Model Analysis
✅ Excellent performance: the validation loss trends downward over training (ending below its starting value) while accuracy and the other metrics keep improving, indicating good generalization. This model shows no clear signs of overfitting and is recommended for production use.
## Test Results Comparison

### Sample Predictions
| Arabic Text | DistilBERT Prediction (confidence) | SmolLM2 Prediction (confidence) |
|---|---|---|
| ุทูุจุ ุจุณ ูุงุฒู ูุชูู ุฃูู. | No EOU (0.9864) | No EOU (0.9999) |
| ูู ุชูุฏุฑ ุชูุตููู ุจูุฑุงุ | EOU (0.9996) | EOU (1.0000) |
| ุฃูุง ุญุงููุชุ ููู ู ุง ููู ุช ุงูุฏุฑุณ. | EOU (0.9992) | EOU (0.9900) |
| ุจุณ ุงูุช ู ุง ููุช ูู ู ุชู ูุจุฏุฃ | No EOU (0.8094) | No EOU (0.9968) |
| ุดูุฑุงู ูุซูุฑ ุนูู ุงูู ุณุงุนุฏุฉ. | EOU (0.9994) | EOU (0.9999) |
| ุทูุจ ููู ู ุจุนุฏููุ | EOU (0.9996) | EOU (1.0000) |
| ูุนูู ุฃูุง ููุช ุฃูุชุธุฑ ู ูู ุชุฑุฏ ุนูู | EOU (0.7838) | No EOU (0.9912) |
| ุฃูุง ุขุณู ุฅุฐุง ุฒุนูุชู. | EOU (0.9674) | No EOU (0.9537) |
| ูุง ุชูุณู ุชุฑุฌุน ุงูู ูุชุงุญ ุจุนุฏูู | No EOU (0.9992) | No EOU (0.9998) |
| ูุฐุง ุงูุดูุก ู ุง ุชููุนุช ูุตูุฑ! | EOU (0.9996) | EOU (1.0000) |
| ูู ูุงู ูู ุฃูู ูุงุฒู ููุชุจู | EOU (0.9993) | EOU (0.7345) |
| ุฅูุด ุฑุฃูู ูุทูุจ ุจูุชุฒุงุ | EOU (0.9996) | EOU (1.0000) |
| ุฃุตูุงู ู ุง ูุงู ุงูู ูุฑูุถ ูุฌู | No EOU (0.9970) | EOU (0.9608) |
| ุทูุจุ ููู ู ุงูุญูู ููุง ุจุนุฏููุ | EOU (0.9996) | EOU (1.0000) |
| ุฃูุง ู ุง ุฃูุฏุฑ ุฃูุฑุฑ ูุญุงูู | EOU (0.9991) | No EOU (0.9815) |
| ูุงููู ู ุง ููุช ุฃูุตุฏ. | No EOU (0.9991) | No EOU (1.0000) |
## Performance Summary
| Model | Accuracy | Precision | Recall | F1-Score | Overfitting |
|---|---|---|---|---|---|
| SmolLM2-135M | 0.9495 | 0.9494 | 0.9495 | 0.9495 | No ✅ |
| DistilBERT-Multilingual | 0.9506 | 0.9506 | 0.9506 | 0.9505 | Yes ⚠️ |
## Conclusion

SmolLM2-135M is the recommended model for this task because of its:
- Better generalization: no signs of overfitting, and the validation loss ends lower than it started
- Comparable performance: within roughly 0.1 percentage points of DistilBERT on every metric
- Reliability: stable training behavior and consistent predictions
- Efficiency: comparable results with half the training epochs (5 vs 10)
## Usage

To use the trained SmolLM2-135M model, here is a minimal inference sketch (assuming the checkpoint was saved as a Hugging Face sequence-classification model; the placeholder path is kept from the original instructions):
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the fine-tuned model and tokenizer (replace the path with the actual checkpoint location)
tokenizer = AutoTokenizer.from_pretrained("path/to/smollm2-eou-detector")
model = AutoModelForSequenceClassification.from_pretrained("path/to/smollm2-eou-detector")

# Make a prediction: probability that the input text is the end of an utterance
text = "ูู ุชูุฏุฑ ุชูุตููู ุจูุฑุงุ"
inputs = tokenizer(text, return_tensors="pt")
probs = model(**inputs).logits.softmax(dim=-1)
print(f"P(EOU) = {probs[0, 1].item():.4f}")
```
Model tree: this checkpoint is published as Fahim000/SmolLM2-finetuned, fine-tuned from the base model HuggingFaceTB/SmolLM2-135M.