Kiswahili Sahihi ASR Adapted 3
Breakthrough Performance in Swahili Speech Recognition
Performance Evolution: Complete Version History
| Version | Best WER | Best CER | Training Data | Key Achievement |
| --- | --- | --- | --- | --- |
| Adapted 1 | 11.42% | 4.03% | 3,758 samples | Initial PEFT Implementation |
| Adapted 2 | 11.09% | 3.98% | 3,758 samples | Extended Training & Optimization |
| Adapted 3 | 6.70% | 2.90% | 8,912 samples | Major Accuracy Breakthrough |
Performance Improvements
- vs Adapted 1: 41% WER reduction (11.42% → 6.70%)
- vs Adapted 2: 40% WER reduction (11.09% → 6.70%)
- CER Improvement: 28% reduction vs Adapted 1 (4.03% → 2.90%) and 27% vs Adapted 2 (3.98% → 2.90%)
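The percentages above are relative reductions, computed as (old - new) / old. A minimal sketch of that arithmetic, using the best WER and CER figures from the version history table (the helper name `relative_reduction` is illustrative, not part of the model's code):

```python
def relative_reduction(old: float, new: float) -> float:
    """Return the relative reduction from old to new, in percent."""
    return (old - new) / old * 100.0


# Best WER / CER per version, copied from the version history table.
results = {
    "Adapted 1": {"wer": 11.42, "cer": 4.03},
    "Adapted 2": {"wer": 11.09, "cer": 3.98},
    "Adapted 3": {"wer": 6.70, "cer": 2.90},
}

latest = results["Adapted 3"]
for name in ("Adapted 1", "Adapted 2"):
    wer_drop = relative_reduction(results[name]["wer"], latest["wer"])
    cer_drop = relative_reduction(results[name]["cer"], latest["cer"])
    print(f"vs {name}: WER reduced by {wer_drop:.1f}%, CER reduced by {cer_drop:.1f}%")
```

Note that these are reductions in error rate, not percentage-point gains in accuracy: the absolute WER drop against Adapted 1 is 4.72 points.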
Model Architecture
- Base Model: keystats/kiswahili_sahihi_asr
- Fine-tuning Method: LoRA via PEFT (Parameter-Efficient Fine-Tuning)
- Trainable Parameters: 2.36M (0.31% of total 766M)
- Target Modules: q_proj, v_proj
- Tokenizer Vocabulary: 51,866 tokens
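To see where a figure like 2.36M trainable parameters can come from, the sketch below counts LoRA parameters for adapters on q_proj and v_proj. The layer counts, hidden size, and rank are assumptions (a Whisper-medium-shaped model with LoRA rank 8); the card does not state them. Under those assumptions the count happens to match the reported number, which is consistent with, but does not confirm, that configuration.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# ASSUMPTIONS (not stated in the model card): hidden size 1024,
# 24 encoder layers, 24 decoder layers, LoRA rank 8.
D_MODEL = 1024
ENCODER_LAYERS = 24
DECODER_LAYERS = 24
RANK = 8
TARGETS_PER_ATTENTION = 2  # q_proj and v_proj

# Each encoder layer has one attention block (self-attention); each decoder
# layer has two (self-attention and cross-attention).
attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS

trainable = (
    attention_blocks
    * TARGETS_PER_ATTENTION
    * lora_param_count(D_MODEL, D_MODEL, RANK)
)
total = 766_000_000  # total parameter count reported in the card

print(f"trainable LoRA parameters: {trainable:,}")
print(f"share of total: {trainable / total:.2%}")
```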
What Makes Adapted 3 Superior
Dramatic Accuracy Improvements
- 41% lower WER compared to Adapted 1
- 40% lower WER compared to Adapted 2
- 27-28% lower CER than both previous versions
- Stable training: validation WER stayed within 6.70-7.92% across all evaluation steps
Expanded & Enhanced Training Data
- 137% more training data (3,758 → 8,912 samples)
- Integration of keystats/swahili_asr_data for diverse Swahili speech patterns
- Larger validation set (484 vs 77 samples in v1/v2)
- Improved data balancing across different Swahili accents and domains
Optimized Training Strategy
- Refined hyperparameters based on v1/v2 learnings
- Enhanced gradient accumulation for stable updates
- Improved noise augmentation with better urban noise sampling
- Optimized learning rate scheduling for faster convergence
Detailed Training Performance
Adapted 3 Complete Training Progress
| Step | Training Loss | Validation Loss | WER (%) | CER (%) |
| --- | --- | --- | --- | --- |
| 400 | 0.2780 | 0.2711 | 7.92 | 3.10 |
| 800 | 0.2192 | 0.2378 | 7.18 | 3.01 |
| 1200 | 0.1982 | 0.2153 | 6.85 | 2.96 |
| 1600 | 0.1731 | 0.2046 | 6.70 | 2.90 |
| 2000 | 0.1968 | 0.1996 | 6.99 | 3.01 |
| 2400 | 0.1565 | 0.1939 | 6.80 | 2.94 |
| 2800 | 0.1830 | 0.1945 | 7.23 | 3.13 |
| 3200 | 0.1598 | 0.1905 | 6.87 | 2.98 |
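The best checkpoint is the one with the lowest validation WER, which is not necessarily the last step or the step with the lowest validation loss. A small sketch that reads this off the training progress table (the tuple layout is just a convenient encoding of the table rows):

```python
# (step, training loss, validation loss, WER %, CER %) from the training table.
history = [
    (400, 0.2780, 0.2711, 7.92, 3.10),
    (800, 0.2192, 0.2378, 7.18, 3.01),
    (1200, 0.1982, 0.2153, 6.85, 2.96),
    (1600, 0.1731, 0.2046, 6.70, 2.90),
    (2000, 0.1968, 0.1996, 6.99, 3.01),
    (2400, 0.1565, 0.1939, 6.80, 2.94),
    (2800, 0.1830, 0.1945, 7.23, 3.13),
    (3200, 0.1598, 0.1905, 6.87, 2.98),
]

# Checkpoint selection by WER (lower is better).
best_by_wer = min(history, key=lambda row: row[3])
# For comparison: the checkpoint that validation loss alone would select.
best_by_val_loss = min(history, key=lambda row: row[2])

wers = [row[3] for row in history]
print(f"best by WER: step {best_by_wer[0]} (WER {best_by_wer[3]:.2f}%)")
print(f"best by validation loss: step {best_by_val_loss[0]} (WER {best_by_val_loss[3]:.2f}%)")
print(f"WER range over training: {min(wers):.2f}-{max(wers):.2f}%")
```

Selecting by WER picks step 1600, while validation loss alone would pick step 3200, so the choice of selection metric matters for the reported best result.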
Performance Comparison Across Versions
WER Progression Timeline:
Adapted 1: 16.23% → 11.42% (Final) - Initial PEFT
Adapted 2: 16.23% → 11.09% (Final) - Extended training
Adapted 3: 7.92% → 6.87% (Final; best 6.70% at step 1600) - Enhanced data + optimization
Training Stability Analysis:
Adapted 1: WER range 11.42-16.23% (fluctuating)
Adapted 2: WER range 11.09-16.39% (improved but variable)
Adapted 3: WER range 6.70-7.92% (highly stable)
Technical Specifications
Enhanced Training Configuration
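from transformers import Seq2SeqTrainingArguments

# Excerpt of the training arguments. Settings not listed here (such as
# output_dir and the evaluation strategy) are omitted in the original.
training_args = Seq2SeqTrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,   # effective batch size of 8 per device
    learning_rate=1e-5,
    warmup_steps=500,
    num_train_epochs=3,
    fp16=True,                       # mixed-precision training
    gradient_checkpointing=True,     # trade extra compute for lower memory use
    eval_steps=400,
    save_steps=400,
    logging_steps=400,
    load_best_model_at_end=True,
    # Lower WER is better, so selecting the best checkpoint by "wer" should be
    # paired with greater_is_better=False (not shown in the original).
    metric_for_best_model="wer",
)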
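The batch settings above determine how many optimizer steps the run takes. The sketch below works through that arithmetic for the 8,912 training samples, assuming a single GPU (the device count is not stated in the card). The result is consistent with the training table ending at step 3200, the last multiple of 400 before training finishes.

```python
import math

train_samples = 8912        # training set size reported for Adapted 3
per_device_batch = 4        # per_device_train_batch_size
grad_accum = 2              # gradient_accumulation_steps
num_devices = 1             # ASSUMPTION: single GPU, not stated in the card
epochs = 3                  # num_train_epochs
eval_every = 400            # eval_steps
warmup_steps = 500

# Samples consumed per optimizer update.
effective_batch = per_device_batch * grad_accum * num_devices
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * epochs
# Last evaluation that fits before training ends.
last_eval_step = (total_steps // eval_every) * eval_every

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps per epoch: {steps_per_epoch}")
print(f"total optimizer steps: {total_steps}")
print(f"last evaluation step: {last_eval_step}")
print(f"warmup share of training: {warmup_steps / total_steps:.1%}")
```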
Expanded Dataset Composition
- Total Training Samples: 8,912 (137% increase from 3,758 in v1/v2)
- Total Validation Samples: 484 (529% increase from 77 in v1/v2)
- Primary Data Sources:
  - Sunbird/salt (studio-swa configuration): foundation
  - keystats/swahili_asr_data: critical for the performance boost
  - Sunbird/urban-noise-uganda-61k: enhanced noise robustness
Advanced Data Augmentation
- Intelligent Noise Injection: 50% probability with curated urban samples
- Dynamic Amplitude Variation: Up to 50% relative noise amplitude
- Smart Audio Chunking: Optimized for various audio durations
- Enhanced Attention Masking: Better handling of padded sequences
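The noise injection step described above can be sketched as follows. This is an illustration under stated assumptions, not the training code: the function name `add_urban_noise` is hypothetical, and "relative noise amplitude" is interpreted here as the noise peak being a random fraction (up to 0.5) of the speech peak.

```python
import numpy as np


def add_urban_noise(speech, noise, rng, p=0.5, max_rel_amp=0.5):
    """Mix noise into speech with probability p.

    ASSUMPTION: the noise is rescaled so that its peak is a uniformly drawn
    fraction in [0, max_rel_amp] of the speech peak.
    """
    if rng.random() >= p:
        return speech.copy()  # leave this sample clean

    # Repeat or crop the noise clip so it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    speech_peak = np.max(np.abs(speech))
    noise_peak = np.max(np.abs(noise))
    if speech_peak == 0 or noise_peak == 0:
        return speech.copy()  # nothing meaningful to scale against

    rel_amp = rng.uniform(0.0, max_rel_amp)
    return speech + noise * (rel_amp * speech_peak / noise_peak)


# Synthetic stand-ins for a speech clip and an urban noise clip at 16 kHz.
rng = np.random.default_rng(0)
speech = np.sin(np.linspace(0, 100, 16000))
noise = rng.standard_normal(5000)

augmented = add_urban_noise(speech, noise, rng)
print(augmented.shape)
```

Scaling by peak keeps the example simple; an RMS-based or SNR-based scaling would be an equally plausible reading of the description.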
Usage Example
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import PeftModel, PeftConfig

adapter_path = "keystats/kiswahili_sahihi_asr_adapted_3"

# Load the processor and the adapter config, which records the base model name.
processor = WhisperProcessor.from_pretrained(adapter_path)
peft_config = PeftConfig.from_pretrained(adapter_path)

base_model = WhisperForConditionalGeneration.from_pretrained(
    peft_config.base_model_name_or_path,
    ignore_mismatched_sizes=True,
)
# Match the embedding matrix to the tokenizer vocabulary before attaching the adapter.
base_model.resize_token_embeddings(len(processor.tokenizer))

# Attach the LoRA weights, then merge them into the base model for inference.
model = PeftModel.from_pretrained(base_model, adapter_path)
model = model.merge_and_unload()


def transcribe_swahili(audio_path):
    """Transcribe a single audio file and return the text."""
    # Whisper expects 16 kHz mono audio.
    audio, sr = librosa.load(audio_path, sr=16000, mono=True)
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=256,
            num_beams=2,
            repetition_penalty=1.1,
        )
    return processor.batch_decode(outputs, skip_special_tokens=True)[0]


transcription = transcribe_swahili("swahili_audio.wav")
print(f"Enhanced Transcription: {transcription}")
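To check transcriptions against reference text, WER and CER can be computed from the edit distance between the reference and the hypothesis. The sketch below is a plain-Python illustration with made-up example sentences; the card does not say which library or text normalization produced the reported scores, so results from this sketch may differ from them.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Illustrative sentences only, not taken from the evaluation data.
reference = "habari ya asubuhi"
hypothesis = "habari za asubuhi"
print(f"WER: {wer(reference, hypothesis):.2%}")
print(f"CER: {cer(reference, hypothesis):.2%}")
```

One substituted word out of three gives a much higher WER than CER here, which is why the two metrics in the tables above differ so much in magnitude.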
Why Adapted 3 is the Clear Choice
For Production Applications
- 41% lower WER than Adapted 1
- Stable validation WER throughout training, which supports reliable deployment
- Fewer transcription errors, reducing post-processing effort
For Research & Development
- Demonstrates PEFT scalability for low-resource languages
- Comprehensive benchmarking across three model versions
- Reproducible training methodology
For the Swahili Ecosystem
- Low error rates (6.70% WER, 2.90% CER on the validation set) suitable for many applications
- Support for diverse accents and speaking styles
- Accelerated digital inclusion for Swahili speakers
Real-World Impact
The 41% WER reduction in Adapted 3 enables:
- Education: Reliable transcription of educational content and lectures
- Healthcare: Accurate medical consultation documentation
- Business: High-quality call center automation and analytics
- Media: Professional-grade subtitling and content creation
- Technology: Superior voice interfaces for Swahili applications
- Government: Accurate transcription of public announcements and meetings
Technical Insights
Key Success Factors for Adapted 3:
- Data Diversity: keystats/swahili_asr_data provided crucial linguistic variety
- Training Scale: 137% more data enabled better generalization
- Validation Quality: A 529% larger validation set gave more reliable WER estimates and checkpoint selection
- Hyperparameter Refinement: Lessons from v1/v2 informed the final settings
- Architecture Consistency: Maintained efficient LoRA approach throughout
License
This model is licensed under the Apache 2.0 License.
Acknowledgments
This model series builds upon:
- Sunbird/salt for foundational Swahili speech data
- keystats/swahili_asr_data for the critical performance breakthrough in v3
- Urban noise augmentation for real-world robustness
- The PEFT/LoRA community for efficient fine-tuning methodologies
Experience the 41% WER Reduction!
Upgrade to Adapted 3 for production-ready Swahili speech recognition
"Mwenye pupa hadiri" - The hasty one doesn't arrive (Swahili Proverb)
Quality takes time, but delivers superior results