---
license: cc-by-nc-3.0
datasets:
- wenet-e2e/wenetspeech
- pengyizhou/wenetspeech-subset-S
language:
- zh
metrics:
- cer
base_model:
- openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
---

# Whisper Fine-tuning for Chinese (WenetSpeech)

This project provides a configurable way to fine-tune OpenAI's Whisper model on the WenetSpeech Chinese speech dataset.

## Features

- **Flexible Configuration**: All parameters are configurable through YAML files
- **Multi-GPU Support**: Automatic detection of and support for multiple GPUs
- **Dynamic Language Selection**: Train on any subset of supported languages
- **On-the-fly Processing**: Efficient memory usage with dynamic audio preprocessing
- **Comprehensive Evaluation**: Automatic evaluation on test sets

## Configuration

All parameters are configurable through the `config.yaml` file. The provided configuration is set up for Chinese speech training using the WenetSpeech dataset.

### Model Configuration

- Model checkpoint (default: `openai/whisper-large-v3`)
- Maximum target length for sequences

### Dataset Configuration

- Uses the WenetSpeech Chinese speech dataset
- Multiple dataset splits (train, validation, test_net, test_meeting)
- Language-specific settings
- Training configuration optimized for Chinese speech recognition

### Training Configuration

- Learning rate, batch sizes, training steps
- Multi-GPU vs. single-GPU settings
- Evaluation and logging parameters

### Environment Configuration

- CPU core limits
- Environment variables for optimization

### Pushing to Hub

The configuration does not push to the Hugging Face Hub by default. You can enable this by setting `push_to_hub: true` in your config file.
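As a minimal sketch of how such a YAML file is read at startup with PyYAML: the section names mirror the list above, but the exact keys and values shown here are illustrative assumptions, not the project's real schema.

```python
# Sketch: loading a config.yaml-style file with PyYAML.
# Section/key names mirror the README; the values are illustrative only.
import yaml

EXAMPLE = """
model:
  checkpoint: openai/whisper-large-v3
  max_target_length: 225
training:
  learning_rate: 1.0e-5
  push_to_hub: false
"""

def load_config(text):
    """Parse a YAML config string into a nested dict."""
    return yaml.safe_load(text)

config = load_config(EXAMPLE)
print(config["model"]["checkpoint"])      # openai/whisper-large-v3
print(config["training"]["push_to_hub"])  # False
```

Note the `1.0e-5` form: PyYAML only recognizes scientific notation as a float when the mantissa contains a decimal point, so `1e-5` would silently load as a string.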
## Usage

### Basic Usage

```bash
python finetune.py --config config.yaml
```

### Custom Configuration

```bash
python finetune.py --config my_custom_config.yaml
```

### Multi-GPU Training

```bash
# Using torchrun (recommended) for two GPUs
torchrun --nproc_per_node=2 finetune.py --config config.yaml
```

## Configuration File Structure

The `config.yaml` file is organized into the following sections:

1. **model**: Model checkpoint and sequence length settings
2. **output**: Output directory configuration
3. **environment**: Environment variables and CPU settings
4. **audio**: Audio processing settings (sampling rate)
5. **languages**: Chinese language configuration
6. **datasets**: WenetSpeech dataset configuration
7. **training**: All training hyperparameters
8. **data_processing**: Data processing settings

## Customizing Your Training

### Adjusting Training Parameters

Modify the `training` section in `config.yaml` to:

- Change the learning rate, batch sizes, or training steps
- Adjust evaluation frequency
- Configure multi-GPU settings

### Environment Optimization

Adjust the `environment` section to optimize for your system:

- Set CPU core limits
- Configure memory usage settings

The provided `config.yaml` is specifically configured for Chinese WenetSpeech training.
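The `--config` flag in the usage examples above can be handled with a standard `argparse` entry point. This is a minimal illustrative sketch, not the actual argument handling in `finetune.py`:

```python
# Sketch: parsing the --config flag used by the finetune.py-style commands above.
# The flag name and default mirror the usage examples; the rest is illustrative.
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Fine-tune Whisper on WenetSpeech")
    parser.add_argument(
        "--config",
        default="config.yaml",
        help="Path to the YAML configuration file",
    )
    return parser

args = build_parser().parse_args(["--config", "my_custom_config.yaml"])
print(args.config)  # my_custom_config.yaml
```

Because `--config` has a default, `python finetune.py` alone falls back to `config.yaml`, which matches the basic-usage command.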
## Training Commands

### Single GPU Training

```bash
python finetune.py
```

### Multi-GPU Training

```bash
torchrun --nproc_per_node=2 finetune.py
```

## Inference Guide

After training your model, you can use the provided `inference.py` script for speech recognition:

```bash
python inference.py
```

The inference script includes:

- Model loading from the trained checkpoint
- An audio preprocessing pipeline
- Text generation with proper formatting
- Support for Chinese speech transcription

### Using the Trained Model

The inference script automatically handles:

- Loading the fine-tuned model weights
- Audio preprocessing at the proper sampling rate
- Generating transcriptions for Chinese speech
- Output formatting for evaluation metrics

### WenetSpeech Evaluation

**Evaluation Protocol**: WenetSpeech provides multiple test sets for comprehensive evaluation:

```bash
# Run inference to generate predictions
python inference.py
```

**Test Sets Available**:

- **DEV**: Development set for validation during training
- **TEST_NET**: Internet-sourced audio test set
- **TEST_MEETING**: Meeting audio test set

Evaluation uses Character Error Rate (CER), which is appropriate for Chinese speech recognition.

#### WenetSpeech Dataset Characteristics

- **High Quality**: Professional recordings with clean annotations
- **Diverse Content**: Multiple domains including internet audio and meeting recordings
- **Large Scale**: Extensive dataset for robust Chinese ASR training
- **Standard Benchmark**: Widely used for Chinese speech recognition evaluation

## WenetSpeech Dataset

This model is specifically designed for Chinese speech recognition using the WenetSpeech corpus.
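The project computes CER with the `evaluate` library, but the metric itself is simple: the character-level edit distance between hypothesis and reference, divided by the reference length. A self-contained pure-Python sketch (illustrative, not the project's actual implementation):

```python
# Pure-Python character error rate: Levenshtein distance over reference length.
# The project itself uses the `evaluate` library; this is an illustrative sketch.
def cer(reference, hypothesis):
    """Character error rate = edit_distance(ref, hyp) / len(ref)."""
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / m

print(cer("今天天气很好", "今天天气真好"))  # one substitution over six characters
```

Because Chinese has no word boundaries in running text, character-level scoring like this is the standard metric, where English ASR would typically report word error rate instead.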
Key characteristics:

- **Multi-domain Audio**: Internet videos, audiobooks, podcasts, and meeting recordings
- **High-quality Annotations**: Professional Chinese transcriptions with punctuation
- **Large Scale**: 10,000+ hours of labeled Chinese speech data (we only use subset-S for training)
- **Standard Benchmark**: Widely adopted for Chinese ASR research and development

### Dataset Characteristics

- **Audio Quality**: Various quality levels, from internet sources to studio recordings
- **Speaking Styles**: Read speech, spontaneous speech, and conversational audio
- **Vocabulary**: Large vocabulary covering diverse topics and domains
- **Language**: Mandarin Chinese with standard simplified Chinese characters
- **Evaluation Protocol**: Character Error Rate (CER) based evaluation

### Training Configuration

- **Training Data**: WenetSpeech subset-S for efficient training
- **Validation**: DEV_fixed split for model selection
- **Test Sets**: TEST_NET (internet audio) and TEST_MEETING (meeting audio)
- **Metric**: Character Error Rate (CER), suited to Chinese script

## Dependencies

Install the required packages:

```bash
pip install -r requirements.txt
```

Key dependencies:

- PyYAML (for configuration loading)
- torch, transformers, datasets
- librosa (for audio processing)
- evaluate (for metrics)

## Zeroshot Results

| LID     | Datasets    | Metric | Error Rate |
|---------|-------------|:------:|-----------:|
| Chinese | Chinese-NET | CER    |     12.16% |
| Chinese | Chinese-MTG | CER    |     19.83% |
| Auto    | Chinese-NET | CER    |     12.37% |
| Auto    | Chinese-MTG | CER    |     20.03% |

## Evaluation Results

| LID     | Datasets    | Metric | Error Rate |
|---------|-------------|:------:|-----------:|
| Chinese | Chinese-NET | CER    |     13.16% |
| Chinese | Chinese-MTG | CER    |     22.35% |
| Auto    | Chinese-NET | CER    |     13.16% |
| Auto    | Chinese-MTG | CER    |     22.34% |

**Note**: If you encounter issues running `finetune.py`, you can use the `finetune-backup.py` file, which contains the original hardcoded configuration that was used to generate these evaluation metrics.
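Since the WenetSpeech annotations include punctuation but CER scores only the characters themselves, transcripts are typically normalized (punctuation and whitespace stripped) before scoring. A minimal sketch of such normalization; the actual normalization applied in `inference.py` may differ:

```python
# Sketch: normalize Chinese transcripts before CER scoring by removing
# punctuation and whitespace, keeping only Han characters, letters, and digits.
# The project's actual normalization in inference.py may differ.
import re

def normalize(text):
    """Drop whitespace and punctuation; keep word characters (minus underscore)."""
    return re.sub(r"[^\w]", "", text).replace("_", "")

print(normalize("今天 天气，很好！"))  # 今天天气很好
```

Python's `\w` is Unicode-aware by default, so Han characters survive the substitution while full-width punctuation such as `，` and `！` is removed.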