Description
This model is a Reward Model trained on the RobotsMali transcription scorer dataset, where the scores were assigned by human annotators.
It predicts a continuous score between 0 and 1 for a pair (audio, text), representing how well the text matches the spoken audio.
The model can be integrated as a Reward Model within RLHF pipelines to evaluate or fine-tune ASR models based on human preference scores.
Model Overview
The model consists of two main encoders, one for audio and one for text, followed by a small regression head that outputs a scalar score.
Audio Encoder
Input: Raw waveform (16 kHz)
Feature extraction: Mel-spectrogram computed from waveform using WhisperFeatureExtractor
Parameters:
- n_fft: 1024
- n_mels: 80
- hop_length: 256
- sample_rate: 16000
Architecture:
- 3 × (Conv1d → BatchNorm1d → ReLU)
- Kernel size: 5, stride: 1, padding: 2.
- Channel size: 128.
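The convolutional stack above can be sketched as follows. This is an illustrative reconstruction, not the actual checkpoint code: the class and attribute names (`AudioEncoder`, `net`) and the mean pooling over time are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the audio encoder: three Conv1d -> BatchNorm1d -> ReLU
# blocks over an 80-bin mel-spectrogram, pooled to a single 128-d embedding.
class AudioEncoder(nn.Module):
    def __init__(self, n_mels: int = 80, channels: int = 128):
        super().__init__()
        layers, in_ch = [], n_mels
        for _ in range(3):
            layers += [
                nn.Conv1d(in_ch, channels, kernel_size=5, stride=1, padding=2),
                nn.BatchNorm1d(channels),
                nn.ReLU(),
            ]
            in_ch = channels
        self.net = nn.Sequential(*layers)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames) -> (batch, channels)
        return self.net(mel).mean(dim=-1)
```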
Text Encoder
Input: Tokenized transcription (IDs from SentencePiece tokenizer)
Architecture:
- Embedding layer: dim = 128, vocab_size = 2048
- Bidirectional LSTM: hidden size = 128, 1 layer
- Output: mean pooling over valid tokens
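A minimal sketch of the text encoder described above, under the assumption that padding tokens use ID 0 and a binary mask marks valid tokens; names like `TextEncoder` are illustrative. Note the bidirectional LSTM yields a 256-d embedding (2 × 128).

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the text encoder: embedding + 1-layer BiLSTM,
# mean-pooled over valid (non-padding) tokens.
class TextEncoder(nn.Module):
    def __init__(self, vocab_size: int = 2048, dim: int = 128, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.lstm = nn.LSTM(dim, hidden, num_layers=1,
                            batch_first=True, bidirectional=True)

    def forward(self, ids: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # ids: (batch, seq); mask: (batch, seq), 1 for real tokens, 0 for padding
        out, _ = self.lstm(self.embed(ids))           # (batch, seq, 2 * hidden)
        mask = mask.unsqueeze(-1).float()
        return (out * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
```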
Fusion & Regression Head
Fusion: Concatenate [audio_emb, text_emb]  
Regression head:
- Linear(384 → 256) → ReLU → Dropout(0.3)
- Linear(256 → 256) → ReLU
- Linear(256 → 1) → Sigmoid
Output: Scalar ∈ [0, 1] (reward score)
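The fusion and regression head can be sketched as below, assuming a 128-d audio embedding and a 256-d text embedding (384 after concatenation, matching the first Linear layer); class and attribute names are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the fusion + regression head: concatenate the
# audio and text embeddings, then regress a score squashed into [0, 1].
class RewardHead(nn.Module):
    def __init__(self, audio_dim: int = 128, text_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + text_dim, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, audio_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # (batch, 128) + (batch, 256) -> (batch,) reward scores in [0, 1]
        return self.net(torch.cat([audio_emb, text_emb], dim=-1)).squeeze(-1)
```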
Objective
- Loss: Mean Squared Error (MSE)
- Goal: Predict the similarity score between the spoken audio and its transcription.
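An illustrative MSE training step under these assumptions: the model maps fused features to a score in [0, 1], and targets are the human-assigned scores in the same range. The stand-in model and tensor shapes are hypothetical.

```python
import torch
import torch.nn as nn

# Stand-in for the full reward model (hypothetical; for illustration only).
model = nn.Sequential(nn.Linear(384, 1), nn.Sigmoid())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

features = torch.randn(4, 384)  # stand-in fused (audio, text) embeddings
targets = torch.rand(4)         # human-assigned scores in [0, 1]

# One MSE regression step, matching the stated objective.
pred = model(features).squeeze(-1)
loss = loss_fn(pred, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```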
Example Usage
Coming soon.