Description
This model is a Reward Model trained on the RobotsMali transcription scorer dataset, where the scores were assigned by human annotators.
It predicts a continuous score between 0 and 1 for a pair (audio, text), representing how well the text matches the spoken audio.
The model can be integrated as a Reward Model within RLHF pipelines to evaluate or fine-tune ASR models based on human preference scores.
Model Overview
The model consists of two main encoders, one for audio and one for text, followed by a small regression head that outputs a scalar score.
Audio Encoder
Input: Raw waveform (16 kHz)
Feature extraction: Mel-spectrogram computed from the waveform (amplitude → dB)
Parameters:
- n_fft: 1024
- n_mels: 80
- hop_length: 256
- sample_rate: 16000
Architecture:
- 3 × (Conv1d → BatchNorm1d → ReLU)
- Kernel size: 5, stride: 1, padding: 2
- Channels: 128
- Temporal pooling: AdaptiveAvgPool1d(1).
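The audio encoder above can be sketched in PyTorch as follows. This is a minimal reconstruction from the listed hyperparameters, not the released code: the class name, layer ordering within the loop, and the dummy log-mel input are assumptions.

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Sketch: 3 x (Conv1d -> BatchNorm1d -> ReLU) over mel features, then average pooling."""

    def __init__(self, n_mels: int = 80, channels: int = 128):
        super().__init__()
        layers = []
        in_ch = n_mels
        for _ in range(3):
            layers += [
                nn.Conv1d(in_ch, channels, kernel_size=5, stride=1, padding=2),
                nn.BatchNorm1d(channels),
                nn.ReLU(),
            ]
            in_ch = channels
        self.conv = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool1d(1)  # collapse the time axis

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, time), e.g. a dB-scaled mel-spectrogram
        x = self.conv(mel)                   # (batch, 128, time)
        return self.pool(x).squeeze(-1)      # (batch, 128)

mel = torch.randn(2, 80, 200)  # dummy batch of log-mel features
audio_emb = AudioEncoder()(mel)
print(audio_emb.shape)  # torch.Size([2, 128])
```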
Text Encoder
Input: Tokenized transcription (IDs from SentencePiece tokenizer)
Architecture:
- Embedding layer: dim = 128, vocab_size = 5000
- Bidirectional LSTM: hidden size = 128, 1 layer
- Output: mean pooling over valid tokens
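A matching sketch of the text encoder, again reconstructed from the card rather than taken from the released code. Treating token ID 0 as padding is an assumption; the mean pooling over valid tokens is masked accordingly.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch: Embedding -> 1-layer BiLSTM -> masked mean pooling over valid tokens."""

    def __init__(self, vocab_size: int = 5000, dim: int = 128, hidden: int = 128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim, padding_idx=0)  # pad id 0 assumed
        self.lstm = nn.LSTM(dim, hidden, num_layers=1,
                            batch_first=True, bidirectional=True)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (batch, seq) SentencePiece token IDs, 0 = padding (assumed)
        mask = (ids != 0).unsqueeze(-1).float()          # (batch, seq, 1)
        out, _ = self.lstm(self.emb(ids))                # (batch, seq, 2 * hidden)
        summed = (out * mask).sum(dim=1)
        return summed / mask.sum(dim=1).clamp(min=1.0)   # mean over valid tokens

ids = torch.tensor([[5, 17, 42, 0, 0],
                    [3, 9, 0, 0, 0]])
text_emb = TextEncoder()(ids)
print(text_emb.shape)  # torch.Size([2, 256]) -- bidirectional doubles the hidden size
```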
Fusion & Regression Head
Fusion: Concatenate [audio_emb, text_emb]
Regression head:
- Linear(384 → 256) → ReLU → Dropout(0.3)
- Linear(256 → 256) → ReLU
- Linear(256 → 1) → Sigmoid
Output: scalar ∈ [0, 1] (reward score)
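The fusion dimensions are consistent with the encoders: a 128-dim audio embedding concatenated with a 256-dim bidirectional-LSTM text embedding gives the 384-dim input to the head. A minimal sketch of the head, with dummy embeddings standing in for real encoder outputs:

```python
import torch
import torch.nn as nn

# Regression head as described: 384 -> 256 -> 256 -> 1, sigmoid output.
head = nn.Sequential(
    nn.Linear(384, 256), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 1), nn.Sigmoid(),
)

audio_emb = torch.randn(2, 128)   # stand-in for AudioEncoder output
text_emb = torch.randn(2, 256)    # stand-in for TextEncoder output

fused = torch.cat([audio_emb, text_emb], dim=-1)  # (2, 384)
score = head(fused)                               # (2, 1), each value in [0, 1]
```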
Objective
- Loss: Mean Squared Error (MSE)
- Goal: Predict the similarity score between the spoken audio and its transcription.
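A single training step under the stated MSE objective might look like the sketch below. The stand-in model, optimizer choice, learning rate, and random batch are assumptions for illustration; only the loss (MSE against human scores in [0, 1]) comes from the card.

```python
import torch
import torch.nn as nn

# Stand-in for the full model: fused 384-dim features -> sigmoid score.
model = nn.Sequential(nn.Linear(384, 1), nn.Sigmoid())
opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # optimizer/lr assumed
loss_fn = nn.MSELoss()

features = torch.randn(8, 384)  # dummy fused (audio, text) embeddings
targets = torch.rand(8, 1)      # dummy human annotator scores in [0, 1]

loss = loss_fn(model(features), targets)
opt.zero_grad()
loss.backward()
opt.step()
```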
Example Usage
Coming soon.