## Description
This model is a Reward Model trained on the RobotsMali transcription scorer dataset, where the scores were assigned by human annotators.
It predicts a continuous score between 0 and 1 for an (audio, text) pair, representing how well the text matches the spoken audio.
The model can be integrated as a Reward Model within RLHF pipelines to evaluate or fine-tune ASR models based on human preference scores.
## Model Overview
The model consists of two main encoders, one for audio and one for text, followed by a small regression head that outputs a scalar score.
### Audio Encoder
**Input:** Raw waveform (16 kHz)

**Feature extraction:** Mel-spectrogram computed from the waveform using `WhisperFeatureExtractor`, with the following parameters (see the sketch after this list):

- `n_fft`: 1024
- `n_mels`: 80
- `hop_length`: 256
- `sample_rate`: 16000
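For illustration only, standalone feature extraction with these parameters could look like the snippet below. The packaged `RewardModelProcessor` (see Example Usage) handles this step internally, so the exact wiring here is an assumption:

```python
import numpy as np
from transformers import WhisperFeatureExtractor

# Hypothetical standalone extraction mirroring the parameters listed above
feature_extractor = WhisperFeatureExtractor(
    feature_size=80,       # n_mels
    sampling_rate=16000,
    n_fft=1024,
    hop_length=256,
)

waveform = np.zeros(16000, dtype=np.float32)  # 1 s of silence as a stand-in
features = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
mel = features.input_features  # (1, 80, frames) mel features
```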
**Architecture** (sketched below):

- 3 × (Conv1d → BatchNorm1d → ReLU)
- Kernel size: 5, stride: 1, padding: 2
- Channel size: 128
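A minimal PyTorch sketch of this convolutional stack, assuming mean-pooling over time to produce a fixed-size embedding (the pooling strategy and module names are assumptions, not the checkpoint's actual attributes):

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Sketch: 3 x (Conv1d -> BatchNorm1d -> ReLU) over 80 mel channels."""
    def __init__(self, n_mels: int = 80, channels: int = 128):
        super().__init__()
        layers, in_ch = [], n_mels
        for _ in range(3):
            layers += [
                nn.Conv1d(in_ch, channels, kernel_size=5, stride=1, padding=2),
                nn.BatchNorm1d(channels),
                nn.ReLU(),
            ]
            in_ch = channels
        self.conv = nn.Sequential(*layers)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        x = self.conv(mel)     # (batch, n_mels, time) -> (batch, 128, time)
        return x.mean(dim=-1)  # pool over time -> (batch, 128)
```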
### Text Encoder
**Input:** Tokenized transcription (token IDs from a SentencePiece tokenizer)

**Architecture** (sketched below):

- Embedding layer: dim = 128, vocab_size = 2048
- Bidirectional LSTM: hidden size = 128, 1 layer
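A corresponding sketch of the text branch, assuming the final forward and backward hidden states are concatenated into a 256-dimensional embedding (that readout is an assumption):

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch: embedding layer followed by a single-layer BiLSTM."""
    def __init__(self, vocab_size: int = 2048, emb_dim: int = 128, hidden: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=1,
                            batch_first=True, bidirectional=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embedding(token_ids)  # (batch, seq_len, 128)
        _, (h_n, _) = self.lstm(x)     # h_n: (2, batch, 128)
        # Concatenate final forward and backward hidden states
        return torch.cat([h_n[0], h_n[1]], dim=-1)  # (batch, 256)
```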
### Fusion & Regression Head
**Fusion:** Concatenate `[audio_emb, text_emb]`

**Regression head** (sketched below):

- Linear(384 → 256) → ReLU → Dropout(0.3)
- Linear(256 → 256) → ReLU
- Linear(256 → 1) → Sigmoid

**Output:** Scalar ∈ [0, 1] (reward score)
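The 384-dimensional input matches the concatenation of the 128-dim audio embedding and the 256-dim bidirectional text embedding. A minimal sketch of the head:

```python
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    """Sketch of the fusion + regression head described above."""
    def __init__(self, audio_dim: int = 128, text_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(audio_dim + text_dim, 256),  # 384 -> 256
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
            nn.Sigmoid(),  # squashes the output into [0, 1]
        )

    def forward(self, audio_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([audio_emb, text_emb], dim=-1)  # (batch, 384)
        return self.mlp(fused).squeeze(-1)                # (batch,) reward scores
```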
## Objective

- Loss: Mean Squared Error (MSE)
- Goal: Predict the similarity score between the spoken audio and its transcription (a sketch of one training step follows this list).
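Using the toy encoders sketched above, one MSE training step might look like this; the batch names and model signature are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, mel, token_ids, human_scores):
    """One hypothetical MSE training step against annotator scores."""
    preds = model(mel, token_ids)           # predicted rewards in [0, 1], shape (batch,)
    loss = F.mse_loss(preds, human_scores)  # regress onto the human-assigned scores
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```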
## Example Usage
First, install our package:

```bash
pip install git+https://github.com/diarray-hub/bambara-asr.git@rlnf-v2-gpu
```
```python
import torch

from RLNF.Rewards.reward_config import RewardConfig
from RLNF.Rewards.reward_model import RewardModel
from RLNF.Rewards.reward_processor import RewardModelProcessor

audios = ["1.wav", "2.wav"]
texts = ["kelen", "fila."]

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pretrained processor and reward model from the Hub
processor: RewardModelProcessor = RewardModelProcessor.from_pretrained("RobotsMali/reward-model")
model: RewardModel = RewardModel.from_pretrained("RobotsMali/reward-model")
model.eval()
model.to(device)

# Preprocess the (audio, text) pairs and move tensors to the target device
out = processor(audios=audios, texts=texts)
out = {k: v.to(device) if torch.is_tensor(v) else v for k, v in out.items()}

with torch.no_grad():
    preds = model(**out).logits

for i, (t, val) in enumerate(zip(texts, preds)):
    # Scale the [0, 1] reward to a percentage for display
    print(f"Audio: {audios[i]:<10} | Text: {t:<10} | Score: {val.item() * 100:.4f}")
```
## Evaluation Results
| Metric   | Value  |
|----------|--------|
| MSE      | 0.0767 |
| R²       | 0.4268 |
| Pearson  | 0.6603 |
| Accuracy | 0.33   |