## Description
This model is a Reward Model trained on the RobotsMali transcription scorer dataset, where the scores were assigned by human annotators.
It predicts a continuous score between 0 and 1 for an (audio, text) pair, representing how well the text matches the spoken audio.
The model can be integrated as a Reward Model within RLHF pipelines to evaluate or fine-tune ASR models based on human preference scores.
## Model Overview
The model consists of two main encoders, one for audio and one for text, followed by a small regression head that outputs a scalar score.
### Audio Encoder
**Input:** Raw waveform (16 kHz)

**Feature extraction:** Mel-spectrogram computed from the waveform using `WhisperFeatureExtractor`, with the following parameters (see the sketch after this list):

- `n_fft`: 1024
- `n_mels`: 80
- `hop_length`: 256
- `sample_rate`: 16000
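For illustration only, standalone feature extraction with these parameters could look like the snippet below. The packaged `RewardModelProcessor` (see Example Usage) handles this step internally, so the exact wiring here is an assumption:

```python
import numpy as np
from transformers import WhisperFeatureExtractor

# Hypothetical standalone extraction mirroring the parameters listed above
feature_extractor = WhisperFeatureExtractor(
    feature_size=80,       # n_mels
    sampling_rate=16000,
    n_fft=1024,
    hop_length=256,
)

waveform = np.zeros(16000, dtype=np.float32)  # 1 s of silence as a stand-in
features = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
mel = features.input_features  # (1, 80, frames) mel features
```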
**Architecture** (sketched below):

- 3 × (Conv1d → BatchNorm1d → ReLU)
- Kernel size: 5, stride: 1, padding: 2
- Channel size: 128
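A minimal PyTorch sketch of this convolutional stack, assuming mean-pooling over time to produce a fixed-size embedding (the pooling strategy and module names are assumptions, not the checkpoint's actual attributes):

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Sketch: 3 x (Conv1d -> BatchNorm1d -> ReLU) over 80 mel channels."""
    def __init__(self, n_mels: int = 80, channels: int = 128):
        super().__init__()
        layers, in_ch = [], n_mels
        for _ in range(3):
            layers += [
                nn.Conv1d(in_ch, channels, kernel_size=5, stride=1, padding=2),
                nn.BatchNorm1d(channels),
                nn.ReLU(),
            ]
            in_ch = channels
        self.conv = nn.Sequential(*layers)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        x = self.conv(mel)     # (batch, n_mels, time) -> (batch, 128, time)
        return x.mean(dim=-1)  # pool over time -> (batch, 128)
```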
### Text Encoder
**Input:** Tokenized transcription (token IDs from a SentencePiece tokenizer)

**Architecture** (sketched below):

- Embedding layer: dim = 128, vocab_size = 2048
- Bidirectional LSTM: hidden size = 128, 1 layer
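A corresponding sketch of the text branch, assuming the final forward and backward hidden states are concatenated into a 256-dimensional embedding (that readout is an assumption):

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch: embedding layer followed by a single-layer BiLSTM."""
    def __init__(self, vocab_size: int = 2048, emb_dim: int = 128, hidden: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=1,
                            batch_first=True, bidirectional=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embedding(token_ids)  # (batch, seq_len, 128)
        _, (h_n, _) = self.lstm(x)     # h_n: (2, batch, 128)
        # Concatenate final forward and backward hidden states
        return torch.cat([h_n[0], h_n[1]], dim=-1)  # (batch, 256)
```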
### Fusion & Regression Head
**Fusion:** Concatenate `[audio_emb, text_emb]`

**Regression head** (sketched below):

- Linear(384 → 256) → ReLU → Dropout(0.3)
- Linear(256 → 256) → ReLU
- Linear(256 → 1) → Sigmoid

**Output:** Scalar ∈ [0, 1] (reward score)
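The 384-dimensional input matches the concatenation of the 128-dim audio embedding and the 256-dim bidirectional text embedding. A minimal sketch of the head:

```python
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    """Sketch of the fusion + regression head described above."""
    def __init__(self, audio_dim: int = 128, text_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(audio_dim + text_dim, 256),  # 384 -> 256
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
            nn.Sigmoid(),  # squashes the output into [0, 1]
        )

    def forward(self, audio_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([audio_emb, text_emb], dim=-1)  # (batch, 384)
        return self.mlp(fused).squeeze(-1)                # (batch,) reward scores
```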
## Objective

- Loss: Mean Squared Error (MSE)
- Goal: Predict the similarity score between the spoken audio and its transcription (a sketch of one training step follows this list).
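Using the toy encoders sketched above, one MSE training step might look like this; the batch names and model signature are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, mel, token_ids, human_scores):
    """One hypothetical MSE training step against annotator scores."""
    preds = model(mel, token_ids)           # predicted rewards in [0, 1], shape (batch,)
    loss = F.mse_loss(preds, human_scores)  # regress onto the human-assigned scores
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```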
## Example Usage
First, install our package:

```bash
pip install git+https://github.com/diarray-hub/bambara-asr.git@rlnf-v2-gpu
```
```python
import torch

from RLNF.Rewards.reward_config import RewardConfig
from RLNF.Rewards.reward_model import RewardModel
from RLNF.Rewards.reward_processor import RewardModelProcessor

audios = ["1.wav", "2.wav"]
texts = ["kelen", "fila."]

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pretrained processor and reward model from the Hub
processor: RewardModelProcessor = RewardModelProcessor.from_pretrained("RobotsMali/reward-model")
model: RewardModel = RewardModel.from_pretrained("RobotsMali/reward-model")
model.eval()
model.to(device)

# Preprocess the (audio, text) pairs and move tensors to the target device
out = processor(audios=audios, texts=texts)
out = {k: v.to(device) if torch.is_tensor(v) else v for k, v in out.items()}

with torch.no_grad():
    preds = model(**out).logits

for i, (t, val) in enumerate(zip(texts, preds)):
    # Scale the [0, 1] reward to a percentage for display
    print(f"Audio: {audios[i]:<10} | Text: {t:<10} | Score: {val.item() * 100:.4f}")
```
## Evaluation Results
| Metric   | Value  |
|----------|--------|
| MSE      | 0.0767 |
| R²       | 0.4268 |
| Pearson  | 0.6603 |
| Accuracy | 0.33   |