Description
This model is a Reward Model trained on the RobotsMali transcription scorer dataset, where the scores were assigned by human annotators.
It predicts a continuous score between 0 and 1 for a pair (audio, text), representing how well the text matches the spoken audio.
The model can be integrated as a Reward Model within RLHF pipelines to evaluate or fine-tune ASR models based on human preference scores.
Model Overview
The model consists of two main encoders, one for audio and one for text, followed by a small regression head that outputs a scalar score.
Audio Encoder
Input: Raw waveform (16 kHz)
Feature extraction: Mel-spectrogram computed from waveform using WhisperFeatureExtractor
Parameters:
- n_fft: 1024
- n_mels: 80
- hop_length: 256
- sample_rate: 16000
Architecture:
- 3 × (Conv1d → BatchNorm1d → ReLU)
- Kernel size: 5, stride: 1, padding: 2.
- Channel size: 128.
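The convolutional stack above can be sketched as follows. This is an illustrative reconstruction, not the actual checkpoint code: the class and attribute names (`AudioEncoder`, `net`) and the mean pooling over time are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the audio encoder: three Conv1d -> BatchNorm1d -> ReLU
# blocks over an 80-bin mel-spectrogram, pooled to a single 128-d embedding.
class AudioEncoder(nn.Module):
    def __init__(self, n_mels: int = 80, channels: int = 128):
        super().__init__()
        layers, in_ch = [], n_mels
        for _ in range(3):
            layers += [
                nn.Conv1d(in_ch, channels, kernel_size=5, stride=1, padding=2),
                nn.BatchNorm1d(channels),
                nn.ReLU(),
            ]
            in_ch = channels
        self.net = nn.Sequential(*layers)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames) -> (batch, channels)
        return self.net(mel).mean(dim=-1)
```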
Text Encoder
Input: Tokenized transcription (IDs from SentencePiece tokenizer)
Architecture:
- Embedding layer: dim = 128, vocab_size = 2048
- Bidirectional LSTM: hidden size = 128, 1 layer
- Output: mean pooling over valid tokens
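A minimal sketch of the text encoder described above, under the assumption that padding tokens use ID 0 and a binary mask marks valid tokens; names like `TextEncoder` are illustrative. Note the bidirectional LSTM yields a 256-d embedding (2 × 128).

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the text encoder: embedding + 1-layer BiLSTM,
# mean-pooled over valid (non-padding) tokens.
class TextEncoder(nn.Module):
    def __init__(self, vocab_size: int = 2048, dim: int = 128, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.lstm = nn.LSTM(dim, hidden, num_layers=1,
                            batch_first=True, bidirectional=True)

    def forward(self, ids: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # ids: (batch, seq); mask: (batch, seq), 1 for real tokens, 0 for padding
        out, _ = self.lstm(self.embed(ids))           # (batch, seq, 2 * hidden)
        mask = mask.unsqueeze(-1).float()
        return (out * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
```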
Fusion & Regression Head
Fusion: Concatenate [audio_emb, text_emb]  
Regression head:
- Linear(384 → 256) → ReLU → Dropout(0.3)
- Linear(256 → 256) → ReLU
- Linear(256 → 1) → Sigmoid
Output: Scalar ∈ [0, 1] (reward score)
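The fusion and regression head can be sketched as below, assuming a 128-d audio embedding and a 256-d text embedding (384 after concatenation, matching the first Linear layer); class and attribute names are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the fusion + regression head: concatenate the
# audio and text embeddings, then regress a score squashed into [0, 1].
class RewardHead(nn.Module):
    def __init__(self, audio_dim: int = 128, text_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + text_dim, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, audio_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # (batch, 128) + (batch, 256) -> (batch,) reward scores in [0, 1]
        return self.net(torch.cat([audio_emb, text_emb], dim=-1)).squeeze(-1)
```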
Objective
- Loss: Mean Squared Error (MSE)
- Goal: Predict the similarity score between the spoken audio and its transcription.
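An illustrative MSE training step under these assumptions: the model maps fused features to a score in [0, 1], and targets are the human-assigned scores in the same range. The stand-in model and tensor shapes are hypothetical.

```python
import torch
import torch.nn as nn

# Stand-in for the full reward model (hypothetical; for illustration only).
model = nn.Sequential(nn.Linear(384, 1), nn.Sigmoid())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

features = torch.randn(4, 384)  # stand-in fused (audio, text) embeddings
targets = torch.rand(4)         # human-assigned scores in [0, 1]

# One MSE regression step, matching the stated objective.
pred = model(features).squeeze(-1)
loss = loss_fn(pred, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```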
Example Usage
Coming soon.