---
library_name: transformers
tags:
- generated_from_trainer
model-index:
- name: reward-model
  results: []
license: mit
datasets:
- RobotsMali/transcription-scorer
language:
- bm
---

# reward-model

This model is a Reward Model trained on the [RobotsMali transcription scorer dataset](https://huggingface.co/datasets/RobotsMali/transcription-scorer).
It achieves the following results on the evaluation set:
- Loss: 0.0609
- R2: 0.5447
- Pearson: 0.7406

## Model description

This model is a Reward Model trained on the [RobotsMali transcription scorer dataset](https://huggingface.co/datasets/RobotsMali/transcription-scorer), where the scores were assigned by human annotators. It predicts a continuous score between 0 and 1 for an (audio, text) pair, representing how well the text matches the spoken audio.

The model can be integrated as a Reward Model within RLHF pipelines to evaluate or fine-tune ASR models based on human preference scores.

## Intended uses & limitations

### Intended uses

- Evaluate the quality of an ASR transcription against audio, producing a continuous score in [0, 1].
- Integrate as a **Reward Model** in **RLHF** (Reinforcement Learning from Human Feedback) pipelines for fine-tuning ASR models.
- Automatically compare transcriptions generated by different ASR systems or models.
- Serve as a reference-free proxy metric for ASR, allowing approximate quality evaluation without requiring reference transcriptions.

### Limitations

- Sensitive to **accents, background noise, or pronunciation variations** not represented in the RobotsMali dataset.
- Scores are based on **rules defined by our team**, rather than purely subjective judgment, and reflect the specific scoring criteria we established for the dataset.

# Training Procedure

## Audio Encoder

**Input:** Raw waveform (16 kHz)

**Feature extraction:** Mel-spectrogram computed with the **processor of [RobotsMali's STT-BM-QuartzNet15x5-V0 model](https://huggingface.co/RobotsMali/stt-bm-quartznet15x5-V0)**

**Architecture:**
- 1D convolutional layers: `audio_conv_layers` × (Conv1d → BatchNorm1d → ReLU)
- Channels: `audio_conv_channels` (input channels = 64, kernel size = `kernel_size`, stride = `stride`, padding = `padding`)
- Adaptive average pooling over time → output dimension = `audio_conv_channels`

---

## Text Encoder

**Input:** Tokenized transcription (IDs from a SentencePiece tokenizer)

**Architecture:**
- Embedding layer: `embed_dim` (vocab_size = `vocab_size`, padding_idx = `pad_token_id`)
- Bidirectional LSTM: hidden size = `lstm_hidden`, layers = `lstm_layers`
- Sequence pooling: masked mean pooling over the sequence length → output dimension = `2 * lstm_hidden`

---

## Fusion & Regression Head

**Fusion:** Concatenate `[audio_emb, text_emb]` → combined_dim = `audio_conv_channels + 2 * lstm_hidden`

**Regression head:**
- Linear(combined_dim → `head_hidden`) → ReLU → Dropout(`dropout`)
- Linear(`head_hidden` → `head_hidden`) → ReLU
- Linear(`head_hidden` → 1) → Sigmoid

**Output:** Scalar ∈ [0, 1] (the predicted reward score)

---

## Objective

- **Loss:** Mean Squared Error (MSE)
- **Goal:** Predict the similarity between spoken audio and its transcription

| Parameter               | Value |
|-------------------------|-------|
| `audio_conv_layers`     | 3     |
| `audio_conv_channels`   | 128   |
| `kernel_size`           | 5     |
| `stride`                | 1     |
| `padding`               | 2     |
| `embed_dim`             | 128   |
| `vocab_size`            | 2048  |
| `lstm_hidden`           | 128   |
| `lstm_layers`           | 1     |
| `head_hidden`           | 256   |
| `dropout`               | 0.1   |
| `pad_token_id`          | 1     |
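For reference, the sketch below re-expresses the architecture described above in plain PyTorch, using the hyperparameter values from the table. It is a minimal illustration, not the exact `RewardModel` implementation from the RLNF package: the class name `RewardModelSketch` and the `forward` signature are hypothetical, while the layer structure and the 64 mel input channels follow the description above.

```python
# Minimal sketch of the architecture described above (not the RLNF implementation).
import torch
import torch.nn as nn


class RewardModelSketch(nn.Module):
    def __init__(self, audio_conv_layers=3, audio_conv_channels=128, kernel_size=5,
                 stride=1, padding=2, embed_dim=128, vocab_size=2048, lstm_hidden=128,
                 lstm_layers=1, head_hidden=256, dropout=0.1, pad_token_id=1, n_mels=64):
        super().__init__()

        # Audio encoder: stacked Conv1d -> BatchNorm1d -> ReLU blocks over the mel features
        convs, in_ch = [], n_mels
        for _ in range(audio_conv_layers):
            convs += [
                nn.Conv1d(in_ch, audio_conv_channels, kernel_size, stride, padding),
                nn.BatchNorm1d(audio_conv_channels),
                nn.ReLU(),
            ]
            in_ch = audio_conv_channels
        self.audio_encoder = nn.Sequential(*convs)
        self.audio_pool = nn.AdaptiveAvgPool1d(1)  # adaptive average pooling over time

        # Text encoder: embedding + bidirectional LSTM
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_token_id)
        self.lstm = nn.LSTM(embed_dim, lstm_hidden, num_layers=lstm_layers,
                            batch_first=True, bidirectional=True)

        # Fusion + regression head ending in a sigmoid, so the score lies in [0, 1]
        combined_dim = audio_conv_channels + 2 * lstm_hidden
        self.head = nn.Sequential(
            nn.Linear(combined_dim, head_hidden), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(head_hidden, head_hidden), nn.ReLU(),
            nn.Linear(head_hidden, 1), nn.Sigmoid(),
        )

    def forward(self, audio_features, input_ids, attention_mask):
        # audio_features: (B, n_mels, T) mel-spectrogram
        audio_emb = self.audio_pool(self.audio_encoder(audio_features)).squeeze(-1)

        # input_ids: (B, L) token IDs, attention_mask: (B, L) with 1 for real tokens
        lstm_out, _ = self.lstm(self.embedding(input_ids))          # (B, L, 2 * lstm_hidden)
        mask = attention_mask.unsqueeze(-1).float()
        text_emb = (lstm_out * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)

        # Concatenate both embeddings and regress to a scalar score per pair
        return self.head(torch.cat([audio_emb, text_emb], dim=-1)).squeeze(-1)
```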
### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
- lr_scheduler_type: cosine
- num_epochs: 10

### Training results

| Training Loss | Epoch | Step | Validation Loss | Mse    | R2     | Pearson |
|:-------------:|:-----:|:----:|:---------------:|:------:|:------:|:-------:|
| 0.1237        | 0.8   | 100  | 0.1100          | 0.1100 | 0.1781 | 0.5916  |
| 0.0675        | 1.6   | 200  | 0.0723          | 0.0723 | 0.4597 | 0.6906  |
| 0.0562        | 2.4   | 300  | 0.0684          | 0.0684 | 0.4890 | 0.7094  |
| 0.0625        | 3.2   | 400  | 0.0650          | 0.0650 | 0.5145 | 0.7175  |
| 0.0563        | 4.0   | 500  | 0.0662          | 0.0662 | 0.5055 | 0.7120  |
| 0.0478        | 4.8   | 600  | 0.0616          | 0.0616 | 0.5396 | 0.7398  |
| 0.0454        | 5.6   | 700  | 0.0634          | 0.0634 | 0.5266 | 0.7264  |
| 0.0429        | 6.4   | 800  | 0.0607          | 0.0607 | 0.5467 | 0.7404  |
| 0.0422        | 7.2   | 900  | 0.0615          | 0.0615 | 0.5405 | 0.7429  |
| 0.0421        | 8.0   | 1000 | 0.0622          | 0.0622 | 0.5353 | 0.7338  |
| 0.0423        | 8.8   | 1100 | 0.0610          | 0.0610 | 0.5446 | 0.7424  |
| 0.0485        | 9.6   | 1200 | 0.0610          | 0.0610 | 0.5445 | 0.7416  |

### Framework versions

- Transformers 4.53.3
- Pytorch 2.9.0+cu128
- Datasets 3.3.2
- Tokenizers 0.21.4

# Example Usage

First, install our package:

```bash
pip install git+https://github.com/diarray-hub/bambara-asr.git@rlnf-v2-gpu
```

```python
import torch
from RLNF.Rewards.reward_model import RewardModel
from RLNF.Rewards.reward_processor import RewardModelProcessor
from RLNF.Rewards.reward_feature_extraction import RewardFeatureExtractor
from transformers import T5Tokenizer
from nemo.collections.asr.models import EncDecCTCModel

audios = ["1.wav", "2.wav"]
texts = ["kelen", "fila."]

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the tokenizer, the ASR model used for feature extraction, and the reward model
tokenizer: T5Tokenizer = T5Tokenizer.from_pretrained("RobotsMali/reward-model")
asr_model: EncDecCTCModel = EncDecCTCModel.from_pretrained("RobotsMali/stt-bm-quartznet15x5-V0")
feature_extractor: RewardFeatureExtractor = RewardFeatureExtractor(asr_model)
processor: RewardModelProcessor = RewardModelProcessor(feature_extractor, tokenizer)

model: RewardModel = RewardModel.from_pretrained("RobotsMali/reward-model")
model.eval()
model.to(device)

# Preprocess the (audio, text) pairs and move all tensors to the target device
out = processor(audios=audios, texts=texts)
out = {k: v.to(device) if torch.is_tensor(v) else v for k, v in out.items()}

with torch.no_grad():
    preds = model(**out).logits

for i, (t, val) in enumerate(zip(texts, preds)):
    print(f"Audio: {audios[i]:<10} | Text: {t:<10} | Score: {val.item() * 100:.4f}")
```
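Since the scores are reference-free, they can also be used to compare candidate transcriptions of the same audio, as mentioned under intended uses. The snippet below is a small sketch that reuses the `processor`, `model`, and `device` objects from the example above; the candidate strings are illustrative placeholders.

```python
# Rank hypothetical candidate transcriptions for the same audio clip,
# reusing `processor`, `model`, and `device` from the example above.
candidates = ["kelen", "kele"]  # illustrative candidates for "1.wav"
batch = processor(audios=["1.wav"] * len(candidates), texts=candidates)
batch = {k: v.to(device) if torch.is_tensor(v) else v for k, v in batch.items()}

with torch.no_grad():
    scores = model(**batch).logits

# Pick the candidate with the highest predicted reward score
best_text, best_score = max(zip(candidates, scores), key=lambda pair: pair[1].item())
print(f"Best candidate: {best_text} | Score: {best_score.item():.4f}")
```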