---
library_name: transformers
tags:
- generated_from_trainer
model-index:
- name: reward-model
  results: []
license: mit
datasets:
- RobotsMali/transcription-scorer
language:
- bm
---

# reward-model

This model is a Reward Model trained on the [RobotsMali transcription scorer dataset](https://huggingface.co/datasets/RobotsMali/transcription-scorer).
It achieves the following results on the evaluation set:
- Loss: 0.0609
- R2: 0.5447
- Pearson: 0.7406

## Model description

This model is a Reward Model trained on the [RobotsMali transcription scorer dataset](https://huggingface.co/datasets/RobotsMali/transcription-scorer), where the scores were assigned by human annotators. It predicts a continuous score between 0 and 1 for an (audio, text) pair, representing how well the text matches the spoken audio.

The model can be integrated as a Reward Model within RLHF pipelines to evaluate or fine-tune ASR models based on human preference scores.

## Intended uses & limitations

### Intended uses

- Evaluate the quality of an ASR transcription against audio, producing a continuous score in [0, 1].
- Integrate as a **Reward Model** in **RLHF** (Reinforcement Learning from Human Feedback) pipelines for fine-tuning ASR models.
- Automatically compare transcriptions generated by different ASR systems or models.
- Serve as a reference-free proxy metric for ASR, allowing approximate quality evaluation without requiring reference transcriptions.

### Limitations

- Sensitive to **accents, background noise, or pronunciation variations** not represented in the RobotsMali dataset.
- Scores are based on **rules defined by our team**, rather than purely subjective judgment, and reflect the specific scoring criteria we established for the dataset.

# Training Procedure

## Audio Encoder

**Input:** Raw waveform (16 kHz)

**Feature extraction:** Mel-spectrogram computed with the **processor of [RobotsMali's STT-BM-QuartzNet15x5-V0 model](https://huggingface.co/RobotsMali/stt-bm-quartznet15x5-V0)**

**Architecture:**
- 1D convolutional layers: `audio_conv_layers` × (Conv1d → BatchNorm1d → ReLU)
- Channels: `audio_conv_channels` (input channels = 64, kernel size = `kernel_size`, stride = `stride`, padding = `padding`)
- Adaptive average pooling over time → output dimension = `audio_conv_channels`

---

## Text Encoder

**Input:** Tokenized transcription (IDs from a SentencePiece tokenizer)

**Architecture:**
- Embedding layer: `embed_dim` (vocab_size = `vocab_size`, padding_idx = `pad_token_id`)
- Bidirectional LSTM: hidden size = `lstm_hidden`, layers = `lstm_layers`
- Sequence pooling: masked mean pooling over the sequence length → output dimension = `2 * lstm_hidden`

---

## Fusion & Regression Head

**Fusion:** Concatenate `[audio_emb, text_emb]` → combined_dim = `audio_conv_channels + 2 * lstm_hidden`

**Regression head:**
- Linear(combined_dim → `head_hidden`) → ReLU → Dropout(`dropout`)
- Linear(`head_hidden` → `head_hidden`) → ReLU
- Linear(`head_hidden` → 1) → Sigmoid

**Output:** Scalar ∈ [0, 1] (the predicted reward score)

---

## Objective

- **Loss:** Mean Squared Error (MSE)
- **Goal:** Predict the similarity between spoken audio and its transcription

| Parameter               | Value |
|-------------------------|-------|
| `audio_conv_layers`     | 3     |
| `audio_conv_channels`   | 128   |
| `kernel_size`           | 5     |
| `stride`                | 1     |
| `padding`               | 2     |
| `embed_dim`             | 128   |
| `vocab_size`            | 2048  |
| `lstm_hidden`           | 128   |
| `lstm_layers`           | 1     |
| `head_hidden`           | 256   |
| `dropout`               | 0.1   |
| `pad_token_id`          | 1     |
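For reference, the sketch below re-expresses the architecture described above in plain PyTorch, using the hyperparameter values from the table. It is a minimal illustration, not the exact `RewardModel` implementation from the RLNF package: the class name `RewardModelSketch` and the `forward` signature are hypothetical, while the layer structure and the 64 mel input channels follow the description above.

```python
# Minimal sketch of the architecture described above (not the RLNF implementation).
import torch
import torch.nn as nn


class RewardModelSketch(nn.Module):
    def __init__(self, audio_conv_layers=3, audio_conv_channels=128, kernel_size=5,
                 stride=1, padding=2, embed_dim=128, vocab_size=2048, lstm_hidden=128,
                 lstm_layers=1, head_hidden=256, dropout=0.1, pad_token_id=1, n_mels=64):
        super().__init__()

        # Audio encoder: stacked Conv1d -> BatchNorm1d -> ReLU blocks over the mel features
        convs, in_ch = [], n_mels
        for _ in range(audio_conv_layers):
            convs += [
                nn.Conv1d(in_ch, audio_conv_channels, kernel_size, stride, padding),
                nn.BatchNorm1d(audio_conv_channels),
                nn.ReLU(),
            ]
            in_ch = audio_conv_channels
        self.audio_encoder = nn.Sequential(*convs)
        self.audio_pool = nn.AdaptiveAvgPool1d(1)  # adaptive average pooling over time

        # Text encoder: embedding + bidirectional LSTM
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_token_id)
        self.lstm = nn.LSTM(embed_dim, lstm_hidden, num_layers=lstm_layers,
                            batch_first=True, bidirectional=True)

        # Fusion + regression head ending in a sigmoid, so the score lies in [0, 1]
        combined_dim = audio_conv_channels + 2 * lstm_hidden
        self.head = nn.Sequential(
            nn.Linear(combined_dim, head_hidden), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(head_hidden, head_hidden), nn.ReLU(),
            nn.Linear(head_hidden, 1), nn.Sigmoid(),
        )

    def forward(self, audio_features, input_ids, attention_mask):
        # audio_features: (B, n_mels, T) mel-spectrogram
        audio_emb = self.audio_pool(self.audio_encoder(audio_features)).squeeze(-1)

        # input_ids: (B, L) token IDs, attention_mask: (B, L) with 1 for real tokens
        lstm_out, _ = self.lstm(self.embedding(input_ids))          # (B, L, 2 * lstm_hidden)
        mask = attention_mask.unsqueeze(-1).float()
        text_emb = (lstm_out * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)

        # Concatenate both embeddings and regress to a scalar score per pair
        return self.head(torch.cat([audio_emb, text_emb], dim=-1)).squeeze(-1)
```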
### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
- lr_scheduler_type: cosine
- num_epochs: 10

### Training results

| Training Loss | Epoch | Step | Validation Loss | Mse    | R2     | Pearson |
|:-------------:|:-----:|:----:|:---------------:|:------:|:------:|:-------:|
| 0.1237        | 0.8   | 100  | 0.1100          | 0.1100 | 0.1781 | 0.5916  |
| 0.0675        | 1.6   | 200  | 0.0723          | 0.0723 | 0.4597 | 0.6906  |
| 0.0562        | 2.4   | 300  | 0.0684          | 0.0684 | 0.4890 | 0.7094  |
| 0.0625        | 3.2   | 400  | 0.0650          | 0.0650 | 0.5145 | 0.7175  |
| 0.0563        | 4.0   | 500  | 0.0662          | 0.0662 | 0.5055 | 0.7120  |
| 0.0478        | 4.8   | 600  | 0.0616          | 0.0616 | 0.5396 | 0.7398  |
| 0.0454        | 5.6   | 700  | 0.0634          | 0.0634 | 0.5266 | 0.7264  |
| 0.0429        | 6.4   | 800  | 0.0607          | 0.0607 | 0.5467 | 0.7404  |
| 0.0422        | 7.2   | 900  | 0.0615          | 0.0615 | 0.5405 | 0.7429  |
| 0.0421        | 8.0   | 1000 | 0.0622          | 0.0622 | 0.5353 | 0.7338  |
| 0.0423        | 8.8   | 1100 | 0.0610          | 0.0610 | 0.5446 | 0.7424  |
| 0.0485        | 9.6   | 1200 | 0.0610          | 0.0610 | 0.5445 | 0.7416  |

### Framework versions

- Transformers 4.53.3
- Pytorch 2.9.0+cu128
- Datasets 3.3.2
- Tokenizers 0.21.4

# Example Usage

First, install our package:

```bash
pip install git+https://github.com/diarray-hub/bambara-asr.git@rlnf-v2-gpu
```

```python
import torch
from RLNF.Rewards.reward_model import RewardModel
from RLNF.Rewards.reward_processor import RewardModelProcessor
from RLNF.Rewards.reward_feature_extraction import RewardFeatureExtractor
from transformers import T5Tokenizer
from nemo.collections.asr.models import EncDecCTCModel

audios = ["1.wav", "2.wav"]
texts = ["kelen", "fila."]

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the tokenizer, the ASR model used for feature extraction, and the reward model
tokenizer: T5Tokenizer = T5Tokenizer.from_pretrained("RobotsMali/reward-model")
asr_model: EncDecCTCModel = EncDecCTCModel.from_pretrained("RobotsMali/stt-bm-quartznet15x5-V0")
feature_extractor: RewardFeatureExtractor = RewardFeatureExtractor(asr_model)
processor: RewardModelProcessor = RewardModelProcessor(feature_extractor, tokenizer)

model: RewardModel = RewardModel.from_pretrained("RobotsMali/reward-model")
model.eval()
model.to(device)

# Preprocess the (audio, text) pairs and move all tensors to the target device
out = processor(audios=audios, texts=texts)
out = {k: v.to(device) if torch.is_tensor(v) else v for k, v in out.items()}

with torch.no_grad():
    preds = model(**out).logits

for i, (t, val) in enumerate(zip(texts, preds)):
    print(f"Audio: {audios[i]:<10} | Text: {t:<10} | Score: {val.item() * 100:.4f}")
```
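Since the scores are reference-free, they can also be used to compare candidate transcriptions of the same audio, as mentioned under intended uses. The snippet below is a small sketch that reuses the `processor`, `model`, and `device` objects from the example above; the candidate strings are illustrative placeholders.

```python
# Rank hypothetical candidate transcriptions for the same audio clip,
# reusing `processor`, `model`, and `device` from the example above.
candidates = ["kelen", "kele"]  # illustrative candidates for "1.wav"
batch = processor(audios=["1.wav"] * len(candidates), texts=candidates)
batch = {k: v.to(device) if torch.is_tensor(v) else v for k, v in batch.items()}

with torch.no_grad():
    scores = model(**batch).logits

# Pick the candidate with the highest predicted reward score
best_text, best_score = max(zip(candidates, scores), key=lambda pair: pair[1].item())
print(f"Best candidate: {best_text} | Score: {best_score.item():.4f}")
```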