Part of the **Best Fine-Tuned ASR Models** collection, which gathers the best models fine-tuned across several Automatic Speech Recognition experiments.
Fine-tuned NVIDIA Parakeet-TDT-0.6B-v3 for Slovenian automatic speech recognition, augmented with TTS-generated synthetic data.
This model is part of the paper: "Synthetic Speech Augmentation for Low-Resource Estonian and Slovenian ASR: Comparing Parakeet-TDT and Whisper" (Interspeech 2026). Paper coming soon.
Results for `nvidia/parakeet-tdt-0.6b-v3`:

| Test Set | WER (%) | CER (%) |
|---|---|---|
| CommonVoice 17 Test | 11.40 | 2.83 |
| CommonVoice 17 Val | 10.63 | 2.45 |
| FLEURS Test | 34.89 | 9.88 |

Results for this model (`yuriyvnv/parakeet-tdt-0.6b-slovenian`):

| Test Set | WER (%) | CER (%) |
|---|---|---|
| CommonVoice 17 Test | 8.81 | 2.39 |
| CommonVoice 17 Val | 8.50 | 2.08 |
| FLEURS Test | 17.74 | 6.22 |
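For reference, WER and CER are both edit-distance metrics: the minimum number of insertions, deletions, and substitutions needed to turn the hypothesis into the reference, divided by the reference length (in words or characters). A minimal pure-Python sketch, not the exact scoring script used for the tables above:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (one-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        cur = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            sub = prev[j - 1] + (ref[i - 1] != hyp[j - 1])
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, sub)
        prev = cur
    return prev[-1]

def wer(ref, hyp):
    """Word error rate: word-level edit distance over reference word count."""
    r, h = ref.split(), hyp.split()
    return edit_distance(r, h) / len(r)

def cer(ref, hyp):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)
```

In practice, published numbers also depend on text normalization (casing, punctuation), so results are only comparable when the same normalization is applied to all systems.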
| Comparison | CV17 Test (WER) | FLEURS Test (WER) |
|---|---|---|
| vs. Zero-shot | -38.83 pp | -5.28 pp |
| vs. CV-only fine-tuning | -2.68 pp | -3.68 pp |
All improvements are statistically significant (paired bootstrap, p < 0.001, n = 100,000).
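A paired bootstrap test resamples test utterances with replacement and checks how often one system's total error stays below the other's. The sketch below is a minimal pure-Python illustration of the idea, not necessarily the authors' exact procedure; `errors_a`/`errors_b` are assumed to be per-utterance word-error counts for the two systems on the same test set:

```python
import random

def paired_bootstrap(errors_a, errors_b, n_resamples=100_000, seed=0):
    """Paired bootstrap significance test.

    Resamples utterance indices with replacement; counts how often
    system A's total error is strictly lower than system B's, and
    returns the fraction of resamples where A did NOT win (a one-sided
    p-value for "A is better than B").
    """
    rng = random.Random(seed)
    n = len(errors_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(errors_a[i] for i in idx) < sum(errors_b[i] for i in idx):
            wins += 1
    return 1.0 - wins / n_resamples
```

Because resampling is paired (the same utterance indices are drawn for both systems), per-utterance difficulty cancels out, which makes the test much more sensitive than comparing unpaired score distributions.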
```python
import nemo.collections.asr as nemo_asr

# Load the fine-tuned model from the Hugging Face Hub
model = nemo_asr.models.ASRModel.from_pretrained("yuriyvnv/parakeet-tdt-0.6b-slovenian")

# Transcribe one or more audio files
transcriptions = model.transcribe(["audio.wav"])
print(transcriptions[0].text)
```
The synthetic training data was generated using a three-stage TTS pipeline.

Dataset: `yuriyvnv/synthetic_asr_et_sl`
Base model: `nvidia/parakeet-tdt-0.6b-v3`