whisper-large-v3-turbo-RKNN2


Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed by Alec Radford et al. from OpenAI in the paper Robust Speech Recognition via Large-Scale Weak Supervision. Trained on over 5 million hours of labeled data, the model generalizes well across datasets and domains in zero-shot settings. Whisper large-v3-turbo is a fine-tuned version of a pruned Whisper large-v3. In other words, it is the same model except that the number of decoder layers is reduced from 32 to 4, which makes inference much faster at the cost of a minor quality degradation.

Inference speed (RK3588, single NPU core): Encoder ~12 s, Decoder ~380 ms per step (~2.6 tokens/s)
Approximate memory usage (RK3588): ~2.2 GB
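
The tokens-per-second figure follows directly from the decoder step latency, and the two numbers together give a rough idea of how long one chunk takes. The sketch below is only back-of-the-envelope arithmetic on the figures quoted above; the function names and the 100-token transcript length are made up for illustration.

```python
# Figures quoted above for RK3588, single NPU core (approximate).
ENCODER_S = 12.0        # encoder time for one chunk
DECODER_STEP_S = 0.380  # time per decoder step, i.e. per generated token


def tokens_per_second(step_s: float) -> float:
    """Decoder throughput implied by the per-step latency."""
    return 1.0 / step_s


def estimate_chunk_seconds(num_tokens: int,
                           encoder_s: float = ENCODER_S,
                           step_s: float = DECODER_STEP_S) -> float:
    """Rough wall time for one chunk: one encoder pass plus one decoder step per token."""
    return encoder_s + num_tokens * step_s


if __name__ == "__main__":
    print(f"{tokens_per_second(DECODER_STEP_S):.2f} tokens/s")
    # 100 tokens is a hypothetical transcript length for a single chunk.
    print(f"{estimate_chunk_seconds(100):.1f} s for a 100-token chunk")
```

This ignores audio loading, feature extraction and any per-step overhead outside the decoder call, so real runs will likely be somewhat slower.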

Usage

Clone the repository locally.
Install dependencies:

pip install numpy scipy soundfile tqdm tokenizers librosa ztu-somemodelruntime-ez-rknn-async

Run inference:

python deploy_onnx_no_transformers-rknn2.py --model-dir . --wav long_test.wav --encoder encoder_with_kv.rknn --decoder decoder_static_kv.rknn --language zh --task transcribe

Sample Output

[chunk 0 | 0.00s -> 30.00s]
大家好们呀今天给大家分享的是在选一些语音生成网站的合集能够更加方便大家选择自己想要生成的角色进入网站可以看到所有的生成网行都在这里选择你想要生成的角色点击进入就来到了生成的页面在文本框内输入你想要生成的内容然后点击生成就好了另外呢因为每次的生成结果都会有一些不一样的地方
[chunk 1 | 30.00s -> 40.31s]
生成效果不好的话可以尝试重新生成也可以稍微调整一下下面的数值再生成试试使用时一定要遵守法律法规不可以损害刷人的形象哦
wav: long_test.wav
audio_seconds: 40.31
chunk_count: 2
chunk_seconds: 30.00
chunk_overlap_seconds: 0.00
encoder_ms: 38757.05
avg_decoder_step_ms: 380.94
merged_text:
大家好们呀今天给大家分享的是在选一些语音生成网站的合集能够更加方便大家选择自己想要生成的角色进入网站可以看到所有的生成网行都在这里选择你想要生成的角色点击进入就来到了生成的页面在文本框内输入你想要生成的内容然后点击生成就好了另外呢因为每次的生成结果都会有一些不一样的地方生成效果不好的话可以尝试重新生成也可以稍微调整一下下面的数值再生成试试使用时一定要遵守法律法规不可以损害刷人的形象哦
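
The chunk boundaries in the log are consistent with cutting the audio into fixed 30-second windows with no overlap. A minimal sketch of that splitting logic is shown here; `chunk_bounds` is a hypothetical helper written for illustration and is not taken from the deploy script, which may handle chunking differently.

```python
from typing import List, Tuple


def chunk_bounds(audio_seconds: float,
                 chunk_seconds: float = 30.0,
                 overlap_seconds: float = 0.0) -> List[Tuple[float, float]]:
    """Return (start, end) times, in seconds, of fixed-size chunks covering the audio."""
    if chunk_seconds <= overlap_seconds:
        raise ValueError("chunk_seconds must be larger than overlap_seconds")
    step = chunk_seconds - overlap_seconds
    bounds = []
    start = 0.0
    while start < audio_seconds:
        end = min(start + chunk_seconds, audio_seconds)
        bounds.append((start, end))
        if end >= audio_seconds:
            break  # the last chunk reached the end of the audio
        start += step
    return bounds


if __name__ == "__main__":
    # Same duration and chunk settings as in the sample output above.
    for i, (start, end) in enumerate(chunk_bounds(40.31)):
        print(f"[chunk {i} | {start:.2f}s -> {end:.2f}s]")
```

With a non-zero overlap, consecutive chunks would share audio and the merged text would need de-duplication at the seams, which this sketch does not attempt.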

Model Conversion

Install the dependencies. The following versions have been tested:


torch==2.10.0
rknn-toolkit2==2.3.2

Download the Whisper model weights locally.
Export to ONNX:

python export_static_decoder.py \
  --model-dir . \
  --encoder-onnx-path encoder_with_kv.onnx \
  --onnx-path decoder_static_kv.onnx \
  --max-decode-len 448
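
The `--max-decode-len 448` argument fixes the decoder's KV cache length at export time. The idea of a fixed-shape cache can be sketched in NumPy as below: the buffer is allocated once at full length, each step writes one position, and a mask hides the positions that are still empty. The layout `(layers, 2, heads, length, head_dim)`, the function names and the mask value are assumptions for illustration and may not match what `export_static_decoder.py` produces; 20 heads of size 64 are taken from the published large-v3-turbo configuration.

```python
import numpy as np

NUM_LAYERS = 4   # decoder layers in large-v3-turbo
NUM_HEADS = 20   # attention heads (published model config)
HEAD_DIM = 64    # 1280 hidden size / 20 heads
MAX_LEN = 448    # matches --max-decode-len


def new_cache() -> np.ndarray:
    """Allocate the whole cache up front so its shape never changes between steps."""
    return np.zeros((NUM_LAYERS, 2, NUM_HEADS, MAX_LEN, HEAD_DIM), dtype=np.float32)


def write_step(cache: np.ndarray, step: int, keys: np.ndarray, values: np.ndarray) -> None:
    """Store one token's keys/values, each shaped (NUM_LAYERS, NUM_HEADS, HEAD_DIM)."""
    if not 0 <= step < MAX_LEN:
        raise IndexError(f"step {step} is outside the fixed cache length {MAX_LEN}")
    cache[:, 0, :, step, :] = keys
    cache[:, 1, :, step, :] = values


def step_mask(step: int) -> np.ndarray:
    """Additive attention mask: 0 for filled positions, large negative for empty ones."""
    mask = np.full((MAX_LEN,), -1e4, dtype=np.float32)
    mask[: step + 1] = 0.0
    return mask
```

Under this assumed layout the float32 buffer holds about 4.6 million values (roughly 18 MB), which the model reads in full on every step regardless of how many positions are filled.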

Convert to RKNN:

python convert_encoder_with_kv.py
python convert_decoder_static_kv.py
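
For readers unfamiliar with rknn-toolkit2, a conversion script typically follows the config, load, build, export sequence of the `RKNN` class. The sketch below shows that sequence in its simplest form; it is not the content of the two scripts above, and the function name, the choice of `do_quantization=False` and the default target are assumptions.

```python
def convert_onnx_to_rknn(onnx_path: str, rknn_path: str, target: str = "rk3588") -> None:
    """Convert one ONNX file to RKNN without quantization (assumed settings)."""
    # Imported lazily so this file can be loaded on machines without rknn-toolkit2.
    from rknn.api import RKNN

    rknn = RKNN()
    try:
        rknn.config(target_platform=target)
        # Each toolkit call returns 0 on success.
        if rknn.load_onnx(model=onnx_path) != 0:
            raise RuntimeError(f"failed to load {onnx_path}")
        if rknn.build(do_quantization=False) != 0:
            raise RuntimeError("RKNN build failed")
        if rknn.export_rknn(rknn_path) != 0:
            raise RuntimeError(f"failed to export {rknn_path}")
    finally:
        rknn.release()


# Example call (file names as used elsewhere in this README):
# convert_onnx_to_rknn("decoder_static_kv.onnx", "decoder_static_kv.rknn")
```

The actual scripts may set additional options in `config`, so treat this only as an outline of the toolkit workflow.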

Known Issues

During decoder inference, copying the inputs takes very long (~200 ms, well above the ~120 ms of actual model inference), so compute efficiency is poor. The 5D input shape of the KV cache may be the cause. Changing the input shape or using zero-copy inputs might resolve this.
Newer beta versions of the rknn-toolkit2 toolchain add tiling optimizations for matrix multiplication, which seem to help a lot with the large matmuls in the encoder. I have not tried them with this model, though, and the beta versions come with many bugs (not that the stable releases are bug-free either).
Whisper was first released in 2022 and it is 2026 now, so I do not recommend continuing to use this model. Take a look at my Qwen3-ASR-RKNN2 model instead: it delivers better accuracy, faster inference, and lower memory usage.
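
As an illustration of the input-shape idea from the first issue, a 5D KV cache can be viewed as a 4D tensor by merging the layer axis with the key/value axis. The NumPy sketch below only shows that the reshape is lossless and copy-free on the host side. Whether a 4D input actually shortens the copy on the NPU is untested, the decoder would have to be re-exported to accept the new shape, and the `(layers, 2, heads, length, head_dim)` layout and both function names are assumptions.

```python
import numpy as np


def flatten_kv(cache_5d: np.ndarray) -> np.ndarray:
    """View (layers, 2, heads, length, dim) as (layers * 2, heads, length, dim)."""
    layers, kv, heads, length, dim = cache_5d.shape
    return cache_5d.reshape(layers * kv, heads, length, dim)


def unflatten_kv(cache_4d: np.ndarray, num_layers: int) -> np.ndarray:
    """Inverse of flatten_kv: restore the separate layer and key/value axes."""
    merged, heads, length, dim = cache_4d.shape
    return cache_4d.reshape(num_layers, merged // num_layers, heads, length, dim)


if __name__ == "__main__":
    cache = np.zeros((4, 2, 20, 448, 64), dtype=np.float32)
    flat = flatten_kv(cache)
    print(flat.shape)                     # 4D shape after merging the first two axes
    print(np.shares_memory(cache, flat))  # reshape of a contiguous array is a view
```

In the flattened tensor, index `2 * layer` holds the keys and `2 * layer + 1` the values of a given layer.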

References

openai/whisper-large-v3-turbo
onnx-community/whisper-large-v3-turbo
Thanks to GPT-5.4 for reminding me that fixed-shape KV caches can be used to bypass the dynamic shape limitation.
