Paper: VibeVoice Technical Report (2508.19205)
ONNX INT4 quantized version of Microsoft VibeVoice-Realtime-0.5B for browser deployment with transformers.js.
| File | Size | Description |
|---|---|---|
| `tts_llm_int4.onnx` | 702 MB | Qwen2-based language model (text → hidden states) |
| `vocoder_int4.onnx` | 339 MB | σ-VAE decoder (latents → audio) |
| `diffusion_head_int4.onnx` | 25 MB | DDPM diffusion head (hidden states → latents) |
| `diffusion_head_int8.onnx` | 40 MB | INT8 variant for comparison |
| **Total** | ~1.07 GB | Down from 3.1 GB FP32 |
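As a quick sanity check on the table, the INT4 set (LLM, vocoder, and INT4 diffusion head) sums to the quoted total:

```javascript
// Per-file sizes in MB, taken from the table above (INT4 files only).
const int4SizesMB = { tts_llm: 702, vocoder: 339, diffusion_head: 25 };
const totalMB = Object.values(int4SizesMB).reduce((a, b) => a + b, 0);
const totalGB = totalMB / 1000;   // ≈ 1.07 GB
const reduction = 3.1 / totalGB;  // ≈ 2.9x smaller than the 3.1 GB FP32 export
```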
```
Text Input
    ↓
[Tokenizer] → Token IDs (vocab: 151936)
    ↓
[TTS LLM] → Hidden States (896-dim)
    ↓
[Diffusion Head] → Denoising (20 steps, cosine schedule)
    ↓
Acoustic Latents (64-dim)
    ↓
[Vocoder]
    ↓
Audio Waveform (24 kHz)
```
```javascript
import * as ort from 'onnxruntime-web';

// Load the three quantized models
const llm = await ort.InferenceSession.create('tts_llm_int4.onnx');
const diffusion = await ort.InferenceSession.create('diffusion_head_int4.onnx');
const vocoder = await ort.InferenceSession.create('vocoder_int4.onnx');

// Run the LLM: token IDs → 896-dim hidden states
const inputIds = new ort.Tensor('int64', tokenizedText, [1, seqLen]);
const attentionMask = new ort.Tensor('int64', ones, [1, seqLen]);
const positionIds = new ort.Tensor('int64', positions, [1, seqLen]);
const { hidden_states } = await llm.run({
  input_ids: inputIds,
  attention_mask: attentionMask,
  position_ids: positionIds
});

// Run the diffusion head: t = 999, 949, …, 49 gives the 20 denoising steps
let latent = noise;
for (let t = 999; t >= 0; t -= 50) {
  const { v_prediction } = await diffusion.run({
    noisy_latent: latent,
    timestep: new ort.Tensor('float32', [t], [1]),
    hidden_states: hidden_states
  });
  latent = denoise(latent, v_prediction, t);
}

// Run the vocoder: 64-dim acoustic latents → 24 kHz waveform
const { audio } = await vocoder.run({ latents: latent });
```
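The `denoise()` helper in the loop above is not defined in the snippet. A minimal sketch, assuming a deterministic DDIM-style update for the v-prediction parameterization with the cosine schedule from the config (the schedule constants `T` and `S` and the function names here are illustrative, not part of the released files):

```javascript
// Cosine noise schedule: alphaBar(t) = cos(((t/T + s)/(1 + s)) * pi/2)^2
const T = 1000, S = 0.008;
const alphaBar = (t) => {
  const f = Math.cos(((t / T + S) / (1 + S)) * Math.PI / 2);
  return f * f;
};

// One deterministic denoising step from timestep t to t - stride.
// `latent` and `vPred` are Float32Arrays of the same length.
function denoise(latent, vPred, t, stride = 50) {
  const ab = alphaBar(t);
  const abPrev = t - stride >= 0 ? alphaBar(t - stride) : 1.0;
  const a = Math.sqrt(ab), b = Math.sqrt(1 - ab);
  const out = new Float32Array(latent.length);
  for (let i = 0; i < latent.length; i++) {
    // Recover x0 and eps from the v-prediction parameterization.
    const x0 = a * latent[i] - b * vPred[i];
    const eps = a * vPred[i] + b * latent[i];
    // Deterministic (eta = 0) DDIM step to the previous timestep.
    out[i] = Math.sqrt(abPrev) * x0 + Math.sqrt(1 - abPrev) * eps;
  }
  return out;
}
```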
```json
{
  "audio": {
    "sample_rate": 24000,
    "vae_dim": 64
  },
  "llm": {
    "hidden_size": 896,
    "num_hidden_layers": 20,
    "vocab_size": 151936
  },
  "diffusion": {
    "num_inference_steps": 20,
    "beta_schedule": "cosine",
    "prediction_type": "v_prediction"
  }
}
```
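The diffusion loop starts from Gaussian noise shaped by `vae_dim` above. A sketch using the Box-Muller transform to fill the initial latent (`numFrames`, the number of acoustic frames, is a hypothetical parameter not specified by this config):

```javascript
// Seed the diffusion loop with standard-normal noise of shape
// [1, numFrames, vaeDim], where vaeDim = 64 comes from the config above.
function initNoise(numFrames, vaeDim = 64) {
  const data = new Float32Array(numFrames * vaeDim);
  for (let i = 0; i < data.length; i += 2) {
    // Box-Muller transform: two uniforms -> two standard normals.
    const u = Math.random() || Number.MIN_VALUE; // avoid log(0)
    const v = Math.random();
    const r = Math.sqrt(-2 * Math.log(u));
    data[i] = r * Math.cos(2 * Math.PI * v);
    if (i + 1 < data.length) data[i + 1] = r * Math.sin(2 * Math.PI * v);
  }
  return { data, dims: [1, numFrames, vaeDim] };
}
```

The returned `{ data, dims }` pair maps directly onto an `ort.Tensor('float32', data, dims)` for the `noisy_latent` input.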