Streaming Vocos: Neural vocoder for fast streaming applications
Streaming Vocos is a streaming-friendly replication of the original Vocos neural vocoder, modified for causal inference. Unlike typical GAN vocoders that generate waveform samples directly in the time domain, Vocos predicts spectral coefficients and reconstructs the waveform with a fast inverse Fourier transform, which makes it well suited to low-latency and real-time settings.
This implementation replaces the vanilla CNN blocks with causal CNNs (sketched below) and provides a streaming interface with a dynamically adjustable chunk size (in multiples of the hop size).
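For intuition, here is a minimal sketch (not the repo's actual module) of what "causal" means for the convolutions: padding is applied on the left only, so each output frame depends exclusively on current and past frames.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """Illustrative causal 1D convolution: left-only padding, no lookahead."""
    def __init__(self, channels, kernel_size, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # amount of left padding
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                        # x: (B, C, T)
        x = nn.functional.pad(x, (self.pad, 0))  # pad past frames only
        return self.conv(x)

# Output length equals input length, and frame t never sees frames > t.
y = CausalConv1d(8, kernel_size=3)(torch.randn(1, 8, 100))
```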
- Input: 50 Hz log-mel spectrogram
- window = 1024, hop = 320
- Output: 16 kHz waveform audio
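The ISTFT-based reconstruction is what keeps synthesis cheap. The following is a hedged sketch of that final step using the window and hop sizes listed above; the tensor names and shapes are illustrative placeholders, not the repo's actual prediction head.

```python
import torch

n_fft, hop = 1024, 320
frames = 162                                   # e.g. ~3.24 s of audio at 50 mel frames/s
mag = torch.rand(1, n_fft // 2 + 1, frames)    # predicted magnitudes (placeholder values)
phase = torch.rand(1, n_fft // 2 + 1, frames) * 2 * torch.pi  # predicted phases (placeholder)

# One inverse STFT reconstructs the whole waveform; no autoregressive
# sample-by-sample generation is involved.
spec = mag * torch.exp(1j * phase)
wav = torch.istft(spec, n_fft=n_fft, hop_length=hop,
                  window=torch.hann_window(n_fft))
print(wav.shape)  # roughly frames * hop samples at 16 kHz
```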
Training follows the GAN objective as in the original Vocos, while adopting loss functions inspired by Descript’s audio codec.
Original Vocos resources:
- Audio samples: https://gemelo-ai.github.io/vocos/
- Paper: https://arxiv.org/abs/2306.00814
⚡ Streaming Latency & Real-Time Performance
We benchmark Streaming Vocos in streaming inference mode using chunked mel-spectrogram decoding on both CPU and GPU.
Benchmark setup
- Audio duration: 3.24 s
- Sample rate: 16 kHz
- Mel hop size: 320 samples (20 ms per mel frame)
- Chunk size: 5 mel frames (100 ms buffering latency)
- Runs: 100 warm-up + 1000 timed runs
- Inference mode: Streaming (stateful causal decoding)
Metrics
- Processing time per chunk
- End-to-end latency = chunk buffering + processing time
- RTF (Real-Time Factor) = processing time / audio duration
Results
Streaming performance (chunk size = 5 frames, 100 ms buffer)
| Device | Avg proc / chunk | First-chunk proc | End-to-end latency | Total proc (3.2 s audio) | RTF |
|---|---|---|---|---|---|
| CPU | 14.0 ms | 14.0 ms | 114.0 ms | 464 ms | 0.14 |
| GPU (CUDA) | 3.4 ms | 3.3 ms | 103.3 ms | 113 ms | 0.035 |
End-to-end latency includes the 100 ms chunk buffering delay required for streaming inference.
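As a quick sanity check, the reported figures follow directly from the metric definitions above; here they are reproduced for the CPU row of the table.

```python
# Values copied from the benchmark setup and the CPU row of the table above.
audio_duration_s = 3.24
buffer_s = 5 * 0.020          # 5 mel frames x 20 ms per frame = 100 ms buffering
cpu_chunk_proc_s = 0.014      # average processing time per chunk
cpu_total_proc_s = 0.464      # total processing time for the whole clip

end_to_end_s = buffer_s + cpu_chunk_proc_s   # 0.114 s, as reported
rtf = cpu_total_proc_s / audio_duration_s    # ~0.14, i.e. ~7x faster than real time
print(end_to_end_s, rtf)
```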
Interpretation
- Real-time capable on CPU: Streaming Vocos achieves an RTF of approximately 0.14, i.e. inference runs about 7× faster than real time.
- Ultra-low compute overhead on GPU: chunk processing time drops to ~3.4 ms, so overall latency is dominated by buffering rather than computation.
- Streaming-friendly first-chunk behavior: first-chunk latency closely matches steady-state latency, indicating no cold-start penalty during streaming inference.
- Latency–quality tradeoff: smaller chunk sizes further reduce buffering latency (e.g., 1–2 frames → <40 ms) at the cost of slightly increased computational overhead.
With a chunk size of 1 frame (20 ms buffering), GPU end-to-end latency drops below 25 ms, making Streaming Vocos suitable for interactive and conversational TTS pipelines.
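To make the tradeoff concrete, here is a back-of-the-envelope sketch relating chunk size to latency. The per-chunk GPU processing time is the value measured above for a chunk size of 5 and is only assumed to stay similar for smaller chunks.

```python
frame_ms = 20          # one mel frame = hop 320 / 16 kHz = 20 ms
gpu_proc_ms = 3.4      # measured at chunk_size = 5; assumed roughly constant (illustrative)

for chunk_size in (1, 2, 5):
    buffer_ms = chunk_size * frame_ms
    print(chunk_size, buffer_ms, buffer_ms + gpu_proc_ms)
# chunk_size = 1 -> 20 ms buffering, ~23 ms end-to-end on GPU (below 25 ms, as noted above)
```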
Checkpoints
This repo provides a PyTorch Lightning checkpoint:
epoch=3.ckpt
You can download it from the “Files” tab, or directly via hf_hub_download (example below).
Quickstart (inference)
Install
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121  # or cpu wheels
pip install lightning librosa scipy matplotlib huggingface_hub
```
Clone the GitHub repo
```bash
git clone https://github.com/warisqr007/vocos.git
cd vocos
```
Run inference (offline)
```python
import torch
import librosa
from huggingface_hub import hf_hub_download

from src.modules import VocosVocoderModule  # from the GitHub codebase

# Download the checkpoint from the Hugging Face Hub
ckpt_path = hf_hub_download(
    repo_id="warisqr007/StreamingVocos",
    filename="epoch=3.ckpt",
)

model = VocosVocoderModule.load_from_checkpoint(ckpt_path, map_location="cpu")
model.eval()

# Load 16 kHz mono audio and compute log-mel features with the module's feature extractor
wav_path = "your_input.wav"
audio, _ = librosa.load(wav_path, sr=16000, mono=True)
audio_t = torch.from_numpy(audio).unsqueeze(0).unsqueeze(0)  # (B=1, 1, T)
mel = model.feature_extractor(audio_t)

# Reconstruct the waveform offline (full mel spectrogram at once)
with torch.no_grad():
    y = model(mel).squeeze().cpu().numpy()  # reconstructed waveform @ 16 kHz
```
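If you want to write the result to disk, one option is to use the scipy dependency from the install step above:

```python
import numpy as np
from scipy.io import wavfile

# Save the reconstructed waveform as a 16 kHz float32 WAV file.
wavfile.write("reconstructed.wav", 16000, y.astype(np.float32))
```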
Streaming inference (chunked mel)
```python
import torch

chunk_size = 1  # mel frames per chunk (adjust as desired)

# Put both decoder stages into streaming mode and decode the mel chunk by chunk
with torch.no_grad(), model.decoder[0].streaming(chunk_size), model.decoder[1].streaming(chunk_size):
    y_chunks = []
    for mel_chunk in mel.split(chunk_size, dim=2):
        y_chunks.append(model(mel_chunk))

y_stream = torch.cat(y_chunks, dim=2).squeeze().cpu().numpy()
```
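Since the decoder is causal, the chunked output should closely track the offline output from the previous snippet. A quick optional check, assuming both `y` and `y_stream` were produced from the same mel features:

```python
import numpy as np

# Small numerical differences are expected; large ones would suggest a streaming-state issue.
n = min(len(y), len(y_stream))
print("max abs diff:", np.max(np.abs(y[:n] - y_stream[:n])))
```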
Space demo
A Gradio demo Space is provided here