Fish Audio S2 Pro Slow AR GPTQ W4A16

Fish Audio S2 Pro offers fine-grained control, multi-speaker multi-turn generation, low-latency streaming, and long-context inference.

Fish Audio S2 Pro Slow AR GPTQ W4A16 is a GPTQ-quantized release of Fish Audio S2 Pro for text-to-speech (TTS). It keeps the original Dual-AR structure and codec assets, while quantizing the Slow AR 4B backbone for lower VRAM and smaller model size.

Quantization Details

This repository contains a W4A16 GPTQ quantization of the Slow AR 4B part of S2 Pro:

  • Quantized to 4-bit weights / 16-bit activations:
    • text_model.model.layers.*.attention.wqkv
    • text_model.model.layers.*.attention.wo
    • text_model.model.layers.*.feed_forward.w1
    • text_model.model.layers.*.feed_forward.w3
    • Most text_model.model.layers.*.feed_forward.w2
  • Kept in bf16:
    • text_model.model.embeddings
    • the full audio_decoder branch (Fast AR)
    • text_model.model.layers.{1,6,16,32,33,34,35}.feed_forward.w2
    • text_model.model.layers.{34,35}.attention.wo
    • codec.pth

Quantization settings used for this release:

  • GPTQ
  • bits=4
  • group_size=32
  • desc_act=true
  • sym=true

This means the model is not a full-model W4A16 dump. The quantized part is the Slow AR transformer backbone, while the Fast AR audio decoder and codec remain in higher precision for better TTS quality and stability.
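The layer selection above can be expressed as a simple predicate over module names. The sketch below is illustrative only (the helper name and regex are my own, not part of the release tooling); the layer-index exception sets come directly from the lists above:

```python
import re

# feed_forward.w2 layers kept in bf16 (from the release notes above).
BF16_W2_LAYERS = {1, 6, 16, 32, 33, 34, 35}
# attention.wo layers kept in bf16.
BF16_WO_LAYERS = {34, 35}

def is_quantized(module_name: str) -> bool:
    """Return True if a module is W4A16-quantized in this release."""
    m = re.fullmatch(
        r"text_model\.model\.layers\.(\d+)\.(attention|feed_forward)\.(\w+)",
        module_name,
    )
    if not m:
        # Embeddings, the audio_decoder (Fast AR) branch, and the codec
        # all stay in bf16.
        return False
    layer, block, proj = int(m.group(1)), m.group(2), m.group(3)
    if block == "attention" and proj == "wqkv":
        return True
    if block == "attention" and proj == "wo":
        return layer not in BF16_WO_LAYERS
    if block == "feed_forward" and proj in ("w1", "w3"):
        return True
    if block == "feed_forward" and proj == "w2":
        return layer not in BF16_W2_LAYERS
    return False
```

For example, `is_quantized("text_model.model.layers.0.attention.wqkv")` is True, while `is_quantized("text_model.model.layers.34.attention.wo")` is False.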

Architecture

S2 Pro builds on a decoder-only transformer combined with an RVQ-based audio codec (10 codebooks, ~21 Hz frame rate) using a Dual-Autoregressive (Dual-AR) architecture:

  • Slow AR (4B parameters): Operates along the time axis and predicts the primary semantic codebook.
  • Fast AR (400M parameters): Generates the remaining 9 residual codebooks at each time step, reconstructing fine-grained acoustic detail.

This asymmetric design keeps inference efficient while preserving audio fidelity. Because the Dual-AR architecture is structurally isomorphic to standard autoregressive LLMs, it inherits all LLM-native serving optimizations from SGLang — including continuous batching, paged KV cache, CUDA graph replay, and RadixAttention-based prefix caching.
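The Dual-AR loop described above can be sketched in Python. The model calls are random stand-ins, not the real networks; only the codebook count and frame rate come from the description above:

```python
import random

NUM_CODEBOOKS = 10   # RVQ codebooks per frame
FRAME_RATE_HZ = 21   # ~21 token frames per second of audio

def slow_ar_step(history):
    """Stand-in for the 4B Slow AR backbone: predicts the
    primary semantic codebook token for the next frame."""
    return random.randrange(1024)

def fast_ar_step(semantic_token, k):
    """Stand-in for the 400M Fast AR decoder: predicts the
    k-th residual codebook token within the current frame."""
    return random.randrange(1024)

def generate_frames(num_frames):
    frames = []
    for _ in range(num_frames):
        # Slow AR advances along the time axis, one step per frame.
        semantic = slow_ar_step(frames)
        # Fast AR fills in the 9 residual codebooks for that frame.
        residuals = [fast_ar_step(semantic, k) for k in range(1, NUM_CODEBOOKS)]
        frames.append([semantic] + residuals)
    return frames

# One second of audio is ~21 frames x 10 codebooks = ~210 acoustic tokens.
frames = generate_frames(FRAME_RATE_HZ)
```

The asymmetry is visible in the loop structure: the large Slow AR model runs once per frame, while the small Fast AR model runs nine times per frame over the much cheaper intra-frame dimension.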

Fine-Grained Inline Control

S2 Pro enables localized control over speech generation by embedding natural-language instructions directly within the text using [tag] syntax. Rather than relying on a fixed set of predefined tags, S2 Pro accepts free-form textual descriptions — such as [whisper in small voice], [professional broadcast tone], or [pitch up] — allowing open-ended expression control at the word level.

Common tags (15,000+ unique tags supported):

[pause] [emphasis] [laughing] [inhale] [chuckle] [tsk] [singing] [excited] [laughing tone] [interrupting] [chuckling] [excited tone] [volume up] [echo] [angry] [low volume] [sigh] [low voice] [whisper] [screaming] [shouting] [loud] [surprised] [short pause] [exhale] [delight] [panting] [audience laughter] [with strong accent] [volume down] [clearing throat] [sad] [moaning] [shocked]
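Because tags are free-form bracketed text, separating them from the spoken content is straightforward. A minimal sketch (the regex is an assumption about the [tag] syntax, not the official parser):

```python
import re

def split_tags(text: str):
    """Separate [tag] directives from the spoken text."""
    tags = re.findall(r"\[([^\[\]]+)\]", text)
    spoken = re.sub(r"\[[^\[\]]+\]", "", text)
    spoken = re.sub(r"\s{2,}", " ", spoken).strip()
    return tags, spoken

tags, spoken = split_tags(
    "[whisper in small voice] Don't wake him. [short pause] Let's go."
)
# tags   -> ['whisper in small voice', 'short pause']
# spoken -> "Don't wake him. Let's go."
```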

Supported Languages

S2 Pro supports 80+ languages.

Tier 1: Japanese (ja), English (en), Chinese (zh)

Tier 2: Korean (ko), Spanish (es), Portuguese (pt), Arabic (ar), Russian (ru), French (fr), German (de)

Other supported languages: sv, it, tr, no, nl, cy, eu, ca, da, gl, ta, hu, fi, pl, et, hi, la, ur, th, vi, jw, bn, yo, sl, cs, sw, nn, he, ms, uk, id, kk, bg, lv, my, tl, sk, ne, fa, af, el, bo, hr, ro, sn, mi, yi, am, be, km, is, az, sd, br, sq, ps, mn, ht, ml, sr, sa, te, ka, bs, pa, lt, kn, si, hy, mr, as, gu, fo, and more.

Production Streaming Performance

On a single NVIDIA H200 GPU:

  • Real-Time Factor (RTF): 0.195
  • Time-to-first-audio: ~100 ms
  • Throughput: 3,000+ acoustic tokens/s while maintaining RTF below 0.5
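RTF is generation time divided by audio duration, so these figures can be sanity-checked with basic arithmetic. A sketch; the ~210 tokens per audio-second follows from the ~21 Hz, 10-codebook codec described above, and the reading of the 3,000+ tokens/s figure as batched throughput is my interpretation:

```python
FRAME_RATE_HZ = 21
NUM_CODEBOOKS = 10
# Each second of audio needs ~21 frames x 10 codebooks = 210 acoustic tokens.
tokens_per_audio_second = FRAME_RATE_HZ * NUM_CODEBOOKS

def rtf(generation_seconds: float, audio_seconds: float) -> float:
    """Real-Time Factor: below 1.0 means faster than real time."""
    return generation_seconds / audio_seconds

# At RTF 0.195, one second of audio takes ~0.195 s to generate, so a
# single stream emits ~210 / 0.195 = ~1077 tokens/s; the 3,000+ tokens/s
# figure would then reflect aggregate throughput across batched streams.
single_stream_tokens_per_sec = tokens_per_audio_second / 0.195
```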

License

This model is licensed under the Fish Audio Research License. Research and non-commercial use are permitted free of charge. Commercial use requires a separate license from Fish Audio; contact business@fish.audio.

Model size: 5B params (safetensors; tensor types BF16 and I32)