# Architecture: Streaming Speech Translation Pipeline
## Overview
End-to-end streaming speech translation pipeline:
```
Audio Input → ASR → NMT → TTS → Audio Output
  (PCM16)   (ONNX) (GGUF) (ONNX)   (PCM16)
```
Translates spoken English into spoken Russian in real time, streaming output as it is produced.
## Pipeline Stages
### 1. ASR — Cache-Aware Streaming ASR (NeMo Conformer RNN-T)
**Model**: NVIDIA NeMo Conformer RNN-T, exported to ONNX.
**Architecture**: 3 internal threads connected by `queue.Queue`:
```
Audio → [Preprocess Thread] → [Encoder Thread] → [Decoder Thread] → text deltas
```
- **Preprocess**: Buffers raw audio, extracts mel-spectrogram features, and chunks them per `CacheAwareStreamingConfig` (chunk_size=[49,56], shift_size=[49,56])
- **Encoder**: Runs ONNX encoder inference on feature chunks, maintaining the encoder cache state
- **Decoder**: Runs the RNN-T decoder and joint network, producing incremental text tokens
**Key classes**: `CacheAwareStreamingAudioBuffer`, `CacheAwareStreamingASR`, `ASRModelPackage`
**Wrapper**: `StreamingASR` — exposes `push_audio_chunk()` / `get_transcript_chunk()` API
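The three-stage thread layout can be sketched with plain `queue.Queue` plumbing. The stage functions below (`preprocess`, `encode`, `decode`) are trivial stand-ins for the real mel-extraction, ONNX encoder, and RNN-T decoder steps:

```python
import queue
import threading

SENTINEL = None  # marks end of stream

def stage(fn, in_q, out_q):
    """Pull items from in_q, transform them, and push downstream until SENTINEL."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            out_q.put(SENTINEL)  # propagate shutdown to the next stage
            return
        out_q.put(fn(item))

# Stand-ins for the real preprocess / encoder / decoder work.
def preprocess(audio): return f"feat({audio})"
def encode(feat): return f"enc({feat})"
def decode(enc): return f"tok({enc})"

q0, q1, q2, q3 = (queue.Queue() for _ in range(4))
threads = [
    threading.Thread(target=stage, args=(f, i, o), daemon=True)
    for f, i, o in [(preprocess, q0, q1), (encode, q1, q2), (decode, q2, q3)]
]
for t in threads:
    t.start()

for chunk in ["a0", "a1"]:   # feed two audio chunks
    q0.put(chunk)
q0.put(SENTINEL)

deltas = []                  # collect incremental text output
while (item := q3.get()) is not SENTINEL:
    deltas.append(item)
# deltas == ["tok(enc(feat(a0)))", "tok(enc(feat(a1)))"]
```

Each stage blocks only on its own input queue, so slow encoder inference never stalls audio buffering.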
### 2. NMT — Streaming Segmented Translation (TranslateGemma)
**Model**: TranslateGemma 4B (GGUF, Q8_0) via llama-cpp-python.
**Architecture**: Single-threaded, three internal components:
```
text deltas → [Segmenter] → text segments → [Translator] → raw translations → [Merger] → display text
```
- **StreamingSegmenter**: Batches ASR tokens into word-groups (max 5 words + 2 hold-back). Triggers on punctuation, pause (>700ms), or max-token boundaries (min 3 words)
- **StreamingTranslator**: Multi-turn translation using init/continuation prompt templates with KV cache warming
- **StreamingTranslationMerger**: Handles revision/append/continuation logic for incremental translations. Detects trailing ellipsis (incomplete), leading ellipsis (continuation), and word-level LCP revision
**Wrapper**: `StreamingNMT` — exposes `push_text_chunk()` / `flush()` / `check_pause()` API
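The segmentation triggers described above can be illustrated with a toy segmenter. Class and method names here are invented for the sketch; only the thresholds (5-word groups, 2-word hold-back, 3-word minimum, 700 ms pause) come from the text:

```python
import time

class ToySegmenter:
    """Illustrative word-group segmenter, not the project's StreamingSegmenter."""
    MAX_WORDS = 5   # emit once this many words accumulate...
    HOLD_BACK = 2   # ...but keep this many trailing words for the next group
    MIN_WORDS = 3   # pause-triggered flush needs at least this many words
    PAUSE_S = 0.7   # 700 ms silence threshold

    def __init__(self):
        self.words = []
        self.last_push = time.monotonic()

    def push(self, word):
        """Add one ASR word; return a finished segment or None."""
        self.words.append(word)
        self.last_push = time.monotonic()
        if word[-1] in ".!?":                      # punctuation: flush all
            seg, self.words = self.words, []
            return " ".join(seg)
        if len(self.words) >= self.MAX_WORDS + self.HOLD_BACK:
            seg = self.words[:self.MAX_WORDS]      # hold back the tail
            self.words = self.words[self.MAX_WORDS:]
            return " ".join(seg)
        return None

    def check_pause(self):
        """Flush on a long silence, if enough words have accumulated."""
        if (len(self.words) >= self.MIN_WORDS
                and time.monotonic() - self.last_push > self.PAUSE_S):
            seg, self.words = self.words, []
            return " ".join(seg)
        return None

seg = ToySegmenter()
out = [s for w in "the quick brown fox jumps over the lazy dog.".split()
       if (s := seg.push(w))]
# out == ["the quick brown fox jumps", "over the lazy dog."]
```

The hold-back keeps a couple of words in the buffer so a group boundary never splits a phrase the next segment still needs.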
### 3. TTS — Streaming XTTS v2 (ONNX)
**Model**: XTTSv2 with ONNX-exported GPT-2 AR model + HiFi-GAN vocoder.
**Architecture**: Sequential within a single call:
```
text → [BPE Tokenizer] → [GPT-2 AR Loop] → mel latents → [HiFi-GAN Vocoder] → audio chunks
```
- **Speaker conditioning**: Computed once from the reference audio → `gpt_cond_latent` [1,32,1024] + `speaker_embedding` [1,512,1]
- **AR generation**: GPT-2 autoregressive loop producing audio token latents. Accumulates `stream_chunk_size` (default 20) tokens before running vocoder
- **Vocoder**: HiFi-GAN converts accumulated latents to waveform
- **Crossfade stitching**: Linear fade-in/fade-out between consecutive vocoder output chunks for seamless playback
**Output**: 24kHz float32 audio chunks
**Wrapper**: `StreamingTTS` — exposes `synthesize_stream()` generator API
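Crossfade stitching can be sketched as follows. This is an illustrative NumPy version, not the project's code; the overlap length is an assumed parameter:

```python
import numpy as np

def crossfade_concat(chunks, overlap):
    """Stitch consecutive vocoder chunks with a linear crossfade
    over `overlap` samples at each chunk boundary."""
    fade_in = np.linspace(0.0, 1.0, overlap, dtype=np.float32)
    fade_out = 1.0 - fade_in
    out = chunks[0].astype(np.float32)
    for chunk in chunks[1:]:
        chunk = chunk.astype(np.float32)
        # Mix the tail of what we have with the head of the new chunk.
        mixed = out[-overlap:] * fade_out + chunk[:overlap] * fade_in
        out = np.concatenate([out[:-overlap], mixed, chunk[overlap:]])
    return out

# Two constant chunks crossfade into one seamless constant signal.
a = np.ones(100, dtype=np.float32)
b = np.ones(100, dtype=np.float32)
stitched = crossfade_concat([a, b], overlap=20)
# stitched.shape == (180,), all values 1.0 (no click at the seam)
```

The complementary fade weights sum to 1.0 at every sample, so a continuous waveform passes through the boundary unchanged.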
## Concurrency Model
```
┌──────────────────────────────────────────────────────────────────┐
│                        asyncio Event Loop                        │
│                                                                  │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────────┐      │
│  │ WebSocket I/O│   │  asr_to_nmt  │   │  tts_synthesis   │      │
│  │   handler    │   │     loop     │   │      loop        │      │
│  └──────┬───────┘   └──────┬───────┘   └────────┬─────────┘      │
│         │                  │                    │                │
│         │ run_in_executor()│ run_in_executor()  │ run_in_ex…     │
│         ▼                  ▼                    ▼                │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │               ThreadPoolExecutor (4 workers)               │  │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐  │  │
│  │  │ ASR internal │  │ NMT blocking │  │  TTS blocking    │  │  │
│  │  │ threads (3)  │  │  llama-cpp   │  │  ONNX inference  │  │  │
│  │  └──────────────┘  └──────────────┘  └──────────────────┘  │  │
│  └────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────┘
```
- **asyncio loop**: Handles WebSocket I/O and coordinates pipeline stages
- **ASR threads**: 3 dedicated daemon threads (preprocess, encoder, decoder)
- **NMT**: Blocking llama-cpp inference bridged via `run_in_executor()`
- **TTS**: Blocking ONNX inference bridged via `run_in_executor()`
- **Concurrency**: All per-session state, including the NMT turn context (`prev_translation`, `prev_query`), lives in per-session wrapper objects. The shared model objects (`ASRModelPackage`, `StreamingTranslator`, `StreamingTTSPipeline`) hold no mutable per-session state after initialization, so concurrent sessions are safe.
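The `run_in_executor()` bridge boils down to a few lines: blocking inference runs on worker threads while the event loop stays free for WebSocket I/O. Here `blocking_inference` is a stand-in for a llama-cpp or ONNX call:

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_inference(text):
    """Stand-in for a blocking llama-cpp / ONNX inference call."""
    time.sleep(0.05)  # simulate CPU-bound work
    return text.upper()

async def main():
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Both calls run concurrently on worker threads; awaiting them
        # yields control back to the event loop instead of blocking it.
        results = await asyncio.gather(
            loop.run_in_executor(pool, blocking_inference, "hello"),
            loop.run_in_executor(pool, blocking_inference, "world"),
        )
    return results

results = asyncio.run(main())
# results == ["HELLO", "WORLD"]
```

`asyncio.gather` preserves argument order, so results line up with the submitted calls regardless of which thread finishes first.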
## Data Flow & Queues
```
WebSocket binary → push_audio()
        │
        ▼
 ┌──────────────┐
 │ audio_queue  │  (maxsize=256, queue.Queue)
 └──────┬───────┘
        ▼
 [ASR Internal Threads]
        │
        ▼
 ┌──────────────┐
 │ output_queue │  (maxsize=64, queue.Queue)
 └──────┬───────┘
        ▼
 [asr_to_nmt_loop via executor]
        │
   ┌────┴────────────┐
   ▼                 ▼
transcript_queue  tts_text_queue  (maxsize=16, asyncio.Queue)
 (→ WebSocket)       │
                     ▼
        [tts_synthesis_loop via executor]
                     │
                     ▼
            ┌─────────────────┐
            │ audio_out_queue │  (maxsize=32, asyncio.Queue)
            └────────┬────────┘
                     ▼
              WebSocket binary
```
## Backpressure Strategy
- **Audio input**: `put_nowait` with drop-on-full (dropping occasional frames is preferable to accumulating latency)
- **ASR→NMT**: `put_nowait` with drop warning on encoder/decoder output queues
- **NMT→TTS**: `put_nowait` with drop warning (translations can be reconstructed from next segment)
- **TTS→Output**: `put_nowait` with drop warning per audio chunk
- All queue sizes are configurable via `PipelineConfig`
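The drop-on-full policy is a few lines around `put_nowait`; a minimal sketch with an assumed helper name:

```python
import queue

def put_or_drop(q, item):
    """Non-blocking enqueue: drop the item when the queue is full,
    trading occasional loss for bounded latency."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False  # caller may log a drop warning here

audio_q = queue.Queue(maxsize=2)
accepted = [put_or_drop(audio_q, f) for f in ("f0", "f1", "f2")]
# accepted == [True, True, False] -- the third frame is dropped
```

Because producers never block, a stalled consumer can only cost data, never back the whole pipeline up behind it.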
## Model Loading & Session Lifecycle
**Startup**: Models are loaded once, in `TranslationServer._load_models()`:
- ASR ONNX sessions (`ASRModelPackage`)
- NMT GGUF model (`StreamingTranslator`) + KV cache warmup
- TTS ONNX sessions (`StreamingTTSPipeline`)
**Per-session**: Each WebSocket connection creates:
- `StreamingASR` — own audio buffers, streaming state, thread pool
- `StreamingNMT` — own segmenter, merger, and translation context (`prev_translation`, `prev_query`); shares model weights only
- `StreamingTTS` — own speaker conditioning; shares ONNX sessions
**Cleanup**: On disconnect, orchestrator flushes remaining NMT text through TTS, then stops all threads and resets state.
## WebSocket Protocol
| Direction | Type | Format | Description |
|-----------|---------|--------|-------------|
| Client→ | Binary | PCM16 | Raw audio at declared sample rate |
| Client→ | Text | JSON | `{"action": "start", "sample_rate": 16000}` |
| Client→ | Text | JSON | `{"action": "stop"}` |
| →Client | Binary | PCM16 | Synthesized audio at 24kHz |
| →Client | Text | JSON | `{"type": "transcript", "text": "..."}` |
| →Client | Text | JSON | `{"type": "translation", "text": "..."}` |
| →Client | Text | JSON | `{"type": "status", "status": "started"}` |
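Building and validating these text frames needs only the `json` module. The helper names below are illustrative; the message shapes follow the table:

```python
import json

def start_message(sample_rate=16000):
    """Control message the client sends before streaming audio."""
    return json.dumps({"action": "start", "sample_rate": sample_rate})

def stop_message():
    return json.dumps({"action": "stop"})

def parse_server_message(raw):
    """Classify a text frame from the server by its `type` field."""
    msg = json.loads(raw)
    if msg.get("type") not in ("transcript", "translation", "status"):
        raise ValueError(f"unknown message type: {msg.get('type')}")
    return msg

msg = parse_server_message('{"type": "transcript", "text": "hello"}')
# msg == {"type": "transcript", "text": "hello"}
```

Binary frames carry PCM16 audio directly and need no JSON envelope, which keeps the hot audio path allocation-light.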
## Configuration
All tunables are in `PipelineConfig` (dataclass) and exposed as CLI args:
| Parameter | Default | Description |
|-----------|---------|-------------|
| `asr_chunk_duration_ms` | 10 | Audio chunk duration for ASR |
| `nmt_n_threads` | 4 | CPU threads for llama-cpp |
| `tts_stream_chunk_size` | 20 | AR tokens per vocoder chunk |
| `audio_queue_maxsize` | 256 | Audio input queue bound |
| `tts_queue_maxsize` | 16 | NMT→TTS text queue bound |
| `audio_out_queue_maxsize` | 32 | TTS→output audio queue bound |
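One way such a dataclass-plus-CLI arrangement can look, sketched with the table's fields (the real `PipelineConfig` likely has more tunables and may wire argparse differently):

```python
import argparse
from dataclasses import dataclass, fields

@dataclass
class PipelineConfig:
    """Illustrative subset of the tunables from the table above."""
    asr_chunk_duration_ms: int = 10
    nmt_n_threads: int = 4
    tts_stream_chunk_size: int = 20
    audio_queue_maxsize: int = 256
    tts_queue_maxsize: int = 16
    audio_out_queue_maxsize: int = 32

def parse_config(argv=None):
    """Derive one CLI flag per dataclass field, then build the config."""
    parser = argparse.ArgumentParser()
    for f in fields(PipelineConfig):
        parser.add_argument(f"--{f.name.replace('_', '-')}",
                            type=f.type, default=f.default)
    args = parser.parse_args(argv)
    return PipelineConfig(**vars(args))

cfg = parse_config(["--nmt-n-threads", "8"])
# cfg.nmt_n_threads == 8; every other field keeps its default
```

Generating flags from `fields()` keeps the dataclass the single source of truth, so adding a tunable automatically adds its CLI argument.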
## File Structure
```
src/
├── asr/
│   ├── streaming_asr.py                 # StreamingASR wrapper
│   ├── pipeline.py                      # ThreadedSpeechTranslator (reference)
│   ├── cache_aware_modules.py           # Audio buffer + streaming ASR
│   ├── modules.py                       # ONNX model loading
│   └── utils.py                         # Audio utilities
├── nmt/
│   ├── streaming_nmt.py                 # StreamingNMT wrapper
│   ├── streaming_segmenter.py           # Word-group segmentation
│   ├── streaming_translation_merger.py  # Translation merging
│   └── translator_module.py             # TranslateGemma via llama-cpp
├── tts/
│   ├── streaming_tts.py                 # StreamingTTS wrapper
│   ├── xtts_streaming_pipeline.py       # Full TTS pipeline
│   ├── xtts_onnx_orchestrator.py        # GPT-2 AR + vocoder
│   ├── xtts_tokenizer.py                # BPE tokenizer
│   └── zh_num2words.py                  # Chinese text normalization
├── pipeline/
│   ├── orchestrator.py                  # PipelineOrchestrator
│   └── config.py                        # PipelineConfig
└── server/
    └── websocket_server.py              # WebSocket server
```