# Architecture: Streaming Speech Translation Pipeline

## Overview

End-to-end streaming speech translation pipeline:

```
Audio Input → ASR  → NMT  → TTS  → Audio Output
  (PCM16)   (ONNX) (GGUF) (ONNX)    (PCM16)
```

Translates spoken English into spoken Russian in real time with streaming output.

## Pipeline Stages

### 1. ASR — Cache-Aware Streaming ASR (NeMo Conformer RNN-T)

**Model**: NVIDIA NeMo Conformer RNN-T, exported to ONNX.

**Architecture**: 3 internal threads connected by `queue.Queue`:

```
Audio → [Preprocess Thread] → [Encoder Thread] → [Decoder Thread] → text deltas
```

- **Preprocess**: Buffers raw audio, extracts mel-spectrogram features, and chunks them per `CacheAwareStreamingConfig` (chunk_size=[49,56], shift_size=[49,56])
- **Encoder**: Runs ONNX encoder inference on feature chunks, maintaining encoder cache state
- **Decoder**: Runs the RNN-T decoder (joint network), producing incremental text tokens

**Key classes**: `CacheAwareStreamingAudioBuffer`, `CacheAwareStreamingASR`, `ASRModelPackage`

**Wrapper**: `StreamingASR` — exposes a `push_audio_chunk()` / `get_transcript_chunk()` API

### 2. NMT — Streaming Segmented Translation (TranslateGemma)

**Model**: TranslateGemma 4B (GGUF, Q8_0) via llama-cpp-python.

**Architecture**: Single-threaded, three internal components:

```
text deltas → [Segmenter] → text segments → [Translator] → raw translations → [Merger] → display text
```

- **StreamingSegmenter**: Batches ASR tokens into word groups (max 5 words + 2 held back). Triggers on punctuation, pause (>700 ms), or max-token boundaries (min 3 words)
- **StreamingTranslator**: Multi-turn translation using init/continuation prompt templates with KV-cache warming
- **StreamingTranslationMerger**: Handles revision/append/continuation logic for incremental translations. Detects a trailing ellipsis (incomplete), a leading ellipsis (continuation), and word-level LCP revision

**Wrapper**: `StreamingNMT` — exposes a `push_text_chunk()` / `flush()` / `check_pause()` API

### 3. TTS — Streaming XTTS v2 (ONNX)

**Model**: XTTS v2 with an ONNX-exported GPT-2 AR model + HiFi-GAN vocoder.

**Architecture**: Sequential within a single call:

```
text → [BPE Tokenizer] → [GPT-2 AR Loop] → mel latents → [HiFi-GAN Vocoder] → audio chunks
```

- **Speaker conditioning**: Computed once from reference audio → `gpt_cond_latent` [1,32,1024] + `speaker_embedding` [1,512,1]
- **AR generation**: GPT-2 autoregressive loop producing audio token latents. Accumulates `stream_chunk_size` (default 20) tokens before running the vocoder
- **Vocoder**: HiFi-GAN converts the accumulated latents to a waveform
- **Crossfade stitching**: Linear fade-in/fade-out between consecutive vocoder output chunks for seamless playback

**Output**: 24 kHz float32 audio chunks

**Wrapper**: `StreamingTTS` — exposes a `synthesize_stream()` generator API

## Concurrency Model

```
┌─────────────────────────────────────────────────────────────────┐
│                       asyncio Event Loop                        │
│                                                                 │
│  ┌───────────────┐    ┌──────────────┐    ┌──────────────────┐  │
│  │ WebSocket I/O │    │  asr_to_nmt  │    │  tts_synthesis   │  │
│  │    handler    │    │     loop     │    │       loop       │  │
│  └───────┬───────┘    └──────┬───────┘    └────────┬─────────┘  │
│          │                   │                     │            │
│          │ run_in_executor() │ run_in_executor()   │ run_in_ex… │
│          ▼                   ▼                     ▼            │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │              ThreadPoolExecutor (4 workers)               │  │
│  │  ┌──────────────┐  ┌──────────────┐  ┌─────────────────┐  │  │
│  │  │ ASR internal │  │ NMT blocking │  │ TTS blocking    │  │  │
│  │  │ threads (3)  │  │ llama-cpp    │  │ ONNX inference  │  │  │
│  │  └──────────────┘  └──────────────┘  └─────────────────┘  │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
```

- **asyncio loop**: Handles WebSocket I/O and coordinates the pipeline stages
- **ASR threads**: 3 dedicated daemon threads (preprocess, encoder, decoder)
- **NMT**: Blocking llama-cpp inference bridged via `run_in_executor()`
- **TTS**: Blocking ONNX inference bridged via `run_in_executor()`
- **Concurrency**: All per-session state (including the NMT turn context: prev_translation, prev_query) is held in per-session wrapper objects. The shared model objects (ASRModelPackage, StreamingTranslator, StreamingTTSPipeline) hold no mutable per-session state after initialization, making concurrent sessions safe.

## Data Flow & Queues

```
WebSocket binary → push_audio()
              │
              ▼
      ┌──────────────┐
      │ audio_queue  │  (maxsize=256, queue.Queue)
      └──────┬───────┘
             ▼
   [ASR Internal Threads]
             │
             ▼
      ┌──────────────┐
      │ output_queue │  (maxsize=64, queue.Queue)
      └──────┬───────┘
             ▼
 [asr_to_nmt_loop via executor]
             │
      ┌──────┴───────┐
      ▼              ▼
transcript_queue   tts_text_queue
 (→ WebSocket)     (maxsize=16, asyncio.Queue)
                     │
                     ▼
     [tts_synthesis_loop via executor]
                     │
                     ▼
             ┌───────────────┐
             │audio_out_queue│  (maxsize=32, asyncio.Queue)
             └──────┬────────┘
                    ▼
             WebSocket binary
```

## Backpressure Strategy

- **Audio input**: `put_nowait` with drop-on-full (losing frames is preferable to building up latency)
- **ASR→NMT**: `put_nowait` with a drop warning on the encoder/decoder output queues
- **NMT→TTS**: `put_nowait` with a drop warning (translations can be reconstructed from the next segment)
- **TTS→Output**: `put_nowait` with a drop warning per audio chunk
- All queue sizes are configurable via `PipelineConfig`

## Model Loading & Session Lifecycle

**Startup**: Models are loaded once in `TranslationServer._load_models()`:

- ASR ONNX sessions (`ASRModelPackage`)
- NMT GGUF model (`StreamingTranslator`) + KV-cache warmup
- TTS ONNX sessions (`StreamingTTSPipeline`)

**Per-session**: Each WebSocket connection creates:

- `StreamingASR` — own audio buffers, streaming state, and thread pool
- `StreamingNMT` — own segmenter, merger, and translation context (prev_translation, prev_query); shares model weights only
- `StreamingTTS` — own speaker conditioning; shares ONNX sessions

**Cleanup**: On disconnect, the orchestrator flushes the remaining NMT text through TTS, then stops all threads and resets state.
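The drop-on-full backpressure pattern described above can be sketched as follows. This is a minimal illustration of the `put_nowait` / drop-with-warning idea, not the project's actual code; the helper name `try_put` and the logger name are hypothetical:

```python
import logging
import queue

logger = logging.getLogger("pipeline")

def try_put(q: queue.Queue, item, label: str) -> bool:
    """Non-blocking put: drop the item (with a warning) when the queue is
    full, so a slow consumer costs bounded latency instead of unbounded lag."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        logger.warning("%s queue full; dropping item", label)
        return False

# A small bounded queue, standing in for the audio-input stage
# (which defaults to maxsize=256).
audio_queue = queue.Queue(maxsize=2)
assert try_put(audio_queue, b"\x00\x01", "audio")      # accepted
assert try_put(audio_queue, b"\x02\x03", "audio")      # accepted
assert not try_put(audio_queue, b"\x04\x05", "audio")  # full → dropped
```

The same pattern applies at every stage boundary; for the `asyncio.Queue` stages the equivalent is catching `asyncio.QueueFull`.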
## WebSocket Protocol

| Direction | Type | Format | Description |
|-----------|------|--------|-------------|
| Client→ | Binary | PCM16 | Raw audio at the declared sample rate |
| Client→ | Text | JSON | `{"action": "start", "sample_rate": 16000}` |
| Client→ | Text | JSON | `{"action": "stop"}` |
| →Client | Binary | PCM16 | Synthesized audio at 24 kHz |
| →Client | Text | JSON | `{"type": "transcript", "text": "..."}` |
| →Client | Text | JSON | `{"type": "translation", "text": "..."}` |
| →Client | Text | JSON | `{"type": "status", "status": "started"}` |

## Configuration

All tunables live in `PipelineConfig` (a dataclass) and are exposed as CLI args:

| Parameter | Default | Description |
|-----------|---------|-------------|
| `asr_chunk_duration_ms` | 10 | Audio chunk duration for ASR |
| `nmt_n_threads` | 4 | CPU threads for llama-cpp |
| `tts_stream_chunk_size` | 20 | AR tokens per vocoder chunk |
| `audio_queue_maxsize` | 256 | Audio input queue bound |
| `tts_queue_maxsize` | 16 | NMT→TTS text queue bound |
| `audio_out_queue_maxsize` | 32 | TTS→output audio queue bound |

## File Structure

```
src/
├── asr/
│   ├── streaming_asr.py            # StreamingASR wrapper
│   ├── pipeline.py                 # ThreadedSpeechTranslator (reference)
│   ├── cache_aware_modules.py      # Audio buffer + streaming ASR
│   ├── modules.py                  # ONNX model loading
│   └── utils.py                    # Audio utilities
├── nmt/
│   ├── streaming_nmt.py            # StreamingNMT wrapper
│   ├── streaming_segmenter.py      # Word-group segmentation
│   ├── streaming_translation_merger.py  # Translation merging
│   └── translator_module.py        # TranslateGemma via llama-cpp
├── tts/
│   ├── streaming_tts.py            # StreamingTTS wrapper
│   ├── xtts_streaming_pipeline.py  # Full TTS pipeline
│   ├── xtts_onnx_orchestrator.py   # GPT-2 AR + vocoder
│   ├── xtts_tokenizer.py           # BPE tokenizer
│   └── zh_num2words.py             # Chinese text normalization
├── pipeline/
│   ├── orchestrator.py             # PipelineOrchestrator
│   └── config.py                   # PipelineConfig
└── server/
    └── websocket_server.py         # WebSocket server
```
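The tunables in the Configuration table could be collected in a dataclass along these lines. This is a sketch only: the field names and defaults mirror the table above, but the real `PipelineConfig` in `pipeline/config.py` may carry additional fields:

```python
from dataclasses import dataclass

@dataclass
class PipelineConfig:
    # Defaults mirror the Configuration table.
    asr_chunk_duration_ms: int = 10    # audio chunk duration for ASR
    nmt_n_threads: int = 4             # CPU threads for llama-cpp
    tts_stream_chunk_size: int = 20    # AR tokens per vocoder chunk
    audio_queue_maxsize: int = 256     # audio input queue bound
    tts_queue_maxsize: int = 16        # NMT->TTS text queue bound
    audio_out_queue_maxsize: int = 32  # TTS->output audio queue bound

# CLI args would override individual fields, e.g.:
cfg = PipelineConfig(nmt_n_threads=8)
```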