Magpie TTS Multilingual 357M (CoreML)

CoreML export of NVIDIA's Magpie TTS Multilingual 357M, optimized for on-device inference on Apple Silicon. Ships both .mlmodelc (compiled, ready-to-run) and .mlpackage (portable source) for macOS 14+ / iOS 17+.

Converted with FluidInference/mobius. Consumed by the Swift port in FluidInference/FluidAudio (see Sources/FluidAudio/TTS/Magpie/).

Languages

English, Spanish, German, French, Italian, Vietnamese, Mandarin, and Hindi are supported; Japanese is not yet included. French, Italian, and Vietnamese use ByT5 byte-level tokenization (purely algorithmic, so no lookup files are needed).
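ByT5-style byte-level tokenization operates directly on UTF-8 bytes, which is why those three languages ship no dictionary files. A minimal sketch, assuming the reference ByT5 convention of a +3 offset for special tokens (pad=0, eos=1, unk=2) — verify the actual special-token ids against constants/constants.json:

```swift
// Sketch of ByT5-style byte-level tokenization (no lookup files needed).
// The +3 offset (pad=0, eos=1, unk=2) follows the reference ByT5 vocab;
// this model's real ids live in constants/constants.json.
func byt5Tokenize(_ text: String) -> [Int] {
    Array(text.utf8).map { Int($0) + 3 }
}
```

Multi-byte characters expand to one id per UTF-8 byte, so "été" yields five ids, not three.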

Contents

├── manifest.json                       # machine-readable index (sha256, shapes, IO specs)
├── text_encoder.{mlmodelc,mlpackage}   # text → (1, 256, 768) encoder output
├── decoder_step.{mlmodelc,mlpackage}   # 12-layer AR decoder (stateful KV cache)
├── decoder_prefill.{mlmodelc,mlpackage} # batched prefill fast path
├── nanocodec_decoder.{mlmodelc,mlpackage} # 8-codebook → PCM vocoder, 22050 Hz, max 256 frames
├── constants/
│   ├── constants.json                  # d_model, n_layers, special token ids, sampling defaults
│   ├── speaker_info.json               # 5 speakers, context shape (110, 768)
│   ├── tokenizer_{info,metadata,references}.json
│   ├── speaker_0.npy .. speaker_4.npy
│   ├── speaker_embeddings_raw.npy
│   ├── text_embedding.npy
│   ├── audio_embedding_0.npy .. audio_embedding_7.npy   # per-codebook (2024, 768)
│   └── local_transformer/              # 1-layer transformer weights (Swift reads .npy)
└── tokenizer/
    ├── english_phoneme_*.json          # phoneme_dict + token2id + heteronyms
    ├── spanish_phoneme_*.json          # phoneme_dict + token2id
    ├── german_phoneme_*.json           # phoneme_dict + token2id + heteronyms
    ├── mandarin_phoneme_*.json         # token2id, phoneme_dict, pinyin/tone/ascii dicts
    ├── mandarin_jieba_dict.json
    ├── mandarin_pypinyin_{char,phrase}_dict.json
    └── hindi_chartokenizer_token2id.json

Prefer .mlmodelc at runtime; .mlpackage is included for inspection and re-compilation. Consult manifest.json for the full asset table (file sizes, sha256, model IO shapes per layer).
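If you are consuming the assets outside FluidAudio, manifest.json can be decoded with a plain Codable struct. The field names below are assumptions for illustration — check the file itself for the real schema:

```swift
import Foundation

// Hypothetical manifest entry -- the actual manifest.json schema may use
// different field names; this only shows the Codable decoding pattern.
struct ManifestEntry: Codable {
    let file: String
    let sha256: String
}

let sample = Data("""
[{"file": "text_encoder.mlmodelc", "sha256": "deadbeef"}]
""".utf8)
let entries = try! JSONDecoder().decode([ManifestEntry].self, from: sample)
```

At runtime, load the .mlmodelc bundles directly with `MLModel(contentsOf:configuration:)`; an .mlpackage must first be compiled on-device via `MLModel.compileModel(at:)`.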

Usage (Swift)

import FluidAudio

let manager = try await MagpieTtsManager.downloadAndCreate(
    languages: [.english, .spanish]
)
let result = try await manager.synthesize(
    text: "Hello | ˈ n ɛ m o ʊ | from FluidAudio.",
    speaker: .john,
    language: .english
)
let wav = AudioWAV.data(from: result.samples, sampleRate: result.sampleRate)
try wav.write(to: URL(fileURLWithPath: "hello.wav"))

The manager lazily downloads everything in this repo on first use.
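`AudioWAV.data(from:sampleRate:)` above packages the returned float samples as a WAV file. For reference, a standalone sketch of what such a helper plausibly does, assuming 16-bit mono PCM output — FluidAudio's actual implementation may differ:

```swift
import Foundation

// Minimal 16-bit mono PCM WAV writer. This is an illustrative stand-in for
// a helper like AudioWAV.data(from:sampleRate:), not FluidAudio's code.
func wavData(samples: [Float], sampleRate: Int) -> Data {
    var data = Data()
    func append(_ s: String) { data.append(contentsOf: Array(s.utf8)) }
    func appendU32(_ v: UInt32) { withUnsafeBytes(of: v.littleEndian) { data.append(contentsOf: $0) } }
    func appendU16(_ v: UInt16) { withUnsafeBytes(of: v.littleEndian) { data.append(contentsOf: $0) } }

    let dataSize = UInt32(samples.count * 2)      // mono, 2 bytes/sample

    append("RIFF"); appendU32(36 + dataSize); append("WAVE")
    append("fmt "); appendU32(16)                 // PCM fmt chunk size
    appendU16(1)                                  // audio format: PCM
    appendU16(1)                                  // channels: mono
    appendU32(UInt32(sampleRate))
    appendU32(UInt32(sampleRate * 2))             // byte rate
    appendU16(2)                                  // block align
    appendU16(16)                                 // bits per sample
    append("data"); appendU32(dataSize)
    for s in samples {
        let clamped = max(-1.0, min(1.0, s))
        appendU16(UInt16(bitPattern: Int16(clamped * 32767)))
    }
    return data
}
```

The 22050 Hz sample rate in `result.sampleRate` matches the nanocodec decoder's output rate listed above.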

Inline IPA override

Text enclosed in |...| is passed straight to the tokenizer as whitespace-separated IPA tokens:

"Hello | ˈ n ɛ m o ʊ | world"
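Splitting text into plain and IPA segments could look like the sketch below. The `Segment` type and the even/odd splitting rule are illustrative assumptions; FluidAudio's Magpie tokenizer is the authoritative implementation:

```swift
import Foundation

// Split input into plain-text and inline-IPA segments. The "|...|" delimiter
// is from the model card; the parsing details here are a sketch.
enum Segment: Equatable {
    case text(String)
    case ipa([String])   // whitespace-separated IPA tokens, passed through verbatim
}

func parseInlineIPA(_ input: String) -> [Segment] {
    input.split(separator: "|", omittingEmptySubsequences: false)
        .enumerated()
        .compactMap { (index, piece) -> Segment? in
            let trimmed = piece.trimmingCharacters(in: .whitespaces)
            guard !trimmed.isEmpty else { return nil }
            // Odd-indexed pieces sit between a delimiter pair -> IPA.
            return index.isMultiple(of: 2)
                ? .text(trimmed)
                : .ipa(trimmed.split(separator: " ").map(String.init))
        }
}
```

Under this scheme the example above yields `[.text("Hello"), .ipa(["ˈ", "n", "ɛ", "m", "o", "ʊ"]), .text("world")]`.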
