Paper: [Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST](https://arxiv.org/abs/2509.14128)
This is nvidia/parakeet-tdt-0.6b-v3 converted from the NeMo `.nemo` format to safetensors for use with Modular's MAX inference framework.
The model weights are numerically identical to the original; only the file format and naming conventions have changed. For full model details, benchmarks, training data, and evaluation results, see the original NVIDIA model card.
| File | Description |
|---|---|
| `model.safetensors` | Encoder weights (NeMo names remapped, Conv2d permuted FCRS→RSCF) |
| `decoder_joint.npz` | LSTM prediction network + joint network (NumPy) |
| `config.json` | HF-style config with `normalize_features` and encoder/decoder/joint settings |
| `spiece.model` | SentencePiece BPE tokenizer (8192 tokens) |
| `tokenizer_config.json` | HF tokenizer config |
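The FCRS→RSCF permutation noted in the table is an axis reordering of the Conv2d weights: the PyTorch/NeMo layout `(filters, channels, rows, cols)` becomes `(rows, cols, channels, filters)`. A minimal sketch with a hypothetical weight tensor (the shape here is illustrative, not taken from the actual checkpoint):

```python
import numpy as np

# Hypothetical Conv2d weight in NeMo/PyTorch layout: (F, C, R, S)
# F = output filters, C = input channels, R/S = kernel rows/cols
w_fcrs = np.arange(2 * 3 * 4 * 5, dtype=np.float32).reshape(2, 3, 4, 5)

# RSCF layout used by the converted checkpoint: (R, S, C, F)
w_rscf = w_fcrs.transpose(2, 3, 1, 0)

print(w_rscf.shape)  # (4, 5, 3, 2)
# Same values, different axis order: w_rscf[r, s, c, f] == w_fcrs[f, c, r, s]
print(w_rscf[1, 2, 0, 1] == w_fcrs[1, 0, 1, 2])  # True
```

The values are untouched; only the memory layout changes, which is why the converted weights remain numerically identical to the original.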
Note: Parakeet architecture support is currently a prototype on a development branch and has not yet been merged into the official modular/modular repository; PRs are in progress.
Serve the model on CPU:

```bash
max serve --model-path pherber3/parakeet-tdt-0.6b-v3 --devices cpu
```
Transcribe audio via the OpenAI-compatible endpoint:
```bash
curl http://localhost:8000/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F "model=pherber3/parakeet-tdt-0.6b-v3"
```
Example response:

```json
{"text": "mister Quilter is the apostle of the middle classes, and we are glad to welcome his gospel."}
```
Converted from the original NeMo `.nemo` archive using `convert_nemo.py`:

```bash
python convert_nemo.py nvidia/parakeet-tdt-0.6b-v3 -o ./converted-tdt
```
```bibtex
@article{nvidia_parakeet_tdt_v3,
  title={Granary: Speech Recognition and Translation Dataset in 25 European Languages},
  author={NVIDIA},
  year={2025},
  url={https://arxiv.org/abs/2509.14128}
}
```
Base model: [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)