MERaLiON-3-10B-MLX-8bit

8-bit MLX quantization of MERaLiON/MERaLiON-3-10B, the instruction-tuned Singaporean multilingual speech-language model from A*STAR / I²R.

This is the instruct variant: MERaLiON-3-10B is published only as an instruction-tuned (IT) model, and its decoder is Gemma-2-9B-it.

At a glance

Source: MERaLiON/MERaLiON-3-10B (bf16, ~19 GB)
Format: MLX (Apple Silicon native)
Quantization: 8-bit affine, group size 64
Disk size: 11.3 GB (~40% reduction)
Decoder: Gemma-2-9B-it (quantized)
Encoder: Whisper-large-v3 derived (preserved in bf16)
Adaptor: Speech projector (preserved in bf16)
Tensors quantized: 294
Tensors preserved in bf16: 170 (norms, embeddings, biases, 1-D params, full encoder and adaptor)

Why preserve encoder + adaptor in bf16?

The Whisper encoder and the speech-to-text adaptor are small relative to the 9B decoder and are highly sensitive to quantization noise. Quantizing only the decoder (where >95% of parameters live) captures almost all of the size reduction while protecting the parts of the model most prone to quality degradation under aggressive quantization.

Norms, embeddings, biases, and 1-D parameters are also preserved in bf16, following standard MLX quantization practice.
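The selection rule described above can be sketched as a small predicate over tensor names and shapes. This is an illustrative sketch, not the script used to build this checkpoint: the `text_decoder.` prefix and the example tensor names are assumptions, and the real parameter names may differ.

```python
# Hypothetical prefix for decoder parameters; the real checkpoint may name them differently.
DECODER_PREFIX = "text_decoder."


def should_quantize(name, shape, group_size=64):
    """Return True if a tensor would be quantized under the policy described above."""
    if not name.startswith(DECODER_PREFIX):
        return False  # Whisper encoder and speech adaptor stay in bf16
    if len(shape) < 2:
        return False  # norms, biases, and other 1-D parameters stay in bf16
    if "embed" in name or "norm" in name:
        return False  # embedding tables and norm weights stay in bf16
    if not name.endswith(".weight"):
        return False  # only linear-layer weight matrices are candidates
    # Group-wise quantization needs the input dimension to split evenly into groups.
    return shape[-1] % group_size == 0
```

With a 42-layer Gemma-2 decoder and seven linear projections per layer (q, k, v, o, gate, up, down), a rule like this selects 7 × 42 = 294 tensors, which is consistent with the quantized tensor count reported above. In MLX itself, a comparable rule would normally be supplied as the `class_predicate` argument of `mlx.nn.quantize`, which receives a module path and module rather than a tensor name and shape.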

Intended use

Speech understanding and transcription for Singapore-context English, Mandarin, Malay, and Tamil — including code-switched speech and Singlish — running locally on Apple Silicon.

Usage

This model uses MERaLiON-3's custom architecture (MERaLiON3ForConditionalGeneration). It is not directly loadable via stock mlx-lm text-only loaders; you need an MLX runner that knows about the MERaLiON-3 multimodal pipeline (encoder → adaptor → quantized Gemma-2 decoder).

# Pseudocode — exact import path depends on your MLX MERaLiON loader
from your_meralion_mlx_loader import load

model = load("majentik/MERaLiON-3-10B-MLX-8bit")
text = model.transcribe("path/to/audio.wav", language="en")

For reference on the model's input format and prompting, see the original MERaLiON-3-10B model card.
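Before wiring up a loader, it may help to confirm which quantization settings a downloaded copy declares. The sketch below assumes the repository follows the common mlx-lm convention of recording a `quantization` entry in `config.json`; whether this build does so is an assumption, and `read_quantization` is a hypothetical helper name.

```python
import json
from pathlib import Path


def read_quantization(model_dir):
    """Return the `quantization` dict from config.json, or None if it is absent."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text())
    return config.get("quantization")
```

Calling `read_quantization` on the local model directory would be expected to return something like `{"group_size": 64, "bits": 8}` if the convention is followed, and `None` otherwise.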

Quantization recipe

Built using MLX's standard affine quantization:

  • bits: 8
  • group_size: 64
  • mode: affine

Only the decoder's linear layers were quantized. The Whisper encoder, the speech adaptor, norm weights, embedding tables, biases, and any other 1-D tensors were left in bf16.
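To make the recipe concrete, here is a simplified pure-Python sketch of group-wise affine quantization: each group of 64 weights is mapped to integer codes in 0..255 using a per-group scale and bias, and reconstructed as `code * scale + bias`. It illustrates the idea only and is not MLX's implementation. The storage estimate assumes one 16-bit scale and one 16-bit bias per group.

```python
import random


def quantize_group(weights, bits=8):
    """Affine-quantize one group of floats; return (codes, scale, bias)."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1               # 255 steps for 8-bit
    scale = (hi - lo) / levels or 1.0      # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo                # bias is the group minimum


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights from codes."""
    return [c * scale + bias for c in codes]


def bits_per_weight(bits=8, group_size=64, meta_bits=16):
    """Effective cost per weight, assuming one scale and one bias per group."""
    return bits + 2 * meta_bits / group_size


random.seed(0)
group = [random.gauss(0.0, 0.02) for _ in range(64)]  # one group of 64 weights
codes, scale, bias = quantize_group(group)
restored = dequantize_group(codes, scale, bias)
max_err = max(abs(a - b) for a, b in zip(group, restored))

print(f"effective bits per weight: {bits_per_weight()}")
print(f"max reconstruction error is at most scale / 2: {max_err <= scale / 2 + 1e-12}")
```

Under these assumptions the quantized layers cost 8.5 bits per weight rather than 8, which is part of why the on-disk reduction from bf16 is smaller than a plain 50%. The other part is that the encoder, adaptor, and embeddings remain in bf16.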

Hardware requirements

  • Apple Silicon (M1 or newer)
  • ~12 GB free RAM for inference
  • macOS with MLX installed (pip install mlx mlx-lm)
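A quick way to check the first requirement from Python is sketched below. `is_apple_silicon` is a hypothetical helper; it relies only on the standard `platform` module, and it will report `False` for an x86_64 Python interpreter running under Rosetta even on Apple Silicon hardware.

```python
import platform


def is_apple_silicon(system=None, machine=None):
    """Return True when running natively on macOS with an arm64 CPU."""
    system = system or platform.system()     # "Darwin" on macOS
    machine = machine or platform.machine()  # "arm64" on native Apple Silicon
    return system == "Darwin" and machine == "arm64"


if not is_apple_silicon():
    print("This build targets Apple Silicon; MLX inference may not be available here.")
```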

License

This model inherits the license of the source model: the MERaLiON Public License v2. See the source model's repository for the full license text.

Per the MERaLiON Public License, both research and commercial use are permitted subject to the license terms. Users are responsible for compliance.

Citation

If you use this model, please cite the original MERaLiON authors:

@misc{meralion3,
  title  = {MERaLiON-3: A Multilingual Speech-Language Model for Singapore},
  author = {A*STAR Institute for Infocomm Research (I²R)},
  year   = {2025},
  url    = {https://huggingface.co/MERaLiON/MERaLiON-3-10B}
}

Acknowledgements

  • MERaLiON team at A*STAR I²R for the source model.
  • MLX team at Apple for the quantization framework.
  • This 8-bit MLX build was produced as part of the Majentik on-device speech work.