📂 Part of the Lance MLX collection on mlx-community.

Lance-3B-8bit (MLX, image specialist, 8-bit quantized)

8-bit group-wise affine quantization of mlx-community/Lance-3B-bf16, the image-specialist Lance checkpoint. Produced with mlx-lm's quantize_model utility and a per-tower skip predicate: time_embedder, llm2vae, and vae_in_proj stay at bf16 for numerical safety, while the bulk LLM weights (attention projections, MLP, embeddings, lm_head) are quantized.
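For readers new to the scheme, group-wise affine quantization can be sketched in plain Python. This is an illustration of the idea only, not the MLX kernel or the code used to build this repo: the helper names are hypothetical, and the min/max rounding rule is the textbook form, assumed here for clarity.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], bits: int = 8) -> Tuple[List[int], float, float]:
    """Affine-quantize one group so that w is approximated by scale * q + bias."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1              # 255 integer steps for 8-bit
    scale = (hi - lo) / levels or 1.0     # fall back to 1.0 for a constant group
    q = [round((w - lo) / scale) for w in weights]
    return q, scale, lo                   # lo plays the role of the per-group bias


def dequantize_group(q: List[int], scale: float, bias: float) -> List[float]:
    """Reconstruct approximate weights from the integer codes."""
    return [scale * v + bias for v in q]


def quantize_row(row: List[float], group_size: int = 64, bits: int = 8):
    """Split a weight row into groups, each with its own scale and bias."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    return [
        quantize_group(row[i:i + group_size], bits)
        for i in range(0, len(row), group_size)
    ]
```

Because every group of 64 weights carries its own scale and bias, the rounding error of each weight is bounded by half of that group's scale rather than by the range of the whole tensor.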

Status

🟢 Production-ready for image tasks on Apple Silicon as of 2026-05-21.

| Capability | Status | Speedup vs bf16 |
|---|---|---|
| t2i (text → image) | ✅ Photorealistic, prompt-aligned | ~2.7× faster (75 s vs 201 s for 768² × 30 steps × CFG=4.0) |
| image_edit (instruction-based) | ✅ Identity + style preservation | ~2.5× faster expected |
| x2t_image (image VQA) | ✅ Content-correct | similar / faster |

Memory footprint: 6.59 GB on disk for the quantized LLM weights (53% of the 12.37 GB bf16 figure). Runtime RAM is ~8–10 GB, which fits comfortably on a 16 GB Mac.
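The ratios quoted above can be checked with a few lines of arithmetic. The sketch also estimates the storage cost per weight, under the assumption (not stated in this card) that each group of 64 weights stores one 16-bit scale and one 16-bit bias next to the 8-bit codes.

```python
# Figures quoted in this card.
bf16_gb, q8_gb = 12.37, 6.59
bf16_seconds, q8_seconds = 201, 75

size_ratio = q8_gb / bf16_gb           # fraction of the bf16 size
speedup = bf16_seconds / q8_seconds    # t2i wall-clock speedup

# Assumption: one 16-bit scale and one 16-bit bias per group of 64 weights.
group_size, bits, overhead_bits = 64, 8, 2 * 16
bits_per_weight = bits + overhead_bits / group_size
ideal_ratio = bits_per_weight / 16     # relative to 16-bit bf16 weights

print(f"size: {size_ratio:.1%} of bf16, speedup: {speedup:.2f}x")
print(f"{bits_per_weight} bits/weight, ideal ratio {ideal_ratio:.1%}")
```

Under that assumption a fully quantized tensor costs 8.5 bits per weight, which would explain why the file lands slightly above half of the bf16 size; the tensors kept at bf16 push the measured ratio a little higher still.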

Quality notes vs bf16

  • Photorealism + content fidelity preserved. Cats, dragons, portraits, etc., all generate cleanly.
  • Fine text on generated objects degrades slightly: "STOP" on a sign may render as "SNICS" or a similar near-miss. The rest of the content stays correct (color, rectangular sign shape, recognizable text-like glyphs).
  • For prompts that don't require legible in-image text, output is visually indistinguishable from bf16 to a casual eye.
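To put a number on "visually indistinguishable", one option is to render the same prompt and seed with both variants and compare the pixels. The sketch below computes PSNR over flat sequences of 8-bit channel values; the function name is hypothetical, and PSNR is only one possible metric, not the one used to write these notes.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[int], candidate: Sequence[int], peak: int = 255) -> float:
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(candidate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error across all channel values.
    mse = sum((a - b) ** 2 for a, b in zip(reference, candidate)) / len(reference)
    # Identical images have zero error, so PSNR is unbounded.
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)
```

With Pillow installed, a flat pixel sequence for a saved image can be obtained with `list(Image.open(path).convert("RGB").tobytes())`. Higher values mean the 8-bit output is closer to the bf16 reference.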

Quickstart

from huggingface_hub import snapshot_download
weights = snapshot_download("mlx-community/Lance-3B-8bit")

Text-to-image

from lance_mlx.pipeline.t2i import TextToImagePipeline

pipe = TextToImagePipeline.from_pretrained(
    lance_weights_dir=weights,
    vae_safetensors=f"{weights}/vae.safetensors",
)
image = pipe.generate(
    "A photorealistic tabby cat in a sunlit window.",
    height=768, width=768, num_steps=30, cfg_scale=4.0, seed=42,
)
image.save("cat.png")

Image editing + VQA

Same API as the bf16 variant — ImageEditPipeline and UnderstandingPipeline both pick up the quantization block in config.json automatically via lance_mlx.model._loader.load_lance_model.
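To see what the loader has to work with, the quantization block can be read straight from config.json. This is a minimal sketch, not the lance_mlx loader: the top-level key name "quantization" is an assumption based on common mlx-lm convention, and `read_quantization` is a hypothetical helper.

```python
import json
from typing import Optional


def read_quantization(config_text: str) -> Optional[dict]:
    """Return the quantization settings from a config.json string, or None if absent."""
    config = json.loads(config_text)
    block = config.get("quantization")  # assumed key name
    if block is None:
        return None  # an unquantized (bf16) checkpoint would have no block
    return {
        "bits": block["bits"],
        "group_size": block["group_size"],
        "mode": block.get("mode", "affine"),
    }


# Example with the values this card reports for config.json.
sample = '{"quantization": {"bits": 8, "group_size": 64, "mode": "affine"}}'
settings = read_quantization(sample)
```

For the downloaded repo, the same helper could be called on the contents of `f"{weights}/config.json"`.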

What's quantized vs skipped

| Component | Quantization | Why |
|---|---|---|
| embed_tokens (151,936 × 2,048) | ✅ 8-bit | Big, tolerant |
| lm_head (151,936 × 2,048) | ✅ 8-bit | Big, used in AR decode only |
| 32 layers × q/k/v/o_proj (UND) | ✅ 8-bit | Bulk of LLM compute |
| 32 layers × q/k/v/o_proj_moe_gen (GEN) | ✅ 8-bit | Bulk of GEN compute |
| 32 layers × mlp.{up,gate,down}_proj | ✅ 8-bit | Bulk of LLM compute |
| 32 layers × mlp_moe_gen.{up,gate,down} | ✅ 8-bit | Bulk of GEN compute |
| time_embedder.proj_in/out | ❌ bf16 | Timestep info, numerically sensitive |
| llm2vae (flow head, 2048 × 48) | ❌ bf16 | Tiny + critical to flow prediction |
| vae_in_proj.vae2llm (2048 × 48) | ❌ bf16 | Auto-skipped (input_dim 48 ≠ 64*k) |
| latent_pos_embed.pos_embed | ❌ bf16 | Custom param holder, no to_quantized |
| All RMSNorms + QK-norms | ❌ bf16 | F32 / bf16 norm scales preserved |
| Wan2.2 VAE (encoder + decoder) | ❌ bf16 | Pixel fidelity matters |
| Qwen2.5-VL ViT | ❌ bf16 | Semantic fidelity matters for x2t |

Recipe: 8-bit affine, group_size 64. quantization_report.json in this repo has full provenance.
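The skip rules in the table can be expressed as a small predicate. The sketch below is hypothetical: the substring list, the function name, and the example module paths are assumptions for illustration, not the code that produced this repo, and the callback signature mlx-lm expects depends on its version.

```python
# Name fragments of modules kept at bf16 (assumed from the table above).
SKIP_SUBSTRINGS = ("time_embedder", "llm2vae", "vae_in_proj", "latent_pos_embed", "norm")


def should_quantize(path: str, input_dim: int, group_size: int = 64) -> bool:
    """Decide whether a layer at `path` with `input_dim` inputs gets quantized."""
    # Explicit skips: numerically sensitive or tiny modules stay at bf16.
    if any(fragment in path for fragment in SKIP_SUBSTRINGS):
        return False
    # Group-wise quantization needs the input dimension to split into whole groups.
    return input_dim % group_size == 0


# Hypothetical module paths, for illustration only.
decisions = {
    "layers.0.self_attn.q_proj": should_quantize("layers.0.self_attn.q_proj", 2048),
    "time_embedder.proj_in": should_quantize("time_embedder.proj_in", 2048),
    "vae_in_proj.vae2llm": should_quantize("vae_in_proj.vae2llm", 48),
}
```

The divisibility check mirrors the "auto-skipped" row: an input dimension of 48 cannot be split into groups of 64, so such a layer stays at bf16 even without a named skip.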

Why no Video 8-bit yet

The video specialist (Lance_3B_Video) does not quantize cleanly to 8-bit with this recipe — t2v output collapses to a gray gradient regardless of whether the GEN tower is included or skipped, and finer group_sizes don't help. The video-specialist fine-tune has different weight distributions that affine 8-bit can't capture.

Reza2kn/lance-quant's findings suggest DWQ (distilled weight quantization) with calibration is the right approach for Lance video at 8-bit and below. That's a Phase 5c project. For now, use the bf16 checkpoint mlx-community/Lance-3B-Video-bf16 for video tasks.

Files in this repo

| File | Size | Notes |
|---|---|---|
| model.safetensors | 6.59 GB | Quantized LLM weights (2033 tensors: each Linear becomes weight + scales + biases) |
| vit.safetensors | 1.34 GB | bf16 (not quantized) |
| vae.safetensors | 1.41 GB | bf16 (not quantized) |
| config.json | | With quantization block (bits=8, group_size=64, mode=affine) |
| quantization_report.json | | Provenance + footprint stats |
| tokenizer.json / vocab.json | | Qwen2.5-VL vocabulary |
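Since each quantized Linear is stored as a weight, scales, and biases triplet, the tensor names alone show which layers were quantized. The helper below is a hypothetical sketch that counts such triplets in a list of names; the example names are made up, and the `.weight` / `.scales` / `.biases` suffixes are assumed from the note in the table.

```python
from collections import Counter
from typing import Iterable


def summarize_tensors(names: Iterable[str]) -> Counter:
    """Count tensors and the layers stored as weight + scales + biases triplets."""
    unique = set(names)
    summary = Counter(total_tensors=len(unique), quantized_layers=0)
    for name in unique:
        if name.endswith(".scales"):
            prefix = name[: -len(".scales")]
            # A layer counts as quantized only if all three parts are present.
            if f"{prefix}.weight" in unique and f"{prefix}.biases" in unique:
                summary["quantized_layers"] += 1
    return summary


# Made-up names for illustration: one quantized projection, one bf16 norm.
example = summarize_tensors([
    "layers.0.q_proj.weight",
    "layers.0.q_proj.scales",
    "layers.0.q_proj.biases",
    "layers.0.input_norm.weight",
])
```

For the real file, the names could come from the `keys()` of a `safetensors.safe_open` handle on model.safetensors.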

Architecture (same as the bf16 variant)

See mlx-community/Lance-3B-bf16 for the full architecture description.

License

This MLX port + quantization: Apache 2.0.

Underlying weights:

  • Lance: Apache 2.0 (ByteDance Intelligent Creation Lab).
  • Wan2.2 VAE: Apache 2.0 (Alibaba).
  • Qwen2.5-VL: Apache 2.0 (Alibaba).

Citation

@article{fu2026lance,
  title={Lance: Unified Multimodal Modeling by Multi-Task Synergy},
  author={Fu, Fengyi and Huang, Mengqi and Wu, Shaojin and others},
  journal={arXiv preprint arXiv:2605.18678},
  year={2026}
}
