📂 Part of the Lance MLX collection on mlx-community.

Lance-3B-8bit (MLX, image specialist, 8-bit quantized)

8-bit group-wise affine quantization of mlx-community/Lance-3B-bf16, the image-specialist Lance checkpoint. Produced with mlx-lm's quantize_model utility and a per-tower skip predicate: time_embedder, llm2vae, and vae_in_proj are kept at bf16 for numerical safety, while the bulk LLM weights (attention projections, MLP, embeddings, lm_head) are quantized.

Status

🟢 Production-ready for image tasks on Apple Silicon as of 2026-05-21.

Capability	Status	Speedup vs bf16
t2i (text → image)	✅ Photorealistic, prompt-aligned	~2.7× faster (75 s vs 201 s for 768² × 30 steps × CFG=4.0)
image_edit (instruction-based)	✅ Identity + style preservation	~2.5× faster expected
x2t_image (image VQA)	✅ Content-correct	Similar or faster

Memory footprint: 6.59 GB on disk for the quantized LLM weights (53% of the 12.37 GB bf16 equivalent). Runtime RAM is ~8–10 GB, comfortable on a 16 GB Mac.
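
The 53% figure is close to what the recipe itself predicts. As a rough back-of-the-envelope check, assuming each group of 64 weights stores one 16-bit scale and one 16-bit bias alongside the 8-bit codes (an assumption about the storage layout, not a number taken from quantization_report.json):

```python
# Back-of-the-envelope check of the on-disk ratio for 8-bit affine, group_size 64.
# Assumption: one 16-bit scale and one 16-bit bias are stored per group.
BITS = 8
GROUP_SIZE = 64
SCALE_BITS = 16  # assumed
BIAS_BITS = 16   # assumed

def effective_bits_per_weight(bits, group_size, scale_bits, bias_bits):
    """Average storage cost per weight, including per-group metadata."""
    return bits + (scale_bits + bias_bits) / group_size

bpw = effective_bits_per_weight(BITS, GROUP_SIZE, SCALE_BITS, BIAS_BITS)
predicted_ratio = bpw / 16      # relative to bf16 (16 bits per weight)
reported_ratio = 6.59 / 12.37   # sizes quoted in this card

print(f"effective bits per weight: {bpw}")
print(f"predicted ratio vs bf16:   {predicted_ratio:.3f}")
print(f"reported ratio vs bf16:    {reported_ratio:.3f}")
```

The small remaining gap between the predicted and reported ratios is plausibly due to the tensors that stay at bf16 inside the same file.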

Quality notes vs bf16

Photorealism and content fidelity are preserved. Cats, dragons, portraits, and similar subjects all generate cleanly.
Fine text on generated objects degrades slightly. For example, "STOP" on a sign may render as "SNICS" or a similar near-miss. The rest of the image stays correct (right color, right rectangular sign shape, recognizable text-like glyphs).
For prompts that don't require legible in-image text, output is visually indistinguishable from bf16 to a casual eye.

Quickstart

from huggingface_hub import snapshot_download
weights = snapshot_download("mlx-community/Lance-3B-8bit")

Text-to-image

from lance_mlx.pipeline.t2i import TextToImagePipeline

pipe = TextToImagePipeline.from_pretrained(
    lance_weights_dir=weights,
    vae_safetensors=f"{weights}/vae.safetensors",
)
image = pipe.generate(
    "A photorealistic tabby cat in a sunlit window.",
    height=768, width=768, num_steps=30, cfg_scale=4.0, seed=42,
)
image.save("cat.png")

Image editing + VQA

Same API as the bf16 variant: ImageEditPipeline and UnderstandingPipeline both pick up the quantization block in config.json automatically via lance_mlx.model._loader.load_lance_model.
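
To confirm that a downloaded snapshot carries the quantization block the loader relies on, config.json can be inspected directly. This is a minimal sketch, assuming the block sits under a top-level "quantization" key with bits, group_size, and mode fields as described in the Files table. Both helpers are hypothetical and not part of lance_mlx.

```python
import json
from pathlib import Path

def read_quantization_block(weights_dir):
    """Return the `quantization` dict from config.json, or None if absent.

    Hypothetical helper for illustration; not part of lance_mlx.
    """
    config_path = Path(weights_dir) / "config.json"
    with config_path.open("r", encoding="utf-8") as f:
        config = json.load(f)
    return config.get("quantization")

def describe_quantization(weights_dir):
    """Summarize the quantization settings in one line."""
    q = read_quantization_block(weights_dir)
    if q is None:
        return "no quantization block (unquantized checkpoint?)"
    # Key names are assumed from this card's description of config.json.
    return f"bits={q.get('bits')}, group_size={q.get('group_size')}, mode={q.get('mode')}"
```

Calling describe_quantization(weights) with the snapshot path from the Quickstart should report the 8-bit, group_size 64, affine recipe if these assumptions hold.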

What's quantized vs skipped

Component	Quantization	Why
`embed_tokens` (151,936 × 2,048)	✅ 8-bit	Big, tolerant
`lm_head` (151,936 × 2,048)	✅ 8-bit	Big, used in AR decode only
32 layers × `q/k/v/o_proj` (UND)	✅ 8-bit	Bulk of LLM compute
32 layers × `q/k/v/o_proj_moe_gen` (GEN)	✅ 8-bit	Bulk of GEN compute
32 layers × `mlp.{up,gate,down}_proj`	✅ 8-bit	Bulk of LLM compute
32 layers × `mlp_moe_gen.{up,gate,down}`	✅ 8-bit	Bulk of GEN compute
`time_embedder.proj_in/out`	❌ bf16	Timestep info, numerically sensitive
`llm2vae` (flow head, 2048 × 48)	❌ bf16	Tiny + critical to flow prediction
`vae_in_proj.vae2llm` (2048 × 48)	❌ bf16	Auto-skipped (input_dim 48 is not a multiple of group_size 64)
`latent_pos_embed.pos_embed`	❌ bf16	Custom param holder, no `to_quantized`
All RMSNorms + QK-norms	❌ bf16	F32 / bf16 norm scales preserved
Wan2.2 VAE (encoder + decoder)	❌ bf16	Pixel fidelity matters
Qwen2.5-VL ViT	❌ bf16	Semantic fidelity matters for x2t
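
The skip rules in the table can be expressed as a predicate over module paths. The following is a minimal sketch, not the predicate actually used for this repo: the name should_quantize, the substring matching, and the example paths are illustrative assumptions, and the callback signature mlx-lm expects may differ between versions. It covers only Linear-style layers identified by path and input dimension.

```python
# Illustrative skip predicate mirroring the table above (hypothetical names).
GROUP_SIZE = 64

# Substrings of module paths that stay in bf16 (taken from the table).
SKIP_SUBSTRINGS = ("time_embedder", "llm2vae", "vae_in_proj", "latent_pos_embed")

def should_quantize(path, input_dim, group_size=GROUP_SIZE):
    """Return True if the layer at `path` should be quantized to 8-bit.

    path:      dotted module path, e.g. "layers.0.self_attn.q_proj"
    input_dim: size of the layer's input dimension
    """
    # Explicit skips: numerically sensitive or tiny projections.
    if any(s in path for s in SKIP_SUBSTRINGS):
        return False
    # Group-wise quantization needs the input dim to split into whole groups.
    if input_dim % group_size != 0:
        return False
    return True

# Example paths and the 2048 dims on skipped layers are placeholders,
# used only to exercise the rules; they are not read from the checkpoint.
for path, dim in [
    ("layers.0.self_attn.q_proj", 2048),
    ("layers.0.mlp.up_proj", 2048),
    ("time_embedder.proj_in", 2048),
    ("vae_in_proj.vae2llm", 48),
]:
    print(f"{path:28s} quantize={should_quantize(path, dim)}")
```

Keeping the divisibility check separate from the name list means a layer with an awkward input size is skipped even if nobody remembered to list it.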

Recipe: 8-bit affine, group_size 64. quantization_report.json in this repo has full provenance.
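
For readers unfamiliar with the recipe: group-wise affine quantization stores, for every group of 64 consecutive weights, an 8-bit code per weight plus a scale and a bias (offset) shared by the group. A minimal pure-Python sketch of one group's round trip follows. It illustrates the general technique and is not MLX's kernel, whose rounding and packing details may differ.

```python
import random

BITS = 8
GROUP_SIZE = 64
LEVELS = 2**BITS - 1  # 255 steps between the group's min and max

def quantize_group(w):
    """Map one group of floats to integer codes in [0, 255] plus (scale, bias)."""
    lo, hi = min(w), max(w)
    scale = (hi - lo) / LEVELS or 1.0  # fall back to 1.0 for constant groups
    codes = [round((x - lo) / scale) for x in w]
    return codes, scale, lo

def dequantize_group(codes, scale, bias):
    """Reconstruct approximate floats: w_hat = scale * code + bias."""
    return [scale * c + bias for c in codes]

random.seed(0)
group = [random.gauss(0.0, 0.02) for _ in range(GROUP_SIZE)]
codes, scale, bias = quantize_group(group)
restored = dequantize_group(codes, scale, bias)

max_err = max(abs(a - b) for a, b in zip(group, restored))
print(f"scale={scale:.3e}  max abs error={max_err:.3e}")
# Rounding to the nearest level bounds the error by about half a step.
assert max_err <= scale / 2 + 1e-12
```

An outlier weight widens its group's range and therefore the step size for all 64 values, which is why smaller groups usually reduce error at the cost of more per-group metadata.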

Why no Video 8-bit yet

The video specialist (Lance_3B_Video) does not quantize cleanly to 8-bit with this recipe: t2v output collapses to a gray gradient whether the GEN tower is quantized or skipped, and finer group sizes don't help. The video-specialist fine-tune appears to have weight distributions that affine 8-bit can't capture.

Reza2kn/lance-quant's findings suggest that DWQ (distilled weight quantization) with calibration is the right approach for Lance video at 8-bit and below. That's a Phase 5c project. For now, use mlx-community/Lance-3B-Video-bf16 for video tasks.

Files in this repo

File	Size	Notes
`model.safetensors`	6.59 GB	Quantized LLM weights (2033 tensors: each Linear becomes weight + scales + biases)
`vit.safetensors`	1.34 GB	bf16 (not quantized)
`vae.safetensors`	1.41 GB	bf16 (not quantized)
`config.json`	–	With `quantization` block (`bits=8, group_size=64, mode=affine`)
`quantization_report.json`	–	Provenance + footprint stats
`tokenizer.json` / `vocab.json`	–	Qwen2.5-VL vocabulary

Architecture (same as the bf16 variant)

See mlx-community/Lance-3B-bf16 for the full architecture description.

License

This MLX port + quantization: Apache 2.0.

Underlying weights:

Lance: Apache 2.0 (ByteDance Intelligent Creation Lab).
Wan2.2 VAE: Apache 2.0 (Alibaba).
Qwen2.5-VL: Apache 2.0 (Alibaba).

Citation

@article{fu2026lance,
  title={Lance: Unified Multimodal Modeling by Multi-Task Synergy},
  author={Fu, Fengyi and Huang, Mengqi and Wu, Shaojin and others},
  journal={arXiv preprint arXiv:2605.18678},
  year={2026}
}
