📂 Part of the Lance MLX collection on mlx-community.

Lance-3B-Video-bf16 (MLX, video specialist)

MLX port of ByteDance Intelligent Creation Lab's Lance — the video-specialist Lance_3B_Video checkpoint, converted to bf16 for Apple Silicon. ~6.44 B LLM parameters + 669 M Qwen2.5-VL ViT bundled, with the 126,976-entry latent_pos_embed table needed for video-scale latent grids.

Lance is ByteDance's 3B-active unified multimodal model (paper, code, HF original). This is not Lance/LanceDB, the columnar data format.

Status — 🚧 t2v port quality issue under investigation (2026-05-21)

Honest correction: an earlier version of this model card described Lance_3B_Video's painterly t2v aesthetic as "by design." Direct comparison against the Phase 0 PyTorch oracle (same weights, same prompt, same seed/scale at 768×768×50f) shows this was wrong — PyTorch Lance_3B_Video produces photorealistic 3D-cinematic output. Our MLX t2v port consistently produces softer painterly renderings instead.

This is a port-side bug (likely numerical or routing). Tracking and debugging continue in xocialize/lance-mlx issue #2. The leading candidate at the moment is the MaPE temporal anchor we apply (ANCHOR_VIDEO_GEN=2000), whereas upstream's shift_position_ids does not actually fire for pure t2v.

| Capability | Status | Notes |
|---|---|---|
| t2v at 256² × ≤25f | 🟡 End-to-end works, painterly | Subject + scene recognizable; quality below PyTorch reference |
| t2v at 768² × ≤25f | 🟡 End-to-end works, painterly | Same: content correct, fidelity blocked on port fix |
| t2v at oracle scale (768²×50f) | ⏳ Not yet measured at oracle scale in MLX | Definitive test pending |
| x2t_video (video VQA / captioning) | ✅ Validated against Phase 0 oracle | Unaffected by t2v bug (ViT + UND-tower path only) |
| video_edit (instruction-based) | 🚧 Inherits t2v quality issue | End-to-end runs; cinematic quality awaits t2v fix |

For production-quality image tasks (t2i, image_edit, x2t_image), use mlx-community/Lance-3B-bf16 (or mlx-community/Lance-3B-8bit for 16 GB Macs). Those reproduce the PyTorch reference quality.

| Capability | Status | Notes |
|---|---|---|
| t2v at 256×256 × 16f | ✅ Works | ~33 s/clip on M5 Max. |
| t2v at 512×512 × 16f | ✅ Works | ~60 s/clip. |
| t2v at 768×768 × 13f (n_lat=9.2k) | ✅ Works | ~2.5 min/clip. Recognizable subjects (red panda with cap → "dog with hat"). |
| t2v at 768×768 × 17f (n_lat=11.5k) | ✅ Works | ~20 min/clip. "Five balls on a wooden table" → recognizable balls on wood texture, varied colors. |
| t2v at 768×768 × 25f (n_lat=16.1k) | 🟡 Validated | See commit notes. |
| t2v at 768×768 × 49f (n_lat=30k) | ⚠️ Functional but slow | ~2¼ h/clip on M5 Max. Memory and time become impractical for casual use. |
| x2t_video (video VQA / captioning) | ✅ Validated against Phase 0 oracle | Cooking-video VQA produces content-correct 256-token caption (kitchen + pan + spatula + tomato + meat + stirring all matched) in 17.5 s. |
| video_edit (instruction-based) | ✅ Functional | "Change all the balls to a deep red color." → balls recolored, composition preserved. 17 frames × 256² in 81.6 s. |

For production-quality photorealistic image tasks (t2i, image_edit, x2t_image), use the sibling repo mlx-community/Lance-3B-bf16; Lance_3B is the image specialist, with a crystal-clear photorealistic aesthetic.

Why "painterly" is the wrong framing

Earlier we believed Lance_3B_Video's painterly aesthetic was a deliberate fine-tune choice. The Phase 4c per-tensor diff (_moe_gen QK-norms differ by 0.5–0.85 in 6+ layers; lm_head and embed_tokens byte-identical with Lance_3B) is real — but the conclusion we drew from it was wrong. The Phase 0 PyTorch oracle, generated from these exact weights at validation_timestep_shift 3.5, cfg_text_scale 4.0, validation_num_timesteps 30, produces clean photorealistic / 3D-cinematic output. Same model, same config — different aesthetic. That means the painterly look we see is our MLX port doing something wrong, not the model expressing an intentional style.

Why a separate "Video" checkpoint?

ByteDance ships two variants of Lance that differ in fine-tuning:

  • Lance_3B — image specialist. Crystal-clear photorealistic t2i.
  • Lance_3B_Video — video specialist. Same architecture, further fine-tuned on video data. Bundles the Qwen2.5-VL ViT (669 M) and the larger 126,976-entry latent_pos_embed table that addresses video-resolution token grids.

Quickstart

Install from the lance-mlx source repo:

git clone https://github.com/xocialize/lance-mlx
cd lance-mlx && uv sync

Download this checkpoint:

from huggingface_hub import snapshot_download
weights = snapshot_download("mlx-community/Lance-3B-Video-bf16")

Text-to-video

from lance_mlx.pipeline.t2v import TextToVideoPipeline

pipe = TextToVideoPipeline.from_pretrained(
    lance_weights_dir=weights,
    vae_safetensors=f"{weights}/vae.safetensors",
)
frames = pipe.generate(
    "Five balls on a wooden table: two blue, three green.",
    num_frames=17, height=768, width=768,
    num_steps=30, cfg_scale=4.0, seed=42,
)
# frames is np.ndarray of shape (T_decoded, H, W, 3) uint8

Encode to MP4 with imageio (the libx264 writer needs an ffmpeg backend such as the imageio-ffmpeg package):

import imageio
with imageio.get_writer("out.mp4", fps=12, codec="libx264") as writer:
    for f in frames:
        writer.append_data(f)

Video understanding

from lance_mlx.pipeline.understanding import UnderstandingPipeline

pipe = UnderstandingPipeline.from_pretrained(
    lance_weights_dir=weights,
    vit_safetensors=f"{weights}/vit.safetensors",
)
answer = pipe.generate_video(
    video="my_video.mp4",
    question="Describe what happens in this video.",
    num_sample_frames=16, target_h=224, target_w=224,
    max_new_tokens=256, prompt_style="lance",
)
print(answer)

Validated content-correct against the Phase 0 oracle's cooking VQA case (kitchen + pan + spatula + tomato + meat + stirring matched).

Video editing

from lance_mlx.pipeline.video_edit import VideoEditPipeline

pipe = VideoEditPipeline.from_pretrained(
    lance_weights_dir=weights,
    vae_safetensors=f"{weights}/vae.safetensors",
)
frames = pipe.generate(
    input_video="my_video.mp4",
    instruction="Change all the balls to a deep red color.",
    height=256, width=256, num_frames=17,
    num_steps=30, cfg_scale=4.0, seed=42,
)

Performance (M5 Max 128 GB)

| Task | Configuration | Wall-clock |
|---|---|---|
| t2v | 256² × 16f, 30 steps, CFG=4.0 | ~33 s |
| t2v | 512² × 16f, 30 steps, CFG=4.0 | ~60 s |
| t2v | 768² × 13f, 30 steps, CFG=4.0 | ~145 s |
| t2v | 768² × 17f, 30 steps, CFG=4.0 | ~20 min |
| t2v | 768² × 49f, 30 steps, CFG=4.0 | ~2¼ hours (impractical) |

CFG doubles the forward cost because the conditional and unconditional passes run sequentially. Attention scales as O(N²) in the latent-token count, so combinations of high frame count and high resolution quickly become impractical. A KV cache for the text prefix is a Phase 5 follow-up.
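To make the O(N²) point concrete, here is a minimal sketch of how the latent-token count grows with clip size. It assumes the 16× spatial / 4× temporal compression listed under Architecture, a causal VAE that keeps the first frame and compresses the rest (the `(T - 1) // 4 + 1` term is an assumption, though it reproduces the n_lat values quoted in the capability table), and one token per latent position. The function name `latent_tokens` is hypothetical and not part of lance-mlx.

```python
def latent_tokens(height, width, num_frames, spatial=16, temporal=4):
    """Estimate latent tokens for one clip.

    Assumption: the first frame is kept and the remaining frames are
    compressed by `temporal`, with one token per latent position.
    """
    if height % spatial or width % spatial:
        raise ValueError("height and width must be multiples of the spatial factor")
    latent_frames = (num_frames - 1) // temporal + 1
    return (height // spatial) * (width // spatial) * latent_frames


n13 = latent_tokens(768, 768, 13)  # 9216  (~9.2k in the table)
n17 = latent_tokens(768, 768, 17)  # 11520 (~11.5k)
n49 = latent_tokens(768, 768, 49)  # 29952 (~30k)
print(n13, n17, n49)

# Under a purely quadratic attention model, relative cost is (N_a / N_b) ** 2.
print(f"49f vs 13f attention cost: {(n49 / n13) ** 2:.1f}x")
```

Under this model the 17-frame clip costs only about 1.6× the attention of the 13-frame clip, while the measured wall-clock goes from ~145 s to ~20 min, so the quadratic term alone does not account for the timings in the table.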

Files in this repo

| File | Size | Purpose |
|---|---|---|
| model.safetensors | 12.87 GB | LLM weights (1021 tensors, both UND + GEN towers, with 126,976-entry latent_pos_embed) |
| vit.safetensors | 1.34 GB | Qwen2.5-VL ViT (semantic encoder for x2t_video) |
| vae.safetensors | 1.41 GB | Lance's bundled Wan2.2 VAE (also available standalone as mlx-community/Wan2.2-VAE-Lance-bf16) |
| config.json | | Qwen2_5_VLForConditionalGeneration config |
| conversion_report.json | | Provenance |
| tokenizer.json / vocab.json | | Qwen2.5-VL vocabulary |
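The file sizes for the LLM and ViT follow from the parameter counts given in this card (6.437 B LLM, 0.669 B ViT) at two bytes per bf16 value. The check below is a rough sketch: it assumes decimal gigabytes and ignores the safetensors header and any tensors stored at a different precision.

```python
BYTES_PER_BF16 = 2  # bfloat16 stores each value in 16 bits


def bf16_size_gb(num_params):
    """Approximate on-disk size in decimal GB for bf16 weights."""
    return num_params * BYTES_PER_BF16 / 1e9


print(round(bf16_size_gb(6.437e9), 2))  # model.safetensors: 12.87
print(round(bf16_size_gb(0.669e9), 2))  # vit.safetensors:   1.34
```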

Provenance

Source: bytedance-research/Lance/Lance_3B_Video/model.safetensors (1411 tensors including bundled ViT; 6.437 B LLM + 0.669 B ViT params). Converted via scripts/02_convert.py. The bundled ViT is extracted to a sibling vit.safetensors with the vit_model. prefix stripped, matching the layout convention of the image-specialist repo.
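The ViT extraction described above amounts to splitting the checkpoint's tensor names on the `vit_model.` prefix. The following is an illustrative sketch only, not the contents of scripts/02_convert.py; the function name `split_vit` and the example tensor names are hypothetical, and plain Python values stand in for real tensors.

```python
def split_vit(tensors, prefix="vit_model."):
    """Split a name -> tensor mapping into (llm, vit) dictionaries.

    Names starting with `prefix` go to the ViT dictionary with the prefix
    stripped; every other name stays in the LLM dictionary unchanged.
    """
    llm, vit = {}, {}
    for name, value in tensors.items():
        if name.startswith(prefix):
            vit[name[len(prefix):]] = value
        else:
            llm[name] = value
    return llm, vit


# Placeholder names and values, for illustration only.
checkpoint = {
    "vit_model.blocks.0.attn.qkv.weight": "vit-tensor",
    "language_model.layers.0.mlp.up_proj.weight": "llm-tensor",
}
llm, vit = split_vit(checkpoint)
print(sorted(vit))  # ['blocks.0.attn.qkv.weight']
print(sorted(llm))  # ['language_model.layers.0.mlp.up_proj.weight']
```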

Tips

  • Use concrete-subject prompts. "Five red apples in a bowl" works better than "the joy of friendship in motion." The model can render abstract scenes, but the painterly aesthetic on already-abstract subjects can read as overly abstract.
  • Smaller scales iterate faster. 256² × 16 frames is the fastest test config (~33 s); good for prompt iteration. Scale up once you find a prompt you like.
  • English + Chinese prompts work. Other languages are out of distribution (Qwen2.5-VL was trained primarily on en + zh).

Limitations

  • bf16 only. 4-bit + 8-bit quantization in progress (Phase 5b). Naive INT4 has been observed to degrade the GEN expert per Reza2kn/lance-quant's findings; quantization needs per-tower calibration.
  • No streaming or batched generation.
  • CFG doubles forward cost. A future KV-cache for the text + clean-ref prefix would save ~30% per step.
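The CFG cost noted above comes from evaluating the model twice per denoising step. The sketch below uses the standard classifier-free guidance combination; whether the port applies exactly this formula (or adds renormalization) is an assumption. `guided_velocity`, `cfg_combine`, and the toy model are hypothetical names for illustration, not lance-mlx API.

```python
import numpy as np


def cfg_combine(v_cond, v_uncond, scale):
    """Standard classifier-free guidance: move from uncond toward cond."""
    return v_uncond + scale * (v_cond - v_uncond)


def guided_velocity(model, latents, t, text_emb, null_emb, scale=4.0):
    """Two sequential forward passes per step, which is why CFG doubles cost."""
    v_cond = model(latents, t, text_emb)
    v_uncond = model(latents, t, null_emb)
    return cfg_combine(v_cond, v_uncond, scale)


# Toy stand-in for the GEN tower that counts how often it is called.
calls = {"n": 0}


def toy_model(latents, t, emb):
    calls["n"] += 1
    return latents * t + emb


latents = np.ones((4, 8), dtype=np.float32)
v = guided_velocity(toy_model, latents, 0.5, text_emb=1.0, null_emb=0.0)
print(calls["n"])  # 2 forward passes for a single guided step
```

Caching the keys and values of the shared prefix would avoid recomputing that part in both passes, which is the saving the KV-cache item refers to.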

Architecture (shared with the image specialist)

  • Two expert towers (LLM_UND, LLM_GEN), each initialized from Qwen2.5-VL-3B-Instruct, with per-expert FFN, output projection, and QK-norm.
  • Modality-deterministic routing: text + Qwen2.5-VL ViT semantic tokens → LLM_UND (autoregressive); Wan2.2 VAE latent tokens → LLM_GEN (flow-matching velocity prediction). No learned gate.
  • MaPE — modality-aware RoPE with per-modality temporal anchor.
  • Wan2.2 3D causal VAE (16× spatial / 4× temporal compression, 48-channel latent).
  • Bidirectional attention within latent block.
  • Untied LM head.
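The modality-deterministic routing in the list above can be pictured as a mask-based dispatch: each token carries a fixed modality label, and that label alone selects the expert. This is a minimal NumPy sketch under that reading; the label constants, `route_ffn`, and the toy expert functions are hypothetical and do not mirror the port's actual module layout.

```python
import numpy as np

UND, GEN = 0, 1  # hypothetical modality labels


def route_ffn(hidden, modality, ffn_und, ffn_gen):
    """Send each token to the expert fixed by its modality (no learned gate).

    hidden:   (seq_len, dim) token states
    modality: (seq_len,) array of UND / GEN labels
    """
    out = np.empty_like(hidden)
    und_mask = modality == UND
    out[und_mask] = ffn_und(hidden[und_mask])    # text + ViT semantic tokens
    out[~und_mask] = ffn_gen(hidden[~und_mask])  # VAE latent tokens
    return out


hidden = np.ones((5, 4), dtype=np.float32)
modality = np.array([UND, UND, GEN, GEN, GEN])
routed = route_ffn(hidden, modality, lambda x: x * 2.0, lambda x: x + 10.0)
print(routed[:, 0])  # [ 2.  2. 11. 11. 11.]
```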

License

This MLX port: Apache 2.0.

Underlying weights:

  • Lance: Apache 2.0 (ByteDance Intelligent Creation Lab).
  • Wan2.2 VAE: Apache 2.0 (Alibaba).
  • Qwen2.5-VL: Apache 2.0 (Alibaba).

See NOTICE for attribution.

Citation

@article{fu2026lance,
  title={Lance: Unified Multimodal Modeling by Multi-Task Synergy},
  author={Fu, Fengyi and Huang, Mengqi and Wu, Shaojin and others},
  journal={arXiv preprint arXiv:2605.18678},
  year={2026}
}
