# CogVideoX-2B: Ternary Quantized (tritplane3)

Ternary-quantized version of zai-org/CogVideoX-2b.

Produced with ternary-quant using component-aware tritplane3 quantization applied to the Diffusion Transformer (DiT) backbone.
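Concretely, a 3-plane progressive ternary scheme approximates each weight matrix as a sum of three scaled ternary planes, `W ≈ s1*T1 + s2*T2 + s3*T3` with `Tk ∈ {-1, 0, +1}`, where each plane quantizes the residual left by the previous one. The sketch below is a plausible reconstruction of that idea, assuming a TWN-style magnitude threshold for zeroing weights; it is illustrative only, not ternary-quant's actual `ProgressiveTritPlaneQuantizer` implementation.

```python
import torch

def ternary_plane(residual: torch.Tensor, threshold_frac: float = 0.7):
    """One ternary plane: T in {-1, 0, +1} plus a per-tensor scale.

    threshold_frac is an assumed hyperparameter: entries whose magnitude
    falls below threshold_frac * mean(|residual|) are snapped to zero.
    """
    delta = threshold_frac * residual.abs().mean()
    t = torch.zeros_like(residual)
    t[residual > delta] = 1.0
    t[residual < -delta] = -1.0
    nonzero = t.abs() > 0
    # Least-squares optimal scale for fixed T: mean |residual| over the support.
    s = residual[nonzero].abs().mean() if nonzero.any() else torch.tensor(0.0)
    return t, s

def tritplane_quantize(w: torch.Tensor, num_planes: int = 3):
    """Progressively quantize w into num_planes ternary planes."""
    residual = w.clone()
    planes, scales = [], []
    for _ in range(num_planes):
        t, s = ternary_plane(residual)
        planes.append(t)
        scales.append(s)
        residual = residual - s * t  # the next plane fits what is left over
    return planes, scales

def tritplane_dequantize(planes, scales):
    """Reconstruct an FP tensor: w_hat = sum_k s_k * T_k."""
    return sum(s * t for t, s in zip(planes, scales))
```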

## Why this matters

This is a proof-of-concept for ternary post-training quantization on a diffusers-based DiT pipeline. It should be treated as an experimental artifact, not a benchmarked replacement for FP8, int8, or other production video quantization paths.

The same component-aware workflow can be tested on other text-to-video DiT models, but each architecture needs its own validation.

## Model Specifications

| Property | Value |
|---|---|
| Base Model | zai-org/CogVideoX-2b |
| Architecture | Diffusion Transformer (`CogVideoXTransformer3DModel`) |
| Transformer Params | 1.69B |
| Quantization | tritplane3 (3-plane progressive ternary) |
| Components Quantized | 245 linear layers in the DiT (attention QKV, cross-attention, FFN, modulation) |
| Text Encoder (T5) | FP16 (preserved) |
| VAE (3D causal) | FP16 (preserved) |
| License | Apache 2.0 |
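As a sanity check, the linear layers in the DiT backbone can be enumerated directly from the pipeline. The snippet below is a rough illustration: it counts every `nn.Linear` in the transformer, while the reported figure of 245 corresponds to the subset selected by ternary-quant's component-aware policy, whose exact filters are not reproduced here.

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "zai-org/CogVideoX-2b", torch_dtype=torch.float16
)

# Count nn.Linear modules inside the DiT backbone only; the T5 text
# encoder and the 3D VAE stay in FP16 and are not considered.
linear_layers = [
    name for name, module in pipe.transformer.named_modules()
    if isinstance(module, torch.nn.Linear)
]
print(f"{len(linear_layers)} linear layers in the transformer")
```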

## Verified Working

Generated videos with the quantized pipeline:

  • Prompt: "a cat walking on green grass"
  • Resolution: 480Γ—720, 9 frames
  • Steps: 25 (recommended; 5 is too few)
  • Seed: 42
  • Device: MPS (Apple Silicon), bfloat16

Output is coherent: a cat on green grass, natural anatomy, temporal consistency. See `test_ternary_25steps.mp4` in the repo.

Quality vs FP16 original: Both produce valid outputs at 25 steps. Per-pixel PSNR is ~13 dB, which is expected: diffusion models are highly sensitive to small weight perturbations that compound over denoising steps, so pixel-level comparison between the two models is not meaningful even with the same seed.
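For reference, a per-pixel PSNR between two generated clips can be computed as below. This is an illustrative sketch (the file names are placeholders), not the exact evaluation script used here; it relies on the imageio/imageio-ffmpeg packages already listed in the Quickstart.

```python
import numpy as np
import imageio.v3 as iio

def video_psnr(path_a: str, path_b: str) -> float:
    """Per-pixel PSNR (dB) between two videos of equal shape."""
    a = iio.imread(path_a).astype(np.float64)  # (frames, H, W, C), values in [0, 255]
    b = iio.imread(path_b).astype(np.float64)
    mse = np.mean((a - b) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)

# Hypothetical file names for the FP16 and ternary outputs:
print(video_psnr("fp16_output.mp4", "ternary_output.mp4"))
```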

## Size & Compression

| Method | Transformer Size | Bits/Weight | Compression |
|---|---|---|---|
| FP16 (original) | 3.38 GB | 16 | 1.0× |
| Ternary tritplane3 (theoretical, packed) | ~1.69 GB | ~8 | 2.0× |
| FP16 (as stored in this repo) | 3.38 GB | 16 | 1.0× on disk |

**Honest note:** This repo ships the transformer with ternary-quantized weights dequantized back to FP16 for drop-in compatibility with the standard diffusers pipeline. On-disk size matches the original. The weights have ternary precision (~8 effective bits) but are stored in FP16 format. For actual 2× disk compression, the weights would need to be saved in ternary-quant's packed tritplane format, which requires a custom inference wrapper.

This proves the quantization works end-to-end without requiring users to install anything beyond standard diffusers.
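To illustrate what packed storage would look like: each ternary value needs 2 bits, so three planes cost 6 bits per weight, and per-plane scales add a small overhead, which is roughly where the ~8 effective bits/weight figure lands. The sketch below packs one ternary plane four values to a byte; it is a hand-rolled illustration, not ternary-quant's actual packed tritplane format.

```python
import numpy as np

def pack_ternary_plane(t: np.ndarray) -> np.ndarray:
    """Pack a flat array of {-1, 0, +1} into 2 bits each (4 per byte)."""
    codes = (t.astype(np.int8) + 1).astype(np.uint8)       # {-1,0,1} -> {0,1,2}
    pad = (-codes.size) % 4
    codes = np.pad(codes, (0, pad)).reshape(-1, 4)
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    return (codes << shifts).sum(axis=1).astype(np.uint8)  # 4 trits per byte

def unpack_ternary_plane(packed: np.ndarray, n: int) -> np.ndarray:
    """Inverse of pack_ternary_plane for the first n values."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (packed[:, None] >> shifts) & 0b11
    return codes.reshape(-1).astype(np.int8)[:n] - 1

plane = np.random.choice([-1, 0, 1], size=10).astype(np.int8)
assert np.array_equal(plane, unpack_ternary_plane(pack_ternary_plane(plane), 10))
```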

## Memory Requirements (Inference)

| Device | Peak Memory | Speed (25 steps, 480×720, 9 frames) |
|---|---|---|
| Apple Silicon MPS (bfloat16) | ~16 GB unified | 175 s total (7 s/step) |
| NVIDIA CUDA (bfloat16) | ~12 GB VRAM | ~60 s expected (untested) |
| CPU (bfloat16) | ~14 GB RAM | Very slow (hours); not recommended |

## Quickstart

```bash
pip install diffusers transformers accelerate tiktoken sentencepiece protobuf imageio imageio-ffmpeg
```

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "AsadIsmail/CogVideoX-2b-ternary",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)

# MPS workaround: cast float64 scheduler buffers to float32
device = "mps"  # or "cuda"
if device == "mps":
    for attr in ("alphas_cumprod", "betas", "alphas", "sigmas"):
        val = getattr(pipe.scheduler, attr, None)
        if torch.is_tensor(val) and val.dtype == torch.float64:
            setattr(pipe.scheduler, attr, val.float())

pipe.to(device)
pipe.enable_attention_slicing()

result = pipe(
    prompt="a cat walking on green grass",
    num_frames=9,
    num_inference_steps=25,   # use 25+; 5 is too few
    guidance_scale=6.0,
    height=480, width=720,
    generator=torch.Generator(device=device).manual_seed(42),
)

export_to_video(result.frames[0], "output.mp4", fps=8)
```

## Limitations

- Storage is not reduced (weights ship dequantized to FP16); see the honest note above
- Compute is not accelerated: standard FP16 GEMM, no specialized ternary kernels
- 5-step inference is too aggressive and produces anatomy artifacts (e.g., wrong eye counts); use 25+ steps
- MPS workaround required (float64 scheduler buffers must be cast to float32)
- CPU inference is impractical (3+ hours for 5 frames on an M4 Pro)

## Reproduce

The quantization pipeline is at github.com/Asad-Ismail/ternary-models/tree/main/video.

```python
import torch
from diffusers import DiffusionPipeline

# 1. Load the original FP16 pipeline
pipe = DiffusionPipeline.from_pretrained("zai-org/CogVideoX-2b", torch_dtype=torch.float16)

# 2. Extract the DiT transformer
transformer = pipe.transformer

# 3. Capture activations on a calibration forward pass
# 4. Quantize each nn.Linear using ProgressiveTritPlaneQuantizer
# 5. Replace weights with dequantized versions
# 6. Save the pipeline
```

See the full script: scripts/quantize_dit.py.
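Steps 4 and 5 amount to swapping each selected linear layer's weight for its ternary reconstruction. A minimal sketch of that loop is below, reusing the illustrative `tritplane_quantize`/`tritplane_dequantize` helpers from earlier; ternary-quant's `ProgressiveTritPlaneQuantizer` has its own API, calibration logic, and component-aware layer selection, which this does not reproduce.

```python
import torch

@torch.no_grad()
def quantize_dit_linears(transformer: torch.nn.Module, num_planes: int = 3) -> int:
    """Replace every nn.Linear weight in the DiT with its ternary reconstruction.

    Illustrative only: the real quantizer also uses activation statistics
    from a calibration pass and skips layers excluded by its policy.
    """
    count = 0
    for name, module in transformer.named_modules():
        if isinstance(module, torch.nn.Linear):
            planes, scales = tritplane_quantize(module.weight.float(), num_planes)
            w_hat = tritplane_dequantize(planes, scales)
            module.weight.copy_(w_hat.to(module.weight.dtype))
            count += 1
    return count

# n = quantize_dit_linears(pipe.transformer)
# pipe.save_pretrained("CogVideoX-2b-ternary")
```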

## Collection

Part of ternary-models: ternary-quantized VLM, multimodal, audio, and video models.

GitHub: github.com/Asad-Ismail/ternary-models | Library: github.com/Asad-Ismail/ternary-quant

## Citation

```bibtex
@software{ternary_quant,
  author = {Ismail, Asad},
  title = {ternary-quant: Post-training ternary quantization for HuggingFace generative models},
  url = {https://github.com/Asad-Ismail/ternary-quant},
  year = {2026}
}
```