# CogVideoX-2B: Ternary Quantized (tritplane3)
Ternary-quantized version of zai-org/CogVideoX-2b.
Produced with ternary-quant using component-aware tritplane3 quantization applied to the Diffusion Transformer (DiT) backbone.
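Concretely, "3-plane progressive ternary" (see the spec table below) means each weight matrix is approximated as a sum of three ternary planes, W ≈ α₁T₁ + α₂T₂ + α₃T₃ with Tᵢ ∈ {-1, 0, 1}, each plane fitted on the residual left by the previous ones. A minimal sketch of that idea; the greedy fit and the TWN-style threshold are illustrative assumptions, not ternary-quant's exact implementation:

```python
import torch

def tritplane_quantize(w: torch.Tensor, planes: int = 3):
    """Greedy progressive ternary fit: w ~= sum_i alpha_i * T_i, T_i in {-1,0,1}.
    Illustrative sketch only; thresholding and scale fitting in ternary-quant
    may differ."""
    residual = w.float().clone()
    alphas, trits = [], []
    for _ in range(planes):
        # TWN-style heuristic: keep entries above a fraction of the mean magnitude.
        delta = 0.7 * residual.abs().mean()
        t = torch.sign(residual) * (residual.abs() > delta)
        mask = t != 0
        # Least-squares scale for a ternary plane: mean magnitude of kept entries.
        alpha = residual[mask].abs().mean() if mask.any() else residual.new_zeros(())
        alphas.append(alpha)
        trits.append(t)
        residual = residual - alpha * t
    return alphas, trits
```

Dequantizing is just summing the scaled planes back up, `sum(a * t for a, t in zip(alphas, trits))`, which is exactly the form this repo ships baked into FP16.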
## Why this matters
This is a proof-of-concept for ternary post-training quantization on a diffusers-based DiT pipeline. It should be treated as an experimental artifact, not a benchmarked replacement for FP8, int8, or other production video quantization paths.
The same component-aware workflow can be tested on other text-to-video DiT models, but each architecture needs its own validation.
## Model Specifications
| Property | Value |
|---|---|
| Base Model | zai-org/CogVideoX-2b |
| Architecture | Diffusion Transformer (CogVideoXTransformer3DModel) |
| Transformer Params | 1.69B |
| Quantization | tritplane3 (3-plane progressive ternary) |
| Components Quantized | 245 linear layers in the DiT (attention QKV, cross-attention, FFN, modulation) |
| Text Encoder (T5) | FP16 (preserved) |
| VAE (3D causal) | FP16 (preserved) |
| License | Apache 2.0 |
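As a sanity check, the quantization targets can be enumerated straight from the diffusers checkpoint. A small sketch, assuming the standard diffusers layout; the exact count depends on which linear layers the quantization script selects:

```python
import torch.nn as nn
from diffusers import CogVideoXTransformer3DModel

transformer = CogVideoXTransformer3DModel.from_pretrained(
    "zai-org/CogVideoX-2b", subfolder="transformer"
)
# Component-aware selection: only nn.Linear modules inside the DiT are
# candidates; the T5 text encoder and the 3D causal VAE stay in FP16.
targets = [n for n, m in transformer.named_modules() if isinstance(m, nn.Linear)]
print(len(targets))  # ~245 under this checkpoint's selection rules
```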
## Verified Working
Generated videos with the quantized pipeline:
- Prompt: "a cat walking on green grass"
- Resolution: 480×720, 9 frames
- Steps: 25 (recommended; 5 is too few)
- Seed: 42
- Device: MPS (Apple Silicon), bfloat16
Output is coherent: it shows a cat on green grass with natural anatomy and temporal consistency. See test_ternary_25steps.mp4 in the repo.
Quality vs the FP16 original: both produce valid outputs at 25 steps. Per-pixel PSNR between the two is ~13 dB, which is expected: diffusion sampling compounds small weight perturbations over many denoising steps, so even with the same seed the two runs diverge, and pixel-level comparison is not a meaningful quality metric.
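For completeness, here is one way such a per-pixel PSNR figure can be computed between two exported frame stacks; this is a generic metric sketch, not necessarily the exact script used for the number above:

```python
import numpy as np

def video_psnr(a: np.ndarray, b: np.ndarray) -> float:
    """Per-pixel PSNR between two uint8 frame stacks of shape (T, H, W, C)."""
    assert a.shape == b.shape
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0.0 else 10.0 * np.log10(255.0 ** 2 / mse)
```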
## Size & Compression
| Method | Transformer Size | Bits/Weight | Compression |
|---|---|---|---|
| FP16 (original) | 3.38 GB | 16 | 1.0× |
| Ternary tritplane3 (theoretical, packed) | ~1.69 GB | ~8 | 2.0× |
| FP16 (as stored in this repo) | 3.38 GB | 16 | 1.0× on disk |
Honest note: This repo ships the transformer with ternary-quantized weights dequantized back to FP16 for drop-in compatibility with the standard diffusers pipeline. On-disk size matches the original. The weights have ternary precision (~8 effective bits) but are stored in FP16 format. For actual 2× disk compression, weights would need to be saved in ternary-quant's packed tritplane format (requires a custom inference wrapper).
This proves the quantization works end-to-end without requiring users to install anything beyond standard diffusers.
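The ~8 effective bits/weight is plausible from first principles: each ternary plane needs 2 bits per weight, so three planes cost 6 bits, and per-plane scales plus metadata push the effective rate toward 8. A minimal 2-bit packing sketch; ternary-quant's actual on-disk tritplane layout may differ:

```python
import torch

def pack_ternary_plane(t: torch.Tensor) -> torch.Tensor:
    """Pack one {-1, 0, 1} plane at 2 bits per weight (4 weights per byte)."""
    codes = t.flatten().to(torch.int64) + 1            # {-1,0,1} -> {0,1,2}
    pad = (-codes.numel()) % 4                          # pad to a multiple of 4
    codes = torch.cat([codes, codes.new_zeros(pad)]).view(-1, 4)
    shifts = torch.tensor([0, 2, 4, 6])                 # 2-bit field offsets
    return (codes << shifts).sum(dim=1).to(torch.uint8)
```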
## Memory Requirements (Inference)
| Device | Peak Memory | Speed (25 steps, 480×720, 9 frames) |
|---|---|---|
| Apple Silicon MPS (bfloat16) | ~16 GB unified | 175s total (7s/step) |
| NVIDIA CUDA (bfloat16) | ~12 GB VRAM | ~60s expected (untested) |
| CPU (bfloat16) | ~14 GB RAM | Very slow (hours); not recommended |
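If the CUDA peak is tight, diffusers' standard offload hooks can trade speed for memory. These are stock diffusers features, unrelated to the ternary quantization itself; `pipe` is the pipeline from the Quickstart below:

```python
# Keep only the active sub-model on the GPU; slower per step, but peak VRAM
# drops well below the full-pipeline figure. Requires accelerate (installed
# in the Quickstart's pip line).
pipe.enable_model_cpu_offload()
# Even lower memory, much slower: offload module-by-module.
# pipe.enable_sequential_cpu_offload()
```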
## Quickstart
```bash
pip install diffusers transformers accelerate tiktoken sentencepiece protobuf imageio imageio-ffmpeg
```
```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "AsadIsmail/CogVideoX-2b-ternary",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)

# MPS workaround: cast float64 scheduler buffers to float32
device = "mps"  # or "cuda"
if device == "mps":
    for attr in ("alphas_cumprod", "betas", "alphas", "sigmas"):
        val = getattr(pipe.scheduler, attr, None)
        if torch.is_tensor(val) and val.dtype == torch.float64:
            setattr(pipe.scheduler, attr, val.float())

pipe.to(device)
pipe.enable_attention_slicing()

result = pipe(
    prompt="a cat walking on green grass",
    num_frames=9,
    num_inference_steps=25,  # use 25+; 5 is too few
    guidance_scale=6.0,
    height=480,
    width=720,
    generator=torch.Generator(device=device).manual_seed(42),
)
export_to_video(result.frames[0], "output.mp4", fps=8)
```
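Note on frame counts: CogVideoX's 3D causal VAE compresses the time axis 4×, so `num_frames` values of the form 4k + 1 (9, 17, ..., 49) map cleanly onto latent frames; the base model's native length is 49 frames.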
## Limitations
- Storage not reduced (dequantized FP16 format); see the honest note above
- Compute not accelerated: standard FP16 GEMM, no specialized ternary kernels
- 5-step inference is too aggressive: anatomy artifacts (e.g., wrong eye counts). Use 25+ steps.
- MPS workaround required (float64 scheduler buffers)
- CPU inference is impractical (3+ hours for 5 frames on M4 Pro)
## Reproduce
The quantization pipeline is at github.com/Asad-Ismail/ternary-models/tree/main/video.
```python
import torch
from diffusers import DiffusionPipeline

# 1. Load the FP16 base pipeline
pipe = DiffusionPipeline.from_pretrained(
    "zai-org/CogVideoX-2b", torch_dtype=torch.float16
)

# 2. Extract the DiT transformer
transformer = pipe.transformer

# 3. Capture activations on a calibration forward pass
# 4. Quantize each nn.Linear using ProgressiveTritPlaneQuantizer
# 5. Replace weights with dequantized versions
# 6. Save the pipeline
```
See the full script: scripts/quantize_dit.py.
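Steps 4-5 boil down to a quantize-dequantize round trip over the DiT's linear layers. A minimal sketch of that glue, reusing the illustrative `tritplane_quantize` from the top of this card in place of the real `ProgressiveTritPlaneQuantizer`; the actual script's component-aware selection may be narrower than "every nn.Linear":

```python
import torch
import torch.nn as nn

@torch.no_grad()
def roundtrip_linears(transformer: nn.Module, quantize_fn) -> int:
    """Quantize each nn.Linear weight to ternary planes, then write the
    dequantized reconstruction back, so the checkpoint remains a drop-in
    FP16 replacement for the original transformer."""
    count = 0
    for module in transformer.modules():
        if isinstance(module, nn.Linear):
            alphas, trits = quantize_fn(module.weight.data)
            recon = sum(a * t for a, t in zip(alphas, trits))
            module.weight.data.copy_(recon.to(module.weight.dtype))
            count += 1
    return count

# e.g. roundtrip_linears(pipe.transformer, tritplane_quantize)
```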
## Collection
Part of ternary-models: ternary-quantized VLM, multimodal, audio, and video models.
GitHub: github.com/Asad-Ismail/ternary-models | Library: github.com/Asad-Ismail/ternary-quant
## Citation
```bibtex
@software{ternary_quant,
  author = {Ismail, Asad},
  title  = {ternary-quant: Post-training ternary quantization for HuggingFace generative models},
  url    = {https://github.com/Asad-Ismail/ternary-quant},
  year   = {2026}
}
```