# CogVideoX-2B: Ternary Quantized (tritplane3)
Ternary-quantized version of zai-org/CogVideoX-2b.
Produced with ternary-quant using component-aware tritplane3 quantization applied to the Diffusion Transformer (DiT) backbone.
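Concretely, "3-plane progressive ternary" (see the spec table below) means each weight matrix is approximated as a sum of three ternary planes, W ≈ α₁T₁ + α₂T₂ + α₃T₃ with Tᵢ ∈ {-1, 0, 1}, each plane fitted on the residual left by the previous ones. A minimal sketch of that idea; the greedy fit and the TWN-style threshold are illustrative assumptions, not ternary-quant's exact implementation:

```python
import torch

def tritplane_quantize(w: torch.Tensor, planes: int = 3):
    """Greedy progressive ternary fit: w ~= sum_i alpha_i * T_i, T_i in {-1,0,1}.
    Illustrative sketch only; thresholding and scale fitting in ternary-quant
    may differ."""
    residual = w.float().clone()
    alphas, trits = [], []
    for _ in range(planes):
        # TWN-style heuristic: keep entries above a fraction of the mean magnitude.
        delta = 0.7 * residual.abs().mean()
        t = torch.sign(residual) * (residual.abs() > delta)
        mask = t != 0
        # Least-squares scale for a ternary plane: mean magnitude of kept entries.
        alpha = residual[mask].abs().mean() if mask.any() else residual.new_zeros(())
        alphas.append(alpha)
        trits.append(t)
        residual = residual - alpha * t
    return alphas, trits
```

Dequantizing is just summing the scaled planes back up, `sum(a * t for a, t in zip(alphas, trits))`, which is exactly the form this repo ships baked into FP16.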
## Why this matters
This is a proof-of-concept for ternary post-training quantization on a diffusers-based DiT pipeline. It should be treated as an experimental artifact, not a benchmarked replacement for FP8, int8, or other production video quantization paths.
The same component-aware workflow can be tested on other text-to-video DiT models, but each architecture needs its own validation.
## Model Specifications
| Property | Value |
|---|---|
| Base Model | zai-org/CogVideoX-2b |
| Architecture | Diffusion Transformer (CogVideoXTransformer3DModel) |
| Transformer Params | 1.69B |
| Quantization | tritplane3 (3-plane progressive ternary) |
| Components Quantized | 245 linear layers in the DiT (attention QKV, cross-attention, FFN, modulation) |
| Text Encoder (T5) | FP16 (preserved) |
| VAE (3D causal) | FP16 (preserved) |
| License | Apache 2.0 |
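As a sanity check, the quantization targets can be enumerated straight from the diffusers checkpoint. A small sketch, assuming the standard diffusers layout; the exact count depends on which linear layers the quantization script selects:

```python
import torch.nn as nn
from diffusers import CogVideoXTransformer3DModel

transformer = CogVideoXTransformer3DModel.from_pretrained(
    "zai-org/CogVideoX-2b", subfolder="transformer"
)
# Component-aware selection: only nn.Linear modules inside the DiT are
# candidates; the T5 text encoder and the 3D causal VAE stay in FP16.
targets = [n for n, m in transformer.named_modules() if isinstance(m, nn.Linear)]
print(len(targets))  # ~245 under this checkpoint's selection rules
```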
## Verified Working
Generated videos with the quantized pipeline:
- Prompt: "a cat walking on green grass"
- Resolution: 480×720, 9 frames
- Steps: 25 (recommended; 5 is too few)
- Seed: 42
- Device: MPS (Apple Silicon), bfloat16
Output is coherent: it shows a cat on green grass with natural anatomy and temporal consistency. See test_ternary_25steps.mp4 in the repo.
Quality vs the FP16 original: both produce valid outputs at 25 steps. Per-pixel PSNR between the two is ~13 dB, which is expected: diffusion sampling compounds small weight perturbations over many denoising steps, so even with the same seed the two runs diverge, and pixel-level comparison is not a meaningful quality metric.
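For completeness, here is one way such a per-pixel PSNR figure can be computed between two exported frame stacks; this is a generic metric sketch, not necessarily the exact script used for the number above:

```python
import numpy as np

def video_psnr(a: np.ndarray, b: np.ndarray) -> float:
    """Per-pixel PSNR between two uint8 frame stacks of shape (T, H, W, C)."""
    assert a.shape == b.shape
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0.0 else 10.0 * np.log10(255.0 ** 2 / mse)
```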
## Size & Compression
| Method | Transformer Size | Bits/Weight | Compression |
|---|---|---|---|
| FP16 (original) | 3.38 GB | 16 | 1.0× |
| Ternary tritplane3 (theoretical, packed) | ~1.69 GB | ~8 | 2.0× |
| FP16 (as stored in this repo) | 3.38 GB | 16 | 1.0× on disk |
Honest note: This repo ships the transformer with ternary-quantized weights dequantized back to FP16 for drop-in compatibility with the standard diffusers pipeline. On-disk size matches the original. The weights have ternary precision (~8 effective bits) but are stored in FP16 format. For actual 2× disk compression, weights would need to be saved in ternary-quant's packed tritplane format (requires a custom inference wrapper).
This proves the quantization works end-to-end without requiring users to install anything beyond standard diffusers.
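The ~8 effective bits/weight is plausible from first principles: each ternary plane needs 2 bits per weight, so three planes cost 6 bits, and per-plane scales plus metadata push the effective rate toward 8. A minimal 2-bit packing sketch; ternary-quant's actual on-disk tritplane layout may differ:

```python
import torch

def pack_ternary_plane(t: torch.Tensor) -> torch.Tensor:
    """Pack one {-1, 0, 1} plane at 2 bits per weight (4 weights per byte)."""
    codes = t.flatten().to(torch.int64) + 1            # {-1,0,1} -> {0,1,2}
    pad = (-codes.numel()) % 4                          # pad to a multiple of 4
    codes = torch.cat([codes, codes.new_zeros(pad)]).view(-1, 4)
    shifts = torch.tensor([0, 2, 4, 6])                 # 2-bit field offsets
    return (codes << shifts).sum(dim=1).to(torch.uint8)
```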
## Memory Requirements (Inference)
| Device | Peak Memory | Speed (25 steps, 480×720, 9 frames) |
|---|---|---|
| Apple Silicon MPS (bfloat16) | ~16 GB unified | 175s total (7s/step) |
| NVIDIA CUDA (bfloat16) | ~12 GB VRAM | ~60s expected (untested) |
| CPU (bfloat16) | ~14 GB RAM | Very slow (hours); not recommended |
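If the CUDA peak is tight, diffusers' standard offload hooks can trade speed for memory. These are stock diffusers features, unrelated to the ternary quantization itself; `pipe` is the pipeline from the Quickstart below:

```python
# Keep only the active sub-model on the GPU; slower per step, but peak VRAM
# drops well below the full-pipeline figure. Requires accelerate (installed
# in the Quickstart's pip line).
pipe.enable_model_cpu_offload()
# Even lower memory, much slower: offload module-by-module.
# pipe.enable_sequential_cpu_offload()
```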
## Quickstart
```bash
pip install diffusers transformers accelerate tiktoken sentencepiece protobuf imageio imageio-ffmpeg
```
```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "AsadIsmail/CogVideoX-2b-ternary",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)

# MPS workaround: cast float64 scheduler buffers to float32
device = "mps"  # or "cuda"
if device == "mps":
    for attr in ("alphas_cumprod", "betas", "alphas", "sigmas"):
        val = getattr(pipe.scheduler, attr, None)
        if torch.is_tensor(val) and val.dtype == torch.float64:
            setattr(pipe.scheduler, attr, val.float())

pipe.to(device)
pipe.enable_attention_slicing()

result = pipe(
    prompt="a cat walking on green grass",
    num_frames=9,
    num_inference_steps=25,  # use 25+; 5 is too few
    guidance_scale=6.0,
    height=480,
    width=720,
    generator=torch.Generator(device=device).manual_seed(42),
)
export_to_video(result.frames[0], "output.mp4", fps=8)
```
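Note on frame counts: CogVideoX's 3D causal VAE compresses the time axis 4×, so `num_frames` values of the form 4k + 1 (9, 17, ..., 49) map cleanly onto latent frames; the base model's native length is 49 frames.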
## Limitations
- Storage not reduced (dequantized FP16 format); see the honest note above
- Compute not accelerated: standard FP16 GEMM, no specialized ternary kernels
- 5-step inference is too aggressive: anatomy artifacts (e.g., wrong eye counts). Use 25+ steps.
- MPS workaround required (float64 scheduler buffers)
- CPU inference is impractical (3+ hours for 5 frames on M4 Pro)
## Reproduce
The quantization pipeline is at github.com/Asad-Ismail/ternary-models/tree/main/video.
```python
import torch
from diffusers import DiffusionPipeline

# 1. Load the FP16 base pipeline
pipe = DiffusionPipeline.from_pretrained(
    "zai-org/CogVideoX-2b", torch_dtype=torch.float16
)

# 2. Extract the DiT transformer
transformer = pipe.transformer

# 3. Capture activations on a calibration forward pass
# 4. Quantize each nn.Linear using ProgressiveTritPlaneQuantizer
# 5. Replace weights with dequantized versions
# 6. Save the pipeline
```
See the full script: scripts/quantize_dit.py.
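Steps 4-5 boil down to a quantize-dequantize round trip over the DiT's linear layers. A minimal sketch of that glue, reusing the illustrative `tritplane_quantize` from the top of this card in place of the real `ProgressiveTritPlaneQuantizer`; the actual script's component-aware selection may be narrower than "every nn.Linear":

```python
import torch
import torch.nn as nn

@torch.no_grad()
def roundtrip_linears(transformer: nn.Module, quantize_fn) -> int:
    """Quantize each nn.Linear weight to ternary planes, then write the
    dequantized reconstruction back, so the checkpoint remains a drop-in
    FP16 replacement for the original transformer."""
    count = 0
    for module in transformer.modules():
        if isinstance(module, nn.Linear):
            alphas, trits = quantize_fn(module.weight.data)
            recon = sum(a * t for a, t in zip(alphas, trits))
            module.weight.data.copy_(recon.to(module.weight.dtype))
            count += 1
    return count

# e.g. roundtrip_linears(pipe.transformer, tritplane_quantize)
```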
## Collection
Part of ternary-models: ternary-quantized VLM, multimodal, audio, and video models.
GitHub: github.com/Asad-Ismail/ternary-models | Library: github.com/Asad-Ismail/ternary-quant
## Citation
```bibtex
@software{ternary_quant,
  author = {Ismail, Asad},
  title  = {ternary-quant: Post-training ternary quantization for HuggingFace generative models},
  url    = {https://github.com/Asad-Ismail/ternary-quant},
  year   = {2026}
}
```