GLM-5.1 REAP Family

One landing page for every quant + prune of zai-org/GLM-5.1. Pick the variant that matches your hardware. All are MIT-licensed, REAP-pruned descendants of the same 744 B bf16 base.

(Figure: GLM-5.1 REAP family provenance tree)

Pick yours in 30 seconds

| You have | Use this | Why |
|---|---|---|
| 4× RTX PRO 6000 Blackwell 96 GB (sm_120) | GLM-5.1-478B-A42B-REAP-NVFP4 | 200k context, fits with headroom |
| 4× B200 180 GB (sm_100) | GLM-5.1-478B-A42B-REAP-NVFP4 or -555B-NVFP4 | either fits; 555B = more experts |
| 8× B200 / datacenter Blackwell | GLM-5.1-555B-A14B-REAP-NVFP4 | upstream reference config (flashinfer + b12x) |
| 8× H100 / H200 (Hopper, sm_90) | GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 | broad engine support, fast on Hopper |
| 8× A100 80 GB (Ampere, sm_80) | GLM-5.1-444B-A14B-REAP or -555B-GPTQ-W4A16 | BF16 at 154 experts, or W4A16 at 192 |
| CPU / Mac / consumer GPU (llama.cpp) | GLM-5.1-555B-A14B-REAP-GGUF or -444B-GGUF | multi-quant ladder Q2 → Q8 |

Not sure which? Go with GLM-5.1-478B-A42B-REAP-NVFP4 if you have Blackwell, or GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 if you have Hopper.


Paste-run

One-liner to grab a variant before launching it on sglang:

```bash
# pick one of these
VARIANT=0xSero/GLM-5.1-478B-A42B-REAP-NVFP4            # Blackwell, 200k ctx
# VARIANT=0xSero/GLM-5.1-555B-A14B-REAP-NVFP4          # Blackwell datacenter
# VARIANT=0xSero/GLM-5.1-555B-A14B-REAP-GPTQ-W4A16     # Hopper
# VARIANT=0xSero/GLM-5.1-444B-A14B-REAP                # Ampere BF16

hf download "$VARIANT" --local-dir "./${VARIANT##*/}"
```

Each variant's card has its own exact launch recipe. The 478B-A42B card is the most detailed (full pinned pip freeze, mandatory sglang patch, flag-by-flag justification, live performance).
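Once the server is up (e.g. `python -m sglang.launch_server --model-path "./${VARIANT##*/}" --tp 4`, with the exact flags taken from the variant card), a minimal smoke test against sglang's OpenAI-compatible endpoint looks like the sketch below. Port 30000 is sglang's default, and the model name should match whatever `GET /v1/models` reports — treat both as assumptions to adjust:

```python
# Minimal smoke-test sketch against an sglang server on its default port.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="default",  # check GET /v1/models for the exact served name
    messages=[{"role": "user", "content": "Reply with one word: ready"}],
    max_tokens=8,
)
print(resp.choices[0].message.content)
```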


Full variant table

| Variant | Format | Size | Experts / layer | Activated / tok | Engines | Best hardware |
|---|---|---|---|---|---|---|
| GLM-5.1-555B-A14B-REAP | BF16 | ~1125 GB | 192 | ~14 B | sglang, vllm | Hopper, 8× 141 GB |
| GLM-5.1-444B-A14B-REAP | BF16 | ~910 GB | 154 | ~14 B | sglang, vllm | Ampere/Hopper, 8× 114 GB |
| GLM-5.1-555B-A14B-REAP-NVFP4 | NVFP4 | ~320 GB | 192 | ~14 B | sglang modelopt_fp4 | Blackwell (native), Hopper (triton path) |
| **GLM-5.1-478B-A42B-REAP-NVFP4** | **NVFP4** | **~285 GB** | **160** | **~42 B** | **sglang** | **4× 96 GB Blackwell Workstation, 200k ctx** |
| GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 | GPTQ W4A16 | ~297 GB | 192 | ~14 B | vllm, sglang | Hopper (best), Ampere works |
| GLM-5.1-555B-A14B-REAP-GGUF | GGUF (Q2–Q8) | ~348 GB | 192 | ~14 B | llama.cpp | CPU, Apple Silicon, consumer CUDA |
| GLM-5.1-444B-A14B-REAP-GGUF | GGUF (Q2–Q8) | ~283 GB | 154 | ~14 B | llama.cpp | CPU, Apple Silicon, consumer CUDA |

Bold row = latest / long-context-optimized. Note that it reports A42B (the measured activated-parameter count on the 160-expert MoE), while its 192-expert siblings follow the upstream A14B naming convention.
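For the GGUF rows, a hedged llama-cpp-python sketch (the filename below is a placeholder — substitute whichever quant rung you actually downloaded, and point at the first shard if the file is split):

```python
# Minimal llama-cpp-python sketch for the GGUF variants.
from llama_cpp import Llama

llm = Llama(
    model_path="./GLM-5.1-444B-A14B-REAP-GGUF/your-quant.gguf",  # placeholder path
    n_ctx=8192,        # start modest; raise once you know your RAM budget
    n_gpu_layers=0,    # CPU-only; set >0 to offload layers to a GPU
)
out = llm("Q: What does REAP prune? A:", max_tokens=64)
print(out["choices"][0]["text"])
```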


Technical drill-down

What is REAP pruning?

REAP ranks the experts in a MoE model by their token-weighted contribution and drops the lowest contributors. Surviving experts are renumbered in the routing gate, and quality loss is lower than with magnitude-based or random pruning.
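As an illustrative sketch of the mechanism (not the paper's implementation — the names and shapes below are invented for the example): score each expert over a calibration set, keep the top-k, and compact the router so survivors are renumbered contiguously.

```python
# Illustrative REAP-style pruning sketch, not the authors' code: rank experts
# by a pooled token-weighted contribution score, keep the top-k, renumber.
import torch

def prune_experts(router_weight: torch.Tensor, scores: torch.Tensor, keep: int):
    """router_weight: [num_experts, hidden]; scores: [num_experts]."""
    survivors = torch.topk(scores, keep).indices.sort().values
    new_router = router_weight[survivors]  # gate rows compacted to 0..keep-1
    return new_router, survivors           # survivors maps new id -> old id

# Toy numbers matching pass 1 of this family: 256 experts -> 192.
scores = torch.rand(256)        # stand-in for measured contribution scores
router = torch.randn(256, 4096)
router_192, old_ids = prune_experts(router, scores, keep=192)
assert router_192.shape == (192, 4096)
```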

Both REAP passes in this family were run locally, using pooled observations from two instrumented inference runs on GLM-5.1.

Provenance (what came from where)

| Stage | Result |
|---|---|
| zai-org/GLM-5.1 (upstream) | 744 B bf16, 256 experts/MoE layer |
| BF16 + REAP pass 1 (256 → 192) | GLM-5.1-555B-A14B-REAP |
| BF16 + REAP pass 1 (256 → 154) | GLM-5.1-444B-A14B-REAP |
| Community NVFP4 of upstream | ~434 GB NVFP4, still 256 experts |
| NVFP4 + REAP pass 1 (256 → 192) | GLM-5.1-555B-A14B-REAP-NVFP4 |
| NVFP4 + REAP pass 2 (192 → 160) | GLM-5.1-478B-A42B-REAP-NVFP4 |
| W4A16 (GPTQ) of NVFP4-192 | GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 |
| GGUF of BF16-192 / BF16-154 | GLM-5.1-555B-A14B-REAP-GGUF, GLM-5.1-444B-A14B-REAP-GGUF |

All quantizations preserve the REAP expert count of their parent; GGUF and GPTQ variants are re-quantized from the BF16 or NVFP4 base, not from each other.

Format choice: NVFP4 vs GPTQ-W4A16

Both store weights at ~4 bits. Differences:

| Property | NVFP4 | GPTQ-W4A16 |
|---|---|---|
| Weight format | float4 with per-16 FP8 scales | int4 with per-group (128) FP16 scales |
| Native hardware | Blackwell (tensor cores) | universal int4 GEMM |
| Throughput on Blackwell | higher | medium |
| Throughput on Hopper | medium (via triton) | higher (marlin kernel) |
| Engine support | sglang | sglang, vllm, TGI, tensorrt-llm |
| Accuracy | typically equal within ±0.5% | typically equal within ±0.5% |

If you're on Blackwell, prefer NVFP4. If you're on Hopper with vllm or other non-sglang engines, prefer GPTQ-W4A16.
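To make the scale-layout row concrete, here is a simplified numeric sketch (real kernels operate on packed 4-bit data with FP8/FP16 scale dtypes; the integer stand-ins below only show the arithmetic):

```python
# Simplified dequantization sketch contrasting the two scale granularities.
import torch

def dequant(q: torch.Tensor, scales: torch.Tensor, group: int) -> torch.Tensor:
    """q: [out, in] quantized values; scales: [out, in // group]."""
    return q.float() * scales.repeat_interleave(group, dim=1)

rows, cols = 8, 256
# GPTQ-W4A16: signed int4 values, one scale per group of 128 columns.
w_gptq = dequant(torch.randint(-8, 8, (rows, cols)),
                 torch.rand(rows, cols // 128), group=128)
# NVFP4: float4 values (integer stand-in here), much finer per-16 scales.
w_nvfp4 = dequant(torch.randint(-6, 7, (rows, cols)),
                  torch.rand(rows, cols // 16), group=16)
```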

Key numbers (from the 160-expert variant at 4× 96 GB Blackwell)

| Metric | Value |
|---|---|
| Context | 202,752 tokens |
| Decode tok/s @ 256 ctx | 46.5 |
| Decode tok/s @ 150k ctx | 22.4 |
| VRAM / rank | 77 GB weights + 11 GB KV pool (fp8) |
| KV pool capacity | 270k tokens |

Full measured numbers and the sampling-penalty recipe that preserves coherence at long context are on the variant card.
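As a quick back-of-envelope check on that headroom (activations, buffers, and CUDA graphs are deliberately ignored here):

```python
# Per-rank VRAM arithmetic for the 478B NVFP4 variant on 4x 96 GB Blackwell,
# using the measured numbers from the table above.
WEIGHTS_GB = 77   # NVFP4 weight shard per tensor-parallel rank
KV_POOL_GB = 11   # fp8 KV pool per rank (~270k tokens of capacity)
GPU_GB = 96

used = WEIGHTS_GB + KV_POOL_GB
print(f"used: {used} GB/rank, headroom: {GPU_GB - used} GB/rank")
# -> used: 88 GB/rank, headroom: 8 GB/rank
```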


License & citation

All variants inherit MIT from zai-org/GLM-5.1.

Citation for REAP:

```bibtex
@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025},
  eprint = {2510.13999},
  archivePrefix = {arXiv},
}
```