# GLM-5.1 REAP Family
One landing page for every quant + prune of zai-org/GLM-5.1. Pick the variant that matches your hardware. All are MIT-licensed, REAP-pruned descendants of the same 744 B bf16 base.
## Pick yours in 30 seconds
| You have | Use this | Why |
|---|---|---|
| 4× RTX PRO 6000 Blackwell 96 GB (sm_120) | GLM-5.1-478B-A42B-REAP-NVFP4 | 200 k context, fits with headroom |
| 4× B200 180 GB (sm_100) | GLM-5.1-478B-A42B-REAP-NVFP4 or -555B-NVFP4 | either fits, 555B = more experts |
| 8× B200 / datacenter Blackwell | GLM-5.1-555B-A14B-REAP-NVFP4 | upstream reference config (flashinfer + b12x) |
| 8× H100 / H200 (Hopper, sm_90) | GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 | broad engine support, fast on Hopper |
| 8× A100 80 GB (Ampere, sm_80) | GLM-5.1-444B-A14B-REAP or -555B-GPTQ-W4A16 | BF16 at 154 experts, or W4A16 at 192 |
| CPU / Mac / consumer GPU (llama.cpp) | GLM-5.1-555B-A14B-REAP-GGUF or -444B-GGUF | multi-quant ladder Q2–Q8 |
Not sure which? Go with GLM-5.1-478B-A42B-REAP-NVFP4 if you have Blackwell, or GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 if you have Hopper.
## Paste-run
One-liner to grab a variant:

```bash
# pick one of these
VARIANT=0xSero/GLM-5.1-478B-A42B-REAP-NVFP4          # Blackwell, 200k ctx
# VARIANT=0xSero/GLM-5.1-555B-A14B-REAP-NVFP4        # Blackwell datacenter
# VARIANT=0xSero/GLM-5.1-555B-A14B-REAP-GPTQ-W4A16   # Hopper
# VARIANT=0xSero/GLM-5.1-444B-A14B-REAP              # Ampere BF16
hf download "$VARIANT" --local-dir "./${VARIANT##*/}"
```
Each variant's card has its own exact launch recipe. The 478B-A42B card is the most detailed (full pinned pip freeze, mandatory sglang patch, flag-by-flag justification, live performance).
## Full variant table
| Variant | Format | Size | Experts / layer | Activated / tok | Engines | Best hardware |
|---|---|---|---|---|---|---|
| GLM-5.1-555B-A14B-REAP | BF16 | ~1125 GB | 192 | ~14 B | sglang, vllm | Hopper, 8× 141 GB |
| GLM-5.1-444B-A14B-REAP | BF16 | ~910 GB | 154 | ~14 B | sglang, vllm | Ampere/Hopper, 8× 114 GB |
| GLM-5.1-555B-A14B-REAP-NVFP4 | NVFP4 | ~320 GB | 192 | ~14 B | sglang modelopt_fp4 | Blackwell (native), Hopper (triton path) |
| **GLM-5.1-478B-A42B-REAP-NVFP4** | **NVFP4** | **~285 GB** | **160** | **~42 B** | **sglang** | **4× 96 GB Blackwell Workstation, 200k ctx** |
| GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 | GPTQ W4A16 | ~297 GB | 192 | ~14 B | vllm, sglang | Hopper (best), Ampere works |
| GLM-5.1-555B-A14B-REAP-GGUF | GGUF (Q2–Q8) | ~348 GB | 192 | ~14 B | llama.cpp | CPU, Apple Silicon, consumer CUDA |
| GLM-5.1-444B-A14B-REAP-GGUF | GGUF (Q2–Q8) | ~283 GB | 154 | ~14 B | llama.cpp | CPU, Apple Silicon, consumer CUDA |
Bold row = latest / long-context-optimized. Note that it reports A42B (the real measured activated parameter count on the 160-expert MoE), while its 192-expert siblings follow the upstream A14B branding convention.
## Technical drill-down

### What is REAP pruning?
REAP ranks the experts in a MoE model by their token-weighted contribution and drops the lowest contributors. Surviving experts are renumbered in the routing gate; the quality loss is smaller than with magnitude-based or random pruning.
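As a toy illustration only (the real REAP saliency metric is defined in the paper; here "contribution" is approximated as router-weight mass times a random proxy for expert output magnitude), the rank-drop-renumber step looks like:

```python
import numpy as np

# Toy REAP-style pruning sketch: rank experts by an approximate
# token-weighted contribution, drop the lowest, renumber the survivors.
rng = np.random.default_rng(0)
n_experts, n_tokens, n_keep = 8, 1000, 6

gate_probs = rng.dirichlet(np.ones(n_experts), size=n_tokens)  # router weights per token
out_norms = rng.uniform(0.5, 2.0, size=n_experts)              # proxy for expert output magnitude

# Token-weighted contribution of each expert
saliency = gate_probs.sum(axis=0) * out_norms

# Keep the top contributors (prune 8 -> 6 experts here)
keep = np.sort(np.argsort(saliency)[-n_keep:])

# Surviving experts are renumbered 0..n_keep-1 in the routing gate
renumber = {int(old): new for new, old in enumerate(keep)}
print(f"kept experts {keep.tolist()} -> renumbered {renumber}")
```

In the real pass this runs over recorded router statistics (the observation datasets listed below), not random proxies.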
Both REAP passes in this family were run locally using pooled observations from two instrumented inference runs on GLM-5.1:
- 0xSero/glm51-layerwise-reap-observations — per-block token-weighted metrics, full layer coverage.
- 0xSero/glm-5-special — consolidated observer state, ~85 M tokens over ~7.6 k samples.
### Provenance (what came from where)
| Stage | Result |
|---|---|
| zai-org/GLM-5.1 (upstream) | 744 B bf16, 256 experts/MoE layer |
| BF16 + REAP pass 1 (256→192) | GLM-5.1-555B-A14B-REAP |
| BF16 + REAP pass 1 (256→154) | GLM-5.1-444B-A14B-REAP |
| Community NVFP4 of upstream | ~434 GB NVFP4, still 256 experts |
| NVFP4 + REAP pass 1 (256→192) | GLM-5.1-555B-A14B-REAP-NVFP4 |
| NVFP4 + REAP pass 2 (192→160) | GLM-5.1-478B-A42B-REAP-NVFP4 |
| W4A16 (GPTQ) of NVFP4-192 | GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 |
| GGUF of BF16-192 / BF16-154 | GLM-5.1-555B-A14B-REAP-GGUF, GLM-5.1-444B-A14B-REAP-GGUF |
All quantizations preserve the REAP expert count of their parent; GGUF and GPTQ variants are re-quantized from the BF16 or NVFP4 base, not from each other.
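As a quick sanity check on the provenance and size numbers, the BF16 sizes track the fraction of experts kept, since expert weights dominate the parameter count (assuming the 744 B-parameter upstream occupies ~1488 GB at 2 bytes/param in bf16):

```python
# Back-of-envelope check: pruned size / upstream size ~= experts kept / 256,
# assuming expert weights dominate and BF16 = 2 bytes per parameter.
upstream_gb = 744 * 2  # 744 B params * 2 bytes -> ~1488 GB

for name, size_gb, experts in [
    ("555B-A14B-REAP", 1125, 192),
    ("444B-A14B-REAP", 910, 154),
]:
    size_ratio = size_gb / upstream_gb
    expert_ratio = experts / 256
    print(f"{name}: size ratio {size_ratio:.3f} vs expert ratio {expert_ratio:.3f}")
```

Both ratios agree to within about a percentage point, the residual being the dense (non-expert) weights that survive pruning untouched.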
### Format choice: NVFP4 vs GPTQ-W4A16
Both store weights at ~4 bits. Differences:
| Property | NVFP4 | GPTQ-W4A16 |
|---|---|---|
| Weight format | FP4 values with one FP8 scale per 16-element block | INT4 values with one FP16 scale per 128-element group |
| Native hardware | Blackwell (tensor cores) | Universal int4 GEMM |
| Throughput on Blackwell | higher | medium |
| Throughput on Hopper | medium (via triton) | higher (marlin kernel) |
| Engine support | sglang | sglang, vllm, TGI, tensorrt-llm |
| Accuracy | typically equal within ±0.5 % | typically equal within ±0.5 % |
If you're on Blackwell, prefer NVFP4. If you're on Hopper with vllm or other non-sglang engines, prefer GPTQ-W4A16.
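To make the W4A16 column concrete, here is a minimal sketch of that storage scheme as described in the table: int4 weights with one FP16 scale per 128-element group. Plain symmetric round-to-nearest is shown; real GPTQ additionally uses error-compensating (Hessian-aware) rounding, which is omitted here.

```python
import numpy as np

# Sketch of W4A16 storage: int4 weights, one FP16 scale per group of 128.
group = 128
w = np.random.default_rng(1).normal(0, 0.02, size=(1, 1024)).astype(np.float32)

# One symmetric scale per 128-element group, stored in FP16
w_groups = w.reshape(-1, group)
scales = (np.abs(w_groups).max(axis=1, keepdims=True) / 7).astype(np.float16)  # int4 range -8..7

# Quantize to int4 (stored packed in practice; int8 container here)
q = np.clip(np.round(w_groups / scales.astype(np.float32)), -8, 7).astype(np.int8)

# Dequantize back for the GEMM -- the "A16" part means activations stay 16-bit
w_hat = (q * scales.astype(np.float32)).reshape(w.shape)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```

NVFP4 replaces the int4 codes with FP4 values and shrinks the scaling group from 128 to 16, which is why it needs Blackwell tensor cores (or a triton fallback on Hopper) to run at full speed.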
### Key numbers (from the 160-expert variant at 4× 96 GB Blackwell)
| Metric | Value |
|---|---|
| Context | 202,752 tokens |
| Decode tok/s @ 256 ctx | 46.5 |
| Decode tok/s @ 150 k ctx | 22.4 |
| VRAM / rank | 77 GB weights + 11 GB KV pool (fp8) |
| KV pool capacity | 270 k tokens |
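Taken together, the 11 GB fp8 KV pool and its 270 k-token capacity imply roughly 40 KiB of KV cache per token per rank, a quick check (decimal GB assumed):

```python
# Implied KV-cache footprint per token, per tensor-parallel rank,
# from the table's pool size and capacity figures.
kv_pool_gb = 11
kv_capacity_tokens = 270_000

bytes_per_token = kv_pool_gb * 1e9 / kv_capacity_tokens
print(f"~{bytes_per_token / 1024:.1f} KiB of fp8 KV cache per token per rank")
```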
Full measured numbers and the sampling-penalty recipe that preserves coherence at long context are on the variant card.
## License & citation
All variants inherit MIT from zai-org/GLM-5.1.
Citation for REAP:
```bibtex
@misc{lasby2025reap,
  title         = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author        = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year          = {2025},
  eprint        = {2510.13999},
  archivePrefix = {arXiv},
}
```
## Model tree for 0xSero/GLM-5.1-REAP

Base model: zai-org/GLM-5.1