# GLM-5.1 REAP Family
One landing page for every quant + prune of zai-org/GLM-5.1. Pick the variant that matches your hardware. All are MIT-licensed, REAP-pruned descendants of the same 744 B bf16 base.
## Pick yours in 30 seconds
| You have | Use this | Why |
|---|---|---|
| 4× RTX PRO 6000 Blackwell 96 GB (sm_120) | GLM-5.1-478B-A42B-REAP-NVFP4 | 200 k context, fits with headroom |
| 4× B200 180 GB (sm_100) | GLM-5.1-478B-A42B-REAP-NVFP4 or -555B-NVFP4 | either fits, 555B = more experts |
| 8× B200 / datacenter Blackwell | GLM-5.1-555B-A14B-REAP-NVFP4 | upstream reference config (flashinfer + b12x) |
| 8× H100 / H200 (Hopper, sm_90) | GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 | broad engine support, fast on Hopper |
| 8× A100 80 GB (Ampere, sm_80) | GLM-5.1-444B-A14B-REAP or -555B-GPTQ-W4A16 | BF16 at 154 experts, or W4A16 at 192 |
| CPU / Mac / consumer GPU (llama.cpp) | GLM-5.1-555B-A14B-REAP-GGUF or -444B-GGUF | multi-quant ladder Q2–Q8 |
Not sure which? Go with GLM-5.1-478B-A42B-REAP-NVFP4 if you have Blackwell, or GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 if you have Hopper.
## Paste-run
One-liner to grab a variant:

```bash
# pick one of these
VARIANT=0xSero/GLM-5.1-478B-A42B-REAP-NVFP4          # Blackwell, 200k ctx
# VARIANT=0xSero/GLM-5.1-555B-A14B-REAP-NVFP4        # Blackwell datacenter
# VARIANT=0xSero/GLM-5.1-555B-A14B-REAP-GPTQ-W4A16   # Hopper
# VARIANT=0xSero/GLM-5.1-444B-A14B-REAP              # Ampere BF16
hf download "$VARIANT" --local-dir "./${VARIANT##*/}"
```
Each variant's card has its own exact launch recipe. The 478B-A42B card is the most detailed (full pinned pip freeze, mandatory sglang patch, flag-by-flag justification, live performance).
## Full variant table
| Variant | Format | Size | Experts / layer | Activated / tok | Engines | Best hardware |
|---|---|---|---|---|---|---|
| GLM-5.1-555B-A14B-REAP | BF16 | ~1125 GB | 192 | ~14 B | sglang, vllm | Hopper, 8× 141 GB |
| GLM-5.1-444B-A14B-REAP | BF16 | ~910 GB | 154 | ~14 B | sglang, vllm | Ampere/Hopper, 8× 114 GB |
| GLM-5.1-555B-A14B-REAP-NVFP4 | NVFP4 | ~320 GB | 192 | ~14 B | sglang modelopt_fp4 | Blackwell (native), Hopper (triton path) |
| **GLM-5.1-478B-A42B-REAP-NVFP4** | **NVFP4** | **~285 GB** | **160** | **~42 B** | **sglang** | **4× 96 GB Blackwell Workstation, 200k ctx** |
| GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 | GPTQ W4A16 | ~297 GB | 192 | ~14 B | vllm, sglang | Hopper (best), Ampere works |
| GLM-5.1-555B-A14B-REAP-GGUF | GGUF (Q2–Q8) | ~348 GB | 192 | ~14 B | llama.cpp | CPU, Apple Silicon, consumer CUDA |
| GLM-5.1-444B-A14B-REAP-GGUF | GGUF (Q2–Q8) | ~283 GB | 154 | ~14 B | llama.cpp | CPU, Apple Silicon, consumer CUDA |
Bold row = latest / long-context-optimized. Note that it reports A42B (the real measured activated parameter count on the 160-expert MoE), while its 192-expert siblings follow the upstream A14B branding convention.
## Technical drill-down

### What is REAP pruning?
REAP ranks the experts in a MoE model by their token-weighted contribution and drops the lowest contributors. Surviving experts are renumbered in the routing gate; the quality loss is smaller than with magnitude-based or random pruning.
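As a toy illustration only (the real REAP saliency metric is defined in the paper; here "contribution" is approximated as router-weight mass times a random proxy for expert output magnitude), the rank-drop-renumber step looks like:

```python
import numpy as np

# Toy REAP-style pruning sketch: rank experts by an approximate
# token-weighted contribution, drop the lowest, renumber the survivors.
rng = np.random.default_rng(0)
n_experts, n_tokens, n_keep = 8, 1000, 6

gate_probs = rng.dirichlet(np.ones(n_experts), size=n_tokens)  # router weights per token
out_norms = rng.uniform(0.5, 2.0, size=n_experts)              # proxy for expert output magnitude

# Token-weighted contribution of each expert
saliency = gate_probs.sum(axis=0) * out_norms

# Keep the top contributors (prune 8 -> 6 experts here)
keep = np.sort(np.argsort(saliency)[-n_keep:])

# Surviving experts are renumbered 0..n_keep-1 in the routing gate
renumber = {int(old): new for new, old in enumerate(keep)}
print(f"kept experts {keep.tolist()} -> renumbered {renumber}")
```

In the real pass this runs over recorded router statistics (the observation datasets listed below), not random proxies.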
Both REAP passes in this family were run locally using pooled observations from two instrumented inference runs on GLM-5.1:
- 0xSero/glm51-layerwise-reap-observations — per-block token-weighted metrics, full layer coverage.
- 0xSero/glm-5-special — consolidated observer state, ~85 M tokens over ~7.6 k samples.
### Provenance (what came from where)
| Stage | Result |
|---|---|
| zai-org/GLM-5.1 (upstream) | 744 B bf16, 256 experts/MoE layer |
| BF16 + REAP pass 1 (256→192) | GLM-5.1-555B-A14B-REAP |
| BF16 + REAP pass 1 (256→154) | GLM-5.1-444B-A14B-REAP |
| Community NVFP4 of upstream | ~434 GB NVFP4, still 256 experts |
| NVFP4 + REAP pass 1 (256→192) | GLM-5.1-555B-A14B-REAP-NVFP4 |
| NVFP4 + REAP pass 2 (192→160) | GLM-5.1-478B-A42B-REAP-NVFP4 |
| W4A16 (GPTQ) of NVFP4-192 | GLM-5.1-555B-A14B-REAP-GPTQ-W4A16 |
| GGUF of BF16-192 / BF16-154 | GLM-5.1-555B-A14B-REAP-GGUF, GLM-5.1-444B-A14B-REAP-GGUF |
All quantizations preserve the REAP expert count of their parent; GGUF and GPTQ variants are re-quantized from the BF16 or NVFP4 base, not from each other.
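As a quick sanity check on the provenance and size numbers, the BF16 sizes track the fraction of experts kept, since expert weights dominate the parameter count (assuming the 744 B-parameter upstream occupies ~1488 GB at 2 bytes/param in bf16):

```python
# Back-of-envelope check: pruned size / upstream size ~= experts kept / 256,
# assuming expert weights dominate and BF16 = 2 bytes per parameter.
upstream_gb = 744 * 2  # 744 B params * 2 bytes -> ~1488 GB

for name, size_gb, experts in [
    ("555B-A14B-REAP", 1125, 192),
    ("444B-A14B-REAP", 910, 154),
]:
    size_ratio = size_gb / upstream_gb
    expert_ratio = experts / 256
    print(f"{name}: size ratio {size_ratio:.3f} vs expert ratio {expert_ratio:.3f}")
```

Both ratios agree to within about a percentage point, the residual being the dense (non-expert) weights that survive pruning untouched.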
### Format choice: NVFP4 vs GPTQ-W4A16
Both store weights at ~4 bits. Differences:
| Property | NVFP4 | GPTQ-W4A16 |
|---|---|---|
| Weight format | FP4 values with one FP8 scale per 16-element block | INT4 values with one FP16 scale per 128-element group |
| Native hardware | Blackwell (tensor cores) | Universal int4 GEMM |
| Throughput on Blackwell | higher | medium |
| Throughput on Hopper | medium (via triton) | higher (marlin kernel) |
| Engine support | sglang | sglang, vllm, TGI, tensorrt-llm |
| Accuracy | typically equal within ±0.5 % | typically equal within ±0.5 % |
If you're on Blackwell, prefer NVFP4. If you're on Hopper with vllm or other non-sglang engines, prefer GPTQ-W4A16.
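To make the W4A16 column concrete, here is a minimal sketch of that storage scheme as described in the table: int4 weights with one FP16 scale per 128-element group. Plain symmetric round-to-nearest is shown; real GPTQ additionally uses error-compensating (Hessian-aware) rounding, which is omitted here.

```python
import numpy as np

# Sketch of W4A16 storage: int4 weights, one FP16 scale per group of 128.
group = 128
w = np.random.default_rng(1).normal(0, 0.02, size=(1, 1024)).astype(np.float32)

# One symmetric scale per 128-element group, stored in FP16
w_groups = w.reshape(-1, group)
scales = (np.abs(w_groups).max(axis=1, keepdims=True) / 7).astype(np.float16)  # int4 range -8..7

# Quantize to int4 (stored packed in practice; int8 container here)
q = np.clip(np.round(w_groups / scales.astype(np.float32)), -8, 7).astype(np.int8)

# Dequantize back for the GEMM -- the "A16" part means activations stay 16-bit
w_hat = (q * scales.astype(np.float32)).reshape(w.shape)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```

NVFP4 replaces the int4 codes with FP4 values and shrinks the scaling group from 128 to 16, which is why it needs Blackwell tensor cores (or a triton fallback on Hopper) to run at full speed.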
### Key numbers (from the 160-expert variant at 4× 96 GB Blackwell)
| Metric | Value |
|---|---|
| Context | 202,752 tokens |
| Decode tok/s @ 256 ctx | 46.5 |
| Decode tok/s @ 150 k ctx | 22.4 |
| VRAM / rank | 77 GB weights + 11 GB KV pool (fp8) |
| KV pool capacity | 270 k tokens |
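Taken together, the 11 GB fp8 KV pool and its 270 k-token capacity imply roughly 40 KiB of KV cache per token per rank, a quick check (decimal GB assumed):

```python
# Implied KV-cache footprint per token, per tensor-parallel rank,
# from the table's pool size and capacity figures.
kv_pool_gb = 11
kv_capacity_tokens = 270_000

bytes_per_token = kv_pool_gb * 1e9 / kv_capacity_tokens
print(f"~{bytes_per_token / 1024:.1f} KiB of fp8 KV cache per token per rank")
```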
Full measured numbers and the sampling-penalty recipe that preserves coherence at long context are on the variant card.
## License & citation
All variants inherit MIT from zai-org/GLM-5.1.
Citation for REAP:
```bibtex
@misc{lasby2025reap,
  title         = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author        = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year          = {2025},
  eprint        = {2510.13999},
  archivePrefix = {arXiv},
}
```
## Model tree for 0xSero/GLM-5.1-REAP

Base model: zai-org/GLM-5.1