Kimi-K2.6 RAM GGUF

Mixed-precision GGUF quantizations of moonshotai/Kimi-K2.6, produced with the RAM (Resource-Aware Mixed-precision) pipeline.

Each variant is also published as its own repo for easier pinning.

Files

| File | Size | Expert bits | Attention bits | Target hardware |
|------|------|-------------|----------------|-----------------|
| kimi-k2.6-ram-344gb.gguf | 344 GB | Q2_K | Q5_K–Q8_0 (probe-allocated) | 2× 192 GB or 1× 512 GB |
| kimi-k2.6-ram-447gb.gguf | 447 GB | Q3_K | Q5_K–Q8_0 (probe-allocated) | 2× 256 GB or 1× 512 GB |

Method

Quantization bit depths are assigned per-tensor using sensitivity probing rather than a uniform scheme. Each attention tensor receives bits proportional to how much its output diverges under quantization noise, measured across 8 random probes. Expert tensors (384 routed experts × 60 MoE layers) are quantized uniformly at Q2_K or Q3_K depending on the target size.
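
The allocation logic is roughly the following. This is a minimal sketch with illustrative function names, a synthetic round-to-nearest fake quantizer, and an assumed three-rung Q5_K/Q6_K/Q8_0 ladder; it is not the actual RAM pipeline code:

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Round-to-nearest symmetric fake quantization (a stand-in for GGUF k-quants)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(float(np.abs(w).max()), 1e-12) / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def sensitivity(w: np.ndarray, bits: int, n_probes: int = 8, seed: int = 0) -> float:
    """Mean relative output divergence of a weight matrix under quantization,
    measured on random input probes (an analogue of the ranking signal above)."""
    rng = np.random.default_rng(seed)
    wq = fake_quantize(w, bits)
    divs = []
    for _ in range(n_probes):
        x = rng.standard_normal(w.shape[1])
        y, yq = w @ x, wq @ x
        divs.append(np.linalg.norm(y - yq) / (np.linalg.norm(y) + 1e-8))
    return float(np.mean(divs))

def allocate_attention_bits(tensors: dict[str, np.ndarray]) -> dict[str, str]:
    """Map each attention tensor to a quant type: more divergence -> more bits."""
    ladder = ["Q5_K", "Q6_K", "Q8_0"]  # the Q5_K-Q8_0 range from the table above
    scores = {name: sensitivity(w, bits=5) for name, w in tensors.items()}
    lo, hi = min(scores.values()), max(scores.values())
    plan = {}
    for name, s in scores.items():
        t = 0.0 if hi == lo else (s - lo) / (hi - lo)
        plan[name] = ladder[min(int(t * len(ladder)), len(ladder) - 1)]
    return plan
```

The appeal of per-tensor probing over an imatrix (see Notes) is that it only needs matrix-vector products against individual tensors, not a calibration pass through the full model.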

Architecture: DeepSeek-V3 decoder (kimi_k2), 61 layers, 384 routed experts / 8 active per token, MLA attention, hidden_size=7168, 1.03T total parameters.
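
To see where the probe-allocated bits landed in the finished file, the gguf Python package that ships with llama.cpp can enumerate per-tensor quantization types. Attribute names below follow current gguf-py and may differ across versions:

```python
from collections import Counter
from gguf import GGUFReader  # gguf-py, distributed with llama.cpp

reader = GGUFReader("kimi-k2.6-ram-447gb.gguf")

# Count how many tensors ended up at each quantization type.
counts = Counter(t.tensor_type.name for t in reader.tensors)
for qtype, n in sorted(counts.items()):
    print(f"{qtype:>6}: {n} tensors")
```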

Usage (llama.cpp)

```bash
llama-cli \
  -m kimi-k2.6-ram-447gb.gguf \
  -c 8192 \
  --temp 0.6 \
  -p "You are a helpful assistant."
```

Notes

  • No importance matrix (imatrix) was used: the 1.1 TB Q8_0 intermediate required for imatrix generation exceeds the RAM available for inference on this machine. Bit allocation from sensitivity probing provides the primary quality signal instead.
  • The original model stores its expert weights in Neural Magic compressed-tensors INT4 format. These are dequantized and re-quantized to GGUF during conversion; a schematic dequantization sketch follows below.
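
The dequantization step amounts to expanding group-wise INT4 values by their scales. The sketch below assumes a symmetric, group-of-128 layout with values already unpacked to one integer per element, which is a simplification of the actual compressed-tensors schema:

```python
import numpy as np

def dequantize_int4_groups(q: np.ndarray, scales: np.ndarray, group_size: int = 128) -> np.ndarray:
    """Expand group-wise symmetric INT4 weights to float32.

    q:      signed INT4 values, one int per element (range -8..7)
    scales: one float scale per group of `group_size` consecutive weights
    """
    w = q.astype(np.float32).reshape(-1, group_size)
    return (w * scales.astype(np.float32).reshape(-1, 1)).reshape(-1)
```

The resulting float weights are then re-quantized to Q2_K or Q3_K as described in the Method section.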