Kimi-K2.6 RAM GGUF

Mixed-precision GGUF quantizations of moonshotai/Kimi-K2.6, produced with the RAM (Resource-Aware Mixed-precision) pipeline.

Each variant is also published as its own repo for easier pinning.

Files

| File | Size | Expert bits | Attention bits | Target hardware |
|------|------|-------------|----------------|-----------------|
| kimi-k2.6-ram-344gb.gguf | 344 GB | Q2_K | Q5_K–Q8_0 (probe-allocated) | 2× 192 GB or 1× 512 GB |
| kimi-k2.6-ram-447gb.gguf | 447 GB | Q3_K | Q5_K–Q8_0 (probe-allocated) | 2× 256 GB or 1× 512 GB |

Method

Quantization bit depths are assigned per-tensor using sensitivity probing rather than a uniform scheme. Each attention tensor receives bits proportional to how much its output diverges under quantization noise, measured across 8 random probes. Expert tensors (384 routed experts × 60 MoE layers) are quantized uniformly at Q2_K or Q3_K depending on the target size.
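
The allocation logic is roughly the following. This is a minimal sketch with illustrative function names, a synthetic round-to-nearest fake quantizer, and an assumed three-rung Q5_K/Q6_K/Q8_0 ladder; it is not the actual RAM pipeline code:

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Round-to-nearest symmetric fake quantization (a stand-in for GGUF k-quants)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(float(np.abs(w).max()), 1e-12) / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def sensitivity(w: np.ndarray, bits: int, n_probes: int = 8, seed: int = 0) -> float:
    """Mean relative output divergence of a weight matrix under quantization,
    measured on random input probes (an analogue of the ranking signal above)."""
    rng = np.random.default_rng(seed)
    wq = fake_quantize(w, bits)
    divs = []
    for _ in range(n_probes):
        x = rng.standard_normal(w.shape[1])
        y, yq = w @ x, wq @ x
        divs.append(np.linalg.norm(y - yq) / (np.linalg.norm(y) + 1e-8))
    return float(np.mean(divs))

def allocate_attention_bits(tensors: dict[str, np.ndarray]) -> dict[str, str]:
    """Map each attention tensor to a quant type: more divergence -> more bits."""
    ladder = ["Q5_K", "Q6_K", "Q8_0"]  # the Q5_K-Q8_0 range from the table above
    scores = {name: sensitivity(w, bits=5) for name, w in tensors.items()}
    lo, hi = min(scores.values()), max(scores.values())
    plan = {}
    for name, s in scores.items():
        t = 0.0 if hi == lo else (s - lo) / (hi - lo)
        plan[name] = ladder[min(int(t * len(ladder)), len(ladder) - 1)]
    return plan
```

The appeal of per-tensor probing over an imatrix (see Notes) is that it only needs matrix-vector products against individual tensors, not a calibration pass through the full model.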

Architecture: DeepSeek-V3 decoder (kimi_k2), 61 layers, 384 routed experts / 8 active per token, MLA attention, hidden_size=7168, 1.03T total parameters.
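
To see where the probe-allocated bits landed in the finished file, the gguf Python package that ships with llama.cpp can enumerate per-tensor quantization types. Attribute names below follow current gguf-py and may differ across versions:

```python
from collections import Counter
from gguf import GGUFReader  # gguf-py, distributed with llama.cpp

reader = GGUFReader("kimi-k2.6-ram-447gb.gguf")

# Count how many tensors ended up at each quantization type.
counts = Counter(t.tensor_type.name for t in reader.tensors)
for qtype, n in sorted(counts.items()):
    print(f"{qtype:>6}: {n} tensors")
```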

Usage (llama.cpp)

```bash
llama-cli \
  -m kimi-k2.6-ram-447gb.gguf \
  -c 8192 \
  --temp 0.6 \
  -p "You are a helpful assistant."
```

Notes

  • No importance matrix (imatrix) was used: the 1.1 TB Q8_0 intermediate required for imatrix generation exceeds the RAM available for inference on this machine. Bit allocation from sensitivity probing provides the primary quality signal instead.
  • The original model stores its expert weights in Neural Magic compressed-tensors INT4 format. These are dequantized and re-quantized to GGUF during conversion; a schematic dequantization sketch follows below.
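
The dequantization step amounts to expanding group-wise INT4 values by their scales. The sketch below assumes a symmetric, group-of-128 layout with values already unpacked to one integer per element, which is a simplification of the actual compressed-tensors schema:

```python
import numpy as np

def dequantize_int4_groups(q: np.ndarray, scales: np.ndarray, group_size: int = 128) -> np.ndarray:
    """Expand group-wise symmetric INT4 weights to float32.

    q:      signed INT4 values, one int per element (range -8..7)
    scales: one float scale per group of `group_size` consecutive weights
    """
    w = q.astype(np.float32).reshape(-1, group_size)
    return (w * scales.astype(np.float32).reshape(-1, 1)).reshape(-1)
```

The resulting float weights are then re-quantized to Q2_K or Q3_K as described in the Method section.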