# Kimi-K2.6 RAM GGUF
Mixed-precision GGUF quantizations of moonshotai/Kimi-K2.6, produced with the RAM (Resource-Aware Mixed-precision) pipeline.
Each variant is also published as its own repo for easier pinning:
- baa-ai/Kimi-K2.6-RAM-344GB-GGUF – Q2_K experts, lower footprint
- baa-ai/Kimi-K2.6-RAM-447GB-GGUF – Q3_K experts, higher quality
## Files
| File | Size | Expert bits | Attention bits | Target hardware |
|---|---|---|---|---|
| kimi-k2.6-ram-344gb.gguf | 344 GB | Q2_K | Q5_K–Q8_0 (probe-allocated) | 2× 192 GB or 1× 512 GB |
| kimi-k2.6-ram-447gb.gguf | 447 GB | Q3_K | Q5_K–Q8_0 (probe-allocated) | 2× 256 GB or 1× 512 GB |
## Method
Quantization bit depths are assigned per tensor using sensitivity probing rather than a uniform scheme. Each attention tensor receives bits proportional to how much its output diverges under quantization noise, measured across 8 random probes. Expert tensors (384 routed experts × 60 MoE layers) are quantized uniformly at Q2_K or Q3_K depending on the target size.
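The probing idea can be pictured as follows. This is a minimal illustrative sketch, not the actual RAM pipeline code: the function names, the bit ladder, and the linear score-to-type mapping are assumptions for illustration only.

```python
# Minimal sketch of sensitivity probing (illustrative, not the actual RAM pipeline).
# Idea: push random probe activations through each attention weight with and without
# simulated quantization noise, and give more bits to tensors whose outputs diverge most.
import torch

def sensitivity(weight: torch.Tensor, n_probes: int = 8, bits: int = 4) -> float:
    """Mean relative output divergence of `weight` under round-to-nearest quantization."""
    scale = weight.abs().max() / (2 ** (bits - 1) - 1)
    w_q = torch.round(weight / scale).clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * scale

    divergences = []
    for _ in range(n_probes):
        x = torch.randn(weight.shape[1])           # random probe activation
        y_ref, y_q = weight @ x, w_q @ x           # exact vs. quantized outputs
        divergences.append(((y_ref - y_q).norm() / y_ref.norm()).item())
    return sum(divergences) / n_probes

def allocate_bits(tensors: dict[str, torch.Tensor]) -> dict[str, str]:
    """Map each attention tensor to a GGUF type; more sensitive tensors get more bits."""
    scores = {name: sensitivity(w) for name, w in tensors.items()}
    lo, hi = min(scores.values()), max(scores.values())
    ladder = ["Q5_K", "Q6_K", "Q8_0"]              # coarse illustrative ladder
    out = {}
    for name, s in scores.items():
        rank = 0.0 if hi == lo else (s - lo) / (hi - lo)
        out[name] = ladder[min(int(rank * len(ladder)), len(ladder) - 1)]
    return out
```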
Architecture: DeepSeek-V3 decoder (kimi_k2), 61 layers, 384 routed experts / 8 active per token, MLA attention, hidden_size=7168, 1.03T total parameters.
## Usage (llama.cpp)
```bash
llama-cli \
  -m kimi-k2.6-ram-447gb.gguf \
  -c 8192 \
  --temp 0.6 \
  -p "You are a helpful assistant."
```
## Notes
- No importance matrix (imatrix) was used: the 1.1 TB Q8_0 intermediate required for imatrix generation exceeds the RAM available for inference on this machine. Bit allocation from sensitivity probing provides the primary quality signal.
- The original model uses Neural Magic compressed-tensors INT4 for expert weights. These are dequantized and re-quantized to GGUF format during conversion.
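The dequantize-then-requantize step can be pictured as below. This is a minimal sketch assuming a simple group-wise symmetric INT4 scheme with two nibbles packed per byte; the actual compressed-tensors layout and the conversion code used for these files may differ.

```python
# Minimal sketch of group-wise INT4 dequantization (illustrative only; the actual
# compressed-tensors layout and the conversion pipeline may differ).
# Assumes symmetric quantization, w ~= q * scale, with two 4-bit values per byte.
import numpy as np

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Unpack a uint8 array of packed nibbles into signed int4 values in [-8, 7]."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    vals = np.empty(packed.size * 2, dtype=np.int8)
    vals[0::2], vals[1::2] = lo, hi
    vals[vals >= 8] -= 16                      # map 8..15 -> -8..-1 (two's-complement nibble)
    return vals

def dequantize(packed: np.ndarray, scales: np.ndarray, group_size: int = 128) -> np.ndarray:
    """Recover fp16 weights from packed int4 values and per-group scales."""
    q = unpack_int4(packed).astype(np.float32).reshape(-1, group_size)
    w = q * scales.reshape(-1, 1)              # one scale per group of `group_size` weights
    return w.reshape(-1).astype(np.float16)

# The resulting fp16 tensor is then handed to llama.cpp's quantizer and re-encoded
# as Q2_K or Q3_K blocks in the output GGUF.
```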