liushaowei committed
Commit 6dead1a · Parent(s): a060a3e

update deploy guidance

docs/deploy_guidance.md CHANGED (+35 -20)
@@ -58,36 +58,51 @@ python -m sglang.launch_server --model-path $MODEL_PATH --tp 8 --trust-remote-co

## KTransformers Deployment

-###
-1. Follow the official SGLang installation guide to install SGLang:
-
-``` bash
-pip install "sglang[all]"
-```
-2. Install KTransformers CPU Kernels
-
-Download the AMX INT4 quantized weights provided by Approaching AI [coming soon] as CPU weights.
-
-### Inference
-
-``` bash
-python -m sglang.launch_server --host 0.0.0.0 --port 60000 --model /mnt/data3/models/Kimi-K2-Thinking/ --kt-amx-weight-path /mnt/data3/models/Kimi-K2-Instruct-CPU-weight/ --kt-cpuinfer 56 --kt-threadpool-count 2 --kt-num-gpu-experts 200 --kt-amx-method AMXINT4 --attention-backend triton --trust-remote-code --mem-fraction-static 0.98 --chunked-prefill-size 4096 --max-running-requests 37 --max-total-tokens 37000 --enable-mixed-chunk --tensor-parallel-size 8 --enable-p2p-check --disable-shared-experts-fusion
-```
-``` bash
-python ktransformers/server/main.py --model_path /path/to/K2 --gguf_path /path/to/K2 --cache_lens 30000
-```
-
-``` bash
+### KTransformers+SGLang Inference Deployment

+Launch with KTransformers + SGLang for CPU+GPU heterogeneous inference:

+``` bash
+python -m sglang.launch_server \
+    --model path/to/Kimi-K2-Thinking/ \
+    --kt-amx-weight-path path/to/Kimi-K2-Instruct-CPU-weight/ \
+    --kt-cpuinfer 56 \
+    --kt-threadpool-count 2 \
+    --kt-num-gpu-experts 200 \
+    --kt-amx-method AMXINT4 \
+    --trust-remote-code \
+    --mem-fraction-static 0.98 \
+    --chunked-prefill-size 4096 \
+    --max-running-requests 37 \
+    --max-total-tokens 37000 \
+    --enable-mixed-chunk \
+    --tensor-parallel-size 8 \
+    --enable-p2p-check \
+    --disable-shared-experts-fusion
+```
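
Once the server is up, a quick way to smoke-test it is an OpenAI-style chat request; a minimal sketch, assuming the default SGLang port 30000 (the command above does not override `--host`/`--port`):

``` bash
# Hypothetical smoke test; adjust host/port and the model path to your deployment.
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "path/to/Kimi-K2-Thinking/",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```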

+This configuration achieves 577.74 tokens/s prefill and 45.91 tokens/s decode (37-way concurrency) on 8× NVIDIA L20 GPUs + 2× Intel Xeon 6454S CPUs.
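
To reproduce a similar concurrency-37 measurement against the running server, SGLang's bundled serving benchmark can be used; a rough sketch (flag names can differ between SGLang versions, so confirm with `python -m sglang.bench_serving --help`):

``` bash
# Illustrative benchmark run at 37-way concurrency; prompt count and lengths are arbitrary.
python -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --num-prompts 370 \
  --random-input-len 1024 \
  --random-output-len 512 \
  --max-concurrency 37
```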

+More details: https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/Kimi-K2-Thinking.md

+### KTransformers+LLaMA-Factory Fine-tuning Deployment

+You can use the commands below to run LoRA SFT with KTransformers + LLaMA-Factory:

+``` bash
+# Run LoRA SFT
+USE_KT=1 llamafactory-cli train examples/train_lora/kimik2_lora_sft_kt.yaml
+# Chat with the model after LoRA SFT
+llamafactory-cli chat examples/inference/kimik2_lora_sft_kt.yaml
+# Serve an API with the model after LoRA SFT
+llamafactory-cli api examples/inference/kimik2_lora_sft_kt.yaml
```
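
Once `llamafactory-cli api` is running, you can exercise the fine-tuned model over its OpenAI-style API; a minimal sketch, assuming LLaMA-Factory's default API port 8000 (override with the `API_PORT` environment variable):

``` bash
# Hypothetical request; the "model" name is a placeholder, the server answers with the model loaded from the YAML.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "kimi-k2-thinking-lora",
        "messages": [{"role": "user", "content": "Summarize what LoRA fine-tuning changes in one sentence."}]
      }'
```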

+This achieves an end-to-end LoRA SFT throughput of 46.55 tokens/s on 2× NVIDIA RTX 4090 + Intel Xeon 8488C with 1.97 TB of RAM and 200 GB of swap.

+For more details, refer to https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/SFT_Installation_Guide_KimiK2.md.

## Others

Kimi-K2-Thinking reuses the `DeepSeekV3CausalLM` architecture and converts its weights into the proper shape to save redevelopment effort. To let inference engines distinguish it from DeepSeek-V3 and apply the best optimizations, we set `"model_type": "kimi_k2"` in `config.json`.
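
For example, you can confirm which architecture tag an engine will see by reading the field straight from the downloaded checkpoint (the local path is a placeholder):

``` bash
# Prints the model_type declared in config.json; expected output: kimi_k2
python -c "import json; print(json.load(open('path/to/Kimi-K2-Thinking/config.json'))['model_type'])"
```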