liushaowei committed
Commit 6dead1a · Parent(s): a060a3e

update deploy guidance

docs/deploy_guidance.md CHANGED (+35 -20)
@@ -58,36 +58,51 @@ python -m sglang.launch_server --model-path $MODEL_PATH --tp 8 --trust-remote-co

## KTransformers Deployment

-###
-1. Follow the official SGLang installation guide to install SGLang:
-
-``` bash
-pip install "sglang[all]"
-```
-2. Install KTransformers CPU Kernels
-
-Download the AMX INT4 quantized weights provided by Approaching AI [coming soon] as CPU weights.
-
-### Inference
-
-``` bash
-python -m sglang.launch_server --host 0.0.0.0 --port 60000 --model /mnt/data3/models/Kimi-K2-Thinking/ --kt-amx-weight-path /mnt/data3/models/Kimi-K2-Instruct-CPU-weight/ --kt-cpuinfer 56 --kt-threadpool-count 2 --kt-num-gpu-experts 200 --kt-amx-method AMXINT4 --attention-backend triton --trust-remote-code --mem-fraction-static 0.98 --chunked-prefill-size 4096 --max-running-requests 37 --max-total-tokens 37000 --enable-mixed-chunk --tensor-parallel-size 8 --enable-p2p-check --disable-shared-experts-fusion
-```
-``` bash
-python ktransformers/server/main.py --model_path /path/to/K2 --gguf_path /path/to/K2 --cache_lens 30000
-```
-
-``` bash
+### KTransformers+SGLang Inference Deployment

+Launch with KTransformers + SGLang for CPU+GPU heterogeneous inference:

+``` bash
+python -m sglang.launch_server \
+    --model path/to/Kimi-K2-Thinking/ \
+    --kt-amx-weight-path path/to/Kimi-K2-Instruct-CPU-weight/ \
+    --kt-cpuinfer 56 \
+    --kt-threadpool-count 2 \
+    --kt-num-gpu-experts 200 \
+    --kt-amx-method AMXINT4 \
+    --trust-remote-code \
+    --mem-fraction-static 0.98 \
+    --chunked-prefill-size 4096 \
+    --max-running-requests 37 \
+    --max-total-tokens 37000 \
+    --enable-mixed-chunk \
+    --tensor-parallel-size 8 \
+    --enable-p2p-check \
+    --disable-shared-experts-fusion
+```
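
Once the server is up, a quick way to smoke-test it is an OpenAI-style chat request; a minimal sketch, assuming the default SGLang port 30000 (the command above does not override `--host`/`--port`):

``` bash
# Hypothetical smoke test; adjust host/port and the model path to your deployment.
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "path/to/Kimi-K2-Thinking/",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```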

+This configuration achieves 577.74 tokens/s prefill and 45.91 tokens/s decode (37-way concurrency) on 8× NVIDIA L20 GPUs + 2× Intel Xeon 6454S CPUs.
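
To reproduce a similar concurrency-37 measurement against the running server, SGLang's bundled serving benchmark can be used; a rough sketch (flag names can differ between SGLang versions, so confirm with `python -m sglang.bench_serving --help`):

``` bash
# Illustrative benchmark run at 37-way concurrency; prompt count and lengths are arbitrary.
python -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --num-prompts 370 \
  --random-input-len 1024 \
  --random-output-len 512 \
  --max-concurrency 37
```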

+More details: https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/Kimi-K2-Thinking.md

+### KTransformers+LLaMA-Factory Fine-tuning Deployment

+You can use the commands below to run LoRA SFT with KTransformers + LLaMA-Factory:

+``` bash
+# Run LoRA SFT
+USE_KT=1 llamafactory-cli train examples/train_lora/kimik2_lora_sft_kt.yaml
+# Chat with the model after LoRA SFT
+llamafactory-cli chat examples/inference/kimik2_lora_sft_kt.yaml
+# Serve an API with the model after LoRA SFT
+llamafactory-cli api examples/inference/kimik2_lora_sft_kt.yaml
```
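
Once `llamafactory-cli api` is running, you can exercise the fine-tuned model over its OpenAI-style API; a minimal sketch, assuming LLaMA-Factory's default API port 8000 (override with the `API_PORT` environment variable):

``` bash
# Hypothetical request; the "model" name is a placeholder, the server answers with the model loaded from the YAML.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "kimi-k2-thinking-lora",
        "messages": [{"role": "user", "content": "Summarize what LoRA fine-tuning changes in one sentence."}]
      }'
```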

+This achieves an end-to-end LoRA SFT throughput of 46.55 tokens/s on 2× NVIDIA RTX 4090 + Intel Xeon 8488C with 1.97 TB of RAM and 200 GB of swap.

+For more details, refer to https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/SFT_Installation_Guide_KimiK2.md.

## Others

Kimi-K2-Thinking reuses the `DeepSeekV3CausalLM` architecture and converts its weights into the proper shape to save redevelopment effort. To let inference engines distinguish it from DeepSeek-V3 and apply the best optimizations, we set `"model_type": "kimi_k2"` in `config.json`.
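
For example, you can confirm which architecture tag an engine will see by reading the field straight from the downloaded checkpoint (the local path is a placeholder):

``` bash
# Prints the model_type declared in config.json; expected output: kimi_k2
python -c "import json; print(json.load(open('path/to/Kimi-K2-Thinking/config.json'))['model_type'])"
```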