liushaowei committed
Commit 6dead1a · 1 Parent(s): a060a3e

update deploy guidance

Files changed (1)
  1. docs/deploy_guidance.md +35 -20
docs/deploy_guidance.md CHANGED
@@ -58,36 +58,51 @@ python -m sglang.launch_server --model-path $MODEL_PATH --tp 8 --trust-remote-co
 
  ## KTransformers Deployment
 
- ### Environments
- 1. Follow the official SGLang installation guide to install SGLang:
 
- ``` bash
- pip install "sglang[all]"
- ```
- 2. Install KTransformers CPU Kernels
 
- The KTransformers CPU kernels (kt-kernel) provide AMX-optimized computation for hybrid inference. For detailed installation instructions and troubleshooting, refer to [the official kt-kernel installation guide](https://github.com/kvcache-ai/ktransformers/blob/main/kt-kernel/README.md).
 
- 3. Download Model
 
- Download the official KIMI weights as GPU weights.
- Download the AMX INT4 quantized weights provided by Approaching AI [coming soon] as CPU weights.
 
- ### Inference
 
- ``` bash
- python -m sglang.launch_server --host 0.0.0.0 --port 60000 --model /mnt/data3/models/Kimi-K2-Thinking/ --kt-amx-weight-path /mnt/data3/models/Kimi-K2-Instruct-CPU-weight/ --kt-cpuinfer 56 --kt-threadpool-count 2 --kt-num-gpu-experts 200 --kt-amx-method AMXINT4 --attention-backend triton --trust-remote-code --mem-fraction-static 0.98 --chunked-prefill-size 4096 --max-running-requests 37 --max-total-tokens 37000 --enable-mixed-chunk --tensor-parallel-size 8 --enable-p2p-check --disable-shared-experts-fusion
- ```
- ``` bash
- python ktransformers/server/main.py --model_path /path/to/K2 --gguf_path /path/to/K2 --cache_lens 30000
- ```
 
- To enable AMX optimization, run:
 
- ``` bash
- python ktransformers/server/main.py --model_path /path/to/K2 --gguf_path /path/to/K2 --cache_lens 30000 --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-fp8-linear-ggml-experts-serve-amx.yaml
  ```
 
  ## Others
 
  Kimi-K2-Thinking reuses the `DeepSeekV3CausalLM` architecture and converts its weights into the proper shape to save redevelopment effort. To let inference engines distinguish it from DeepSeek-V3 and apply the best optimizations, we set `"model_type": "kimi_k2"` in `config.json`.
 
 
  ## KTransformers Deployment
 
+ ### KTransformers+SGLang Inference Deployment
 
+ Launch with KTransformers + SGLang for CPU+GPU heterogeneous inference:
 
+ ``` bash
+ python -m sglang.launch_server \
+   --model path/to/Kimi-K2-Thinking/ \
+   --kt-amx-weight-path path/to/Kimi-K2-Instruct-CPU-weight/ \
+   --kt-cpuinfer 56 \
+   --kt-threadpool-count 2 \
+   --kt-num-gpu-experts 200 \
+   --kt-amx-method AMXINT4 \
+   --trust-remote-code \
+   --mem-fraction-static 0.98 \
+   --chunked-prefill-size 4096 \
+   --max-running-requests 37 \
+   --max-total-tokens 37000 \
+   --enable-mixed-chunk \
+   --tensor-parallel-size 8 \
+   --enable-p2p-check \
+   --disable-shared-experts-fusion
+ ```
 
+ This configuration achieves 577.74 tokens/s prefill and 45.91 tokens/s decode (37-way concurrency) on 8× NVIDIA L20 GPUs + 2× Intel 6454S CPUs.
 
+ More details: https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/Kimi-K2-Thinking.md
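+
+ As a quick smoke test once the server is up, you can hit its OpenAI-compatible endpoint. This is a minimal sketch, assuming the default SGLang port 30000 (the launch command above does not pass `--port`) and the `/v1/chat/completions` route; adjust host, port, and model name to your deployment:
+
+ ``` bash
+ # Hypothetical request against the launched server (port and model name are assumptions).
+ curl http://127.0.0.1:30000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "path/to/Kimi-K2-Thinking/",
+     "messages": [{"role": "user", "content": "Hello"}],
+     "max_tokens": 32
+   }'
+ ```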
 
 
+ ### KTransformers+LLaMA-Factory Fine-tuning Deployment
 
+ You can use the commands below to run LoRA SFT with KTransformers + LLaMA-Factory:
 
+ ``` bash
+ # For LoRA SFT
+ USE_KT=1 llamafactory-cli train examples/train_lora/kimik2_lora_sft_kt.yaml
+ # For Chat with model after LoRA SFT
+ llamafactory-cli chat examples/inference/kimik2_lora_sft_kt.yaml
+ # For API with model after LoRA SFT
+ llamafactory-cli api examples/inference/kimik2_lora_sft_kt.yaml
  ```
 
+ This achieves an end-to-end LoRA SFT throughput of 46.55 tokens/s on 2× NVIDIA 4090 GPUs + an Intel 8488C CPU with 1.97 TB RAM and 200 GB swap.
 
+ For more details, refer to https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/SFT_Installation_Guide_KimiK2.md.
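+
+ As an illustration of querying the fine-tuned model served by `llamafactory-cli api` (a minimal sketch, assuming LLaMA-Factory's OpenAI-compatible server listening on port 8000 via the `API_PORT` variable; the served model name is a placeholder):
+
+ ``` bash
+ # Start the API on an explicit port (assumption: API_PORT is honored by llamafactory-cli api)
+ API_PORT=8000 llamafactory-cli api examples/inference/kimik2_lora_sft_kt.yaml
+ # Query it with an OpenAI-style chat completion request (model name is a placeholder)
+ curl http://127.0.0.1:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{"model": "kimik2-lora-sft", "messages": [{"role": "user", "content": "Hello"}]}'
+ ```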
 
  ## Others
 
  Kimi-K2-Thinking reuses the `DeepSeekV3CausalLM` architecture and converts its weights into the proper shape to save redevelopment effort. To let inference engines distinguish it from DeepSeek-V3 and apply the best optimizations, we set `"model_type": "kimi_k2"` in `config.json`.
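 
  For example, a quick way to confirm which architecture tag an inference engine will see (a minimal sketch; `path/to/Kimi-K2-Thinking/` stands in for wherever the checkpoint was downloaded):
 
  ``` bash
  # Print the model_type field from the checkpoint's config.json;
  # it should report "kimi_k2" rather than DeepSeek-V3's tag.
  grep '"model_type"' path/to/Kimi-K2-Thinking/config.json
  ```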