liushaowei committed
Commit c9caddd · 1 Parent(s): f1cdb45
Files changed (2):
  1. README.md +8 -6
  2. docs/deploy_guidance.md +2 -4
README.md CHANGED
@@ -26,7 +26,7 @@ library_name: transformers
 </div>

 <p align="center">
-<b>📰&nbsp;&nbsp;<a href="https://moonshotai.github.io/Kimi-K2-Thinking/">Tech Blog</a></b>
+<b>📰&nbsp;&nbsp;<a href="https://moonshotai.github.io/Kimi-K2/thinking.html">Tech Blog</a></b>
 </p>


@@ -167,19 +167,21 @@ Deployment examples can be found in the [Model Deployment Guide](docs/deploy_guidance.md)
 Once the local inference service is up, you can interact with it through the chat endpoint:

 ```python
-def simple_chat(client: OpenAI, model_name: str):
+def simple_chat(client: openai.OpenAI, model_name: str):
     messages = [
         {"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
-        {"role": "user", "content": [{"type": "text", "text": "Please give a brief self-introduction."}]},
+        {"role": "user", "content": [{"type": "text", "text": "which one is bigger, 9.11 or 9.9? think carefully."}]},
     ]
     response = client.chat.completions.create(
         model=model_name,
         messages=messages,
         stream=False,
         temperature=1.0,
-        max_tokens=256
+        max_tokens=4096
     )
-    print(response.choices[0].message.content)
+    print(f"k2 answer: {response.choices[0].message.content}")
+    print("=====below is reasoning content======")
+    print(f"reasoning content: {response.choices[0].message.reasoning_content}")
 ```

 > [!NOTE]
@@ -232,7 +234,7 @@ def tool_call_with_client(client: OpenAI, model_name: str):
     completion = client.chat.completions.create(
         model=model_name,
         messages=messages,
-        temperature=0.6,
+        temperature=1.0,
         tools=tools, # tool list defined above
         tool_choice="auto"
     )
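The updated `simple_chat` snippet assumes an OpenAI-compatible client pointed at the local server. A minimal, self-contained sketch of how it might be driven follows; the `base_url`, `api_key`, and served model name are assumptions for a typical local vLLM/SGLang deployment, not values from this commit:

```python
import openai

# Assumed local OpenAI-compatible endpoint; adjust to your deployment.
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Kimi-K2-Thinking",  # placeholder served model name
    messages=[
        {"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
        {"role": "user", "content": "which one is bigger, 9.11 or 9.9? think carefully."},
    ],
    stream=False,
    temperature=1.0,
    max_tokens=4096,
)
message = response.choices[0].message
print(f"k2 answer: {message.content}")
# reasoning_content is a server-side extension field: it is populated only
# when the engine runs a reasoning parser (the kimi_k2 parser referenced in
# docs/deploy_guidance.md), so read it defensively.
print(f"reasoning content: {getattr(message, 'reasoning_content', None)}")
```

If `reasoning_content` comes back empty, check that the server was launched with the reasoning parser enabled, as described in the deployment guide below.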
docs/deploy_guidance.md CHANGED
@@ -3,13 +3,14 @@
 > [!Note]
 > This guide only provides some examples of deployment commands for Kimi-K2-Thinking, which may not be the optimal configuration. Since inference engines are still being updated frequently, please continue to follow the guidance from their homepages if you want to achieve better inference performance.

+> The kimi_k2 reasoning parser and other related features have been merged into vLLM/SGLang and will be available in the next release. For now, please use the nightly-build Docker image.
+

 ## vLLM Deployment

 The smallest deployment unit for Kimi-K2-Thinking INT4 weights with 256k seqlen on the mainstream H200 platform is a cluster of 8 GPUs with Tensor Parallelism (TP).
 Running parameters for this environment are provided below. For other parallelism strategies, please refer to updates of the official documents.

-Nightly version is needed for reasoning parser.

 ### Tensor Parallelism
@@ -38,9 +39,6 @@ vllm serve $MODEL_PATH \

 Similarly, here are the examples using TP in SGLang for deployment.

-Nightly version is needed for reasoning parser.
-
-
 ### Tensor Parallelism

 Here is a simple example to run TP8 on H200 in a single node:
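The diff truncates the guide's launch commands, so for orientation here is a hedged sketch of what the TP8 invocations might look like. The `--reasoning-parser kimi_k2` flag matches the parser named in the note above; `$MODEL_PATH` and the remaining flags are placeholders to verify against the vLLM and SGLang documentation, not the guide's exact commands:

```bash
# vLLM, TP8 on a single H200 node (nightly build per the note above).
vllm serve $MODEL_PATH \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --reasoning-parser kimi_k2

# SGLang equivalent (flags are assumptions; verify against SGLang docs).
python -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --tp 8 \
    --trust-remote-code \
    --reasoning-parser kimi_k2
```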