liushaowei committed
Commit · c9caddd
1 Parent(s): f1cdb45

update

Browse files:
- README.md +8 -6
- docs/deploy_guidance.md +2 -4
README.md
CHANGED

@@ -26,7 +26,7 @@ library_name: transformers
 </div>
 
 <p align="center">
-    <b>📰 <a href="https://moonshotai.github.io/Kimi-K2
+    <b>📰 <a href="https://moonshotai.github.io/Kimi-K2/thinking.html">Tech Blog</a></b>
 </p>
 
 

@@ -167,19 +167,21 @@ Deployment examples can be found in the [Model Deployment Guide](docs/deploy_guidance.md).
 Once the local inference service is up, you can interact with it through the chat endpoint:
 
 ```python
-def simple_chat(client: OpenAI, model_name: str):
+def simple_chat(client: openai.OpenAI, model_name: str):
     messages = [
         {"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
-        {"role": "user", "content": [{"type": "text", "text": "
+        {"role": "user", "content": [{"type": "text", "text": "which one is bigger, 9.11 or 9.9? think carefully."}]},
     ]
     response = client.chat.completions.create(
         model=model_name,
         messages=messages,
         stream=False,
         temperature=1.0,
-        max_tokens=
+        max_tokens=4096
     )
-    print(response.choices[0].message.content)
+    print(f"k2 answer: {response.choices[0].message.content}")
+    print("=====below is reasoning content======")
+    print(f"reasoning content: {response.choices[0].message.reasoning_content}")
 ```
 
 > [!NOTE]

@@ -232,7 +234,7 @@ def tool_call_with_client(client: OpenAI, model_name: str):
     completion = client.chat.completions.create(
         model=model_name,
         messages=messages,
-        temperature=0
+        temperature=1.0,
         tools=tools, # tool list defined above
         tool_choice="auto"
     )
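For context on the updated `simple_chat` example: it expects an OpenAI-compatible client pointed at the local inference service. Below is a minimal usage sketch; the base URL, API key, and served model name are illustrative placeholders, not values taken from this commit.

```python
import openai

# Both vLLM and SGLang expose an OpenAI-compatible endpoint; adjust
# base_url to wherever the local service is listening (8000 is vLLM's
# default port). The api_key is a dummy value for a local server.
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# "kimi-k2-thinking" is a placeholder; pass whatever model name your
# server actually registered (e.g. via --served-model-name).
simple_chat(client, "kimi-k2-thinking")
```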
docs/deploy_guidance.md
CHANGED

@@ -3,13 +3,14 @@
 > [!Note]
 > This guide only provides some examples of deployment commands for Kimi-K2-Thinking, which may not be the optimal configuration. Since inference engines are still being updated frequently, please continue to follow the guidance from their homepages if you want to achieve better inference performance.
 
+> The kimi_k2 reasoning parser and other related features have been merged into vLLM/SGLang and will be available in the next release. For now, please use the nightly build Docker image.
+
 
 ## vLLM Deployment
 
 The smallest deployment unit for Kimi-K2-Thinking INT4 weights with 256k seqlen on the mainstream H200 platform is a cluster of 8 GPUs using Tensor Parallelism (TP).
 Running parameters for this environment are provided below. For other parallelism strategies, please refer to the official documents for updates.
 
-Nightly version is needed for reasoning parser.
 
 ### Tensor Parallelism

@@ -38,9 +39,6 @@ vllm serve $MODEL_PATH \
 
 Similarly, here are examples using TP in SGLang for deployment.
 
-Nightly version is needed for reasoning parser.
-
-
 ### Tensor Parallelism
 
 Here is simple example code to run TP8 on H200 on a single node:
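The hunk header above references a `vllm serve $MODEL_PATH \` command whose body is not visible in this diff. As a hedged sketch only, TP8 launches on the nightly builds the guide calls for might look roughly like the following; every flag beyond the TP size is an assumption to verify against the current vLLM and SGLang docs, and the `kimi_k2` parser name is taken from the note added in this commit.

```bash
# Sketch of a vLLM TP8 launch (assumed flags; check the vLLM docs).
# The reasoning parser requires a nightly build until the next release.
vllm serve $MODEL_PATH \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --reasoning-parser kimi_k2

# Sketch of an equivalent SGLang TP8 launch (assumed flags).
python -m sglang.launch_server \
  --model-path $MODEL_PATH \
  --tp 8 \
  --trust-remote-code \
  --reasoning-parser kimi_k2
```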