liushaowei committed
Commit · c9caddd
1 Parent(s): f1cdb45

update

Browse files:
- README.md +8 -6
- docs/deploy_guidance.md +2 -4
README.md
CHANGED

@@ -26,7 +26,7 @@ library_name: transformers
 </div>
 
 <p align="center">
-    <b>📰 <a href="https://moonshotai.github.io/Kimi-K2
+    <b>📰 <a href="https://moonshotai.github.io/Kimi-K2/thinking.html">Tech Blog</a></b>
 </p>
 
 

@@ -167,19 +167,21 @@ Deployment examples can be found in the [Model Deployment Guide](docs/deploy_guidance.md).
 Once the local inference service is up, you can interact with it through the chat endpoint:
 
 ```python
-def simple_chat(client: OpenAI, model_name: str):
+def simple_chat(client: openai.OpenAI, model_name: str):
     messages = [
         {"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
-        {"role": "user", "content": [{"type": "text", "text": "
+        {"role": "user", "content": [{"type": "text", "text": "which one is bigger, 9.11 or 9.9? think carefully."}]},
     ]
     response = client.chat.completions.create(
         model=model_name,
         messages=messages,
         stream=False,
         temperature=1.0,
-        max_tokens=
+        max_tokens=4096
     )
-    print(response.choices[0].message.content)
+    print(f"k2 answer: {response.choices[0].message.content}")
+    print("=====below is reasoning content======")
+    print(f"reasoning content: {response.choices[0].message.reasoning_content}")
 ```
 
 > [!NOTE]

@@ -232,7 +234,7 @@ def tool_call_with_client(client: OpenAI, model_name: str):
     completion = client.chat.completions.create(
         model=model_name,
         messages=messages,
-        temperature=0
+        temperature=1.0,
         tools=tools, # tool list defined above
         tool_choice="auto"
     )
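For context on the updated `simple_chat` example: it expects an OpenAI-compatible client pointed at the local inference service. Below is a minimal usage sketch; the base URL, API key, and served model name are illustrative placeholders, not values taken from this commit.

```python
import openai

# Both vLLM and SGLang expose an OpenAI-compatible endpoint; adjust
# base_url to wherever the local service is listening (8000 is vLLM's
# default port). The api_key is a dummy value for a local server.
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# "kimi-k2-thinking" is a placeholder; pass whatever model name your
# server actually registered (e.g. via --served-model-name).
simple_chat(client, "kimi-k2-thinking")
```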
docs/deploy_guidance.md
CHANGED

@@ -3,13 +3,14 @@
 > [!Note]
 > This guide only provides some examples of deployment commands for Kimi-K2-Thinking, which may not be the optimal configuration. Since inference engines are still being updated frequently, please continue to follow the guidance from their homepages if you want to achieve better inference performance.
 
+> The kimi_k2 reasoning parser and other related features have been merged into vLLM/SGLang and will be available in the next release. For now, please use the nightly build Docker image.
+
 
 ## vLLM Deployment
 
 The smallest deployment unit for Kimi-K2-Thinking INT4 weights with 256k seqlen on the mainstream H200 platform is a cluster of 8 GPUs using Tensor Parallelism (TP).
 Running parameters for this environment are provided below. For other parallelism strategies, please refer to the official documents for updates.
 
-Nightly version is needed for reasoning parser.
 
 ### Tensor Parallelism

@@ -38,9 +39,6 @@ vllm serve $MODEL_PATH \
 
 Similarly, here are examples using TP in SGLang for deployment.
 
-Nightly version is needed for reasoning parser.
-
-
 ### Tensor Parallelism
 
 Here is simple example code to run TP8 on H200 on a single node:
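The hunk header above references a `vllm serve $MODEL_PATH \` command whose body is not visible in this diff. As a hedged sketch only, TP8 launches on the nightly builds the guide calls for might look roughly like the following; every flag beyond the TP size is an assumption to verify against the current vLLM and SGLang docs, and the `kimi_k2` parser name is taken from the note added in this commit.

```bash
# Sketch of a vLLM TP8 launch (assumed flags; check the vLLM docs).
# The reasoning parser requires a nightly build until the next release.
vllm serve $MODEL_PATH \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --reasoning-parser kimi_k2

# Sketch of an equivalent SGLang TP8 launch (assumed flags).
python -m sglang.launch_server \
  --model-path $MODEL_PATH \
  --tp 8 \
  --trust-remote-code \
  --reasoning-parser kimi_k2
```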