Instructions for using QuixiAI/DeepSeek-R1-AWQ with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use QuixiAI/DeepSeek-R1-AWQ with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="QuixiAI/DeepSeek-R1-AWQ", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("QuixiAI/DeepSeek-R1-AWQ", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("QuixiAI/DeepSeek-R1-AWQ", trust_remote_code=True)

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
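DeepSeek-R1 emits a long reasoning trace before its final answer, so it is often more convenient to stream tokens as they are generated rather than waiting for `generate()` to return. A minimal sketch using Transformers' `TextStreamer`, reusing the `tokenizer`, `model`, and `inputs` from the example above:

```python
from transformers import TextStreamer

# Print tokens to stdout as they are generated instead of waiting for the full output.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
outputs = model.generate(**inputs, streamer=streamer, max_new_tokens=512)
```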
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use QuixiAI/DeepSeek-R1-AWQ with vLLM:
Install from pip and serve the model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "QuixiAI/DeepSeek-R1-AWQ"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "QuixiAI/DeepSeek-R1-AWQ",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
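Once the server is running, any OpenAI-compatible client can talk to it. A minimal sketch with the `openai` Python package, assuming the default host and port above (the API key can be any placeholder unless the server was started with `--api-key`):

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (OpenAI-compatible API).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="QuixiAI/DeepSeek-R1-AWQ",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```

The same pattern works against the SGLang server below by changing the port to 30000.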
- SGLang
How to use QuixiAI/DeepSeek-R1-AWQ with SGLang:
Install from pip and serve the model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "QuixiAI/DeepSeek-R1-AWQ" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "QuixiAI/DeepSeek-R1-AWQ",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```

Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "QuixiAI/DeepSeek-R1-AWQ" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "QuixiAI/DeepSeek-R1-AWQ",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
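Note that the launch commands above use SGLang's defaults. The AWQ weights for DeepSeek-R1 are still on the order of a few hundred gigabytes, so in practice the model has to be sharded across several GPUs. A sketch of a multi-GPU launch, based on the flags users report in the discussion below (adjust `--tp` to your GPU count):

```shell
# Shard the model across 8 GPUs with tensor parallelism (adjust to your hardware).
python3 -m sglang.launch_server \
  --model-path "QuixiAI/DeepSeek-R1-AWQ" \
  --tp 8 \
  --trust-remote-code \
  --dtype float16 \
  --host 0.0.0.0 \
  --port 30000
```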
- Docker Model Runner
How to use QuixiAI/DeepSeek-R1-AWQ with Docker Model Runner:
```shell
docker model run hf.co/QuixiAI/DeepSeek-R1-AWQ
```
Has anyone managed to run this model with the SGLang framework?
I tried to run this model with SGLang, but it is extremely slow. Does anyone have a good configuration for running this model with SGLang?
Try running it with vLLM; it is much faster.
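For example, a launch along these lines should work (untested here, for illustration only; adjust `--tensor-parallel-size` to your GPU count and `--max-model-len` to the context you need):

```shell
# Possible vLLM launch for this AWQ checkpoint across 8 GPUs (adjust to your hardware).
vllm serve QuixiAI/DeepSeek-R1-AWQ \
  --tensor-parallel-size 8 \
  --dtype float16 \
  --trust-remote-code \
  --max-model-len 32768
```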
You can try this command with sglang==0.4.2:

```shell
python3 -m sglang.launch_server \
  --host 0.0.0.0 \
  --port 30000 \
  --model-path models/DeepSeek-R1-AWQ \
  --tp 8 \
  --enable-p2p-check \
  --trust-remote-code \
  --dtype float16 \
  --mem-fraction-static 0.95 \
  --served-model-name deepseek-r1-awq \
  --disable-cuda-graph
```

However, the results are not as expected: I get empty content for some queries, and the thinking trace is not complete.
I managed to run this model with SGLang:

```shell
python3 -m sglang.launch_server \
  --model-path /home/service/var/models/deepseek-r1-huggingface/DeepSeek-R1-AWQ/ \
  --trust-remote-code \
  --tp 8 \
  --mem-fraction-static 0.8 \
  --dtype float16 \
  --host 0.0.0.0 \
  --port 9000 \
  --disable-radix \
  --disable-custom-all-reduce \
  --log-requests \
  --cuda-graph-max-bs 16 \
  --max-total-tokens 65536
```

It runs on my 8x H20 at about 30 tokens/s. However, I have the same issue as @Eric10. I tried tuning some parameters but finally gave up; I am now trying a Q4 GGUF model instead.
With sglang 0.4.4, my decoding runs much more slowly than with the FP8 model, and there is always some strange output.
Is there a reason you must use SGLang instead of vLLM? vLLM has all the features now: MLA, MTP, a much faster fused MoE Marlin kernel, etc.