Qwen3-1.7B-GPTQ-Int4
This version of Qwen3-1.7B-GPTQ-Int4 has been converted to run on the Axera NPU using w4a16 quantization.
This model has been optimized with the following LoRA:
Compatible with Pulsar2 version: 5.2
Conversion tool links:
If you are interested in model conversion, you can try exporting the axmodel from the original repo: https://huggingface.co/Qwen/Qwen3-1.7B
Pulsar2 Link, How to Convert LLM from Huggingface to axmodel
Convert the original Hugging Face Qwen3-1.7B-GPTQ-Int4 model to axmodel, then apply w4a16 quantization to get the final axmodel for the axllm runtime.
export FLOAT_MATMUL_USE_CONV_EU=1 # AX650 only: set this environment variable before running the conversion command for better performance.
# context window size 2048; prefill chunk length 128, with history (last_kv_cache_len) groups up to 1024
pulsar2 llm_build --input_path Qwen3-1.7B-GPTQ-Int4 --output_path <your path> \
--hidden_state_type bf16 --kv_cache_len 2048 --prefill_len 128 --chip AX650 -c 1 --parallel 32 \
--last_kv_cache_len 128 --last_kv_cache_len 256 --last_kv_cache_len 384 --last_kv_cache_len 512 \
--last_kv_cache_len 640 --last_kv_cache_len 768 --last_kv_cache_len 896 --last_kv_cache_len 1024 -w s4
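The repeated `--last_kv_cache_len` flags in the command above step through multiples of the prefill length. As a minimal sketch, assuming you prefer to generate the command rather than type it, the following Python helper assembles the same argument list. The function name `build_llm_build_args` and the `out_dir` placeholder are hypothetical; the flags and values are copied from the command above.

```python
import shlex


def build_llm_build_args(input_path, output_path, kv_cache_len=2048,
                         prefill_len=128, max_history=1024, chip="AX650"):
    """Assemble the `pulsar2 llm_build` argument list shown in this README."""
    args = [
        "pulsar2", "llm_build",
        "--input_path", input_path,
        "--output_path", output_path,
        "--hidden_state_type", "bf16",
        "--kv_cache_len", str(kv_cache_len),
        "--prefill_len", str(prefill_len),
        "--chip", chip,
        "-c", "1",
        "--parallel", "32",
    ]
    # One --last_kv_cache_len flag per multiple of prefill_len, up to max_history.
    for history in range(prefill_len, max_history + 1, prefill_len):
        args += ["--last_kv_cache_len", str(history)]
    args += ["-w", "s4"]
    return args


if __name__ == "__main__":
    # "out_dir" is a placeholder for your own output path.
    print(shlex.join(build_llm_build_args("Qwen3-1.7B-GPTQ-Int4", "out_dir")))
```

Printing the command instead of executing it keeps the sketch safe to run on a machine without Pulsar2 installed; pass the list to `subprocess.run` only inside the Pulsar2 environment.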
Supported Platforms
| Chip | w4a16 (decode speed) | CMM | Flash |
|---|---|---|---|
| AX650 | 12.72 tokens/sec | 1.7 GiB | 1.9 GiB |
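To put the throughput figure in perspective, the short sketch below converts a token budget into an approximate decode time. It assumes the 12.72 tokens/sec rate from the table holds for the whole generation and ignores prefill latency; the helper name `estimated_decode_seconds` is hypothetical.

```python
def estimated_decode_seconds(num_tokens, tokens_per_second=12.72):
    """Rough decode time for num_tokens at a constant decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    for n in (128, 512, 1024):
        print(f"{n} tokens: about {estimated_decode_seconds(n):.1f} s")
```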
How to use
Install axllm
Option 1: clone the repository and run the install script:
git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh
Option 2: install with a single command (default branch: axllm):
curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash
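Option 3: download the executable built by GitHub Actions CI (suitable for users without a build environment):
If you do not have a build environment, go to:
https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm
Download the latest CI-built executable (axllm), then: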
chmod +x axllm
sudo mv axllm /usr/bin/axllm
Download the model (Hugging Face)
Create the model directory and enter it, then download into that directory:
mkdir -p AXERA-TECH/Qwen3-1.7B-GPTQ-Int4
cd AXERA-TECH/Qwen3-1.7B-GPTQ-Int4
hf download AXERA-TECH/Qwen3-1.7B-GPTQ-Int4 --local-dir .
# structure of the downloaded files
tree -L 3
.
└── AXERA-TECH
    └── Qwen3-1.7B-GPTQ-Int4
        ├── README.md
        ├── config.json
        ├── model.embed_tokens.weight.bfloat16.bin
        ├── post_config.json
        ├── qwen3_p128_l0_together.axmodel
        ...
        ├── qwen3_p128_l9_together.axmodel
        ├── qwen3_post.axmodel
        └── qwen3_tokenizer.txt
2 directories, 34 files
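Before running inference it can help to confirm that the download is complete. The following is a minimal sketch that checks a model directory for the fixed file names shown in the listing above and counts the per-layer axmodel files. The helper name `check_model_dir` is hypothetical, and the glob pattern is inferred from the two layer file names that the listing shows; the number of layer files is not asserted because the listing elides them.

```python
from pathlib import Path

# Fixed file names taken from the directory listing in this README.
REQUIRED_FILES = [
    "config.json",
    "post_config.json",
    "model.embed_tokens.weight.bfloat16.bin",
    "qwen3_post.axmodel",
    "qwen3_tokenizer.txt",
]


def check_model_dir(model_dir):
    """Return (missing fixed files, sorted per-layer axmodel file names)."""
    root = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    # Pattern inferred from qwen3_p128_l0_together.axmodel in the listing.
    layers = sorted(p.name for p in root.glob("qwen3_p128_l*_together.axmodel"))
    return missing, layers


if __name__ == "__main__":
    missing, layers = check_model_dir("AXERA-TECH/Qwen3-1.7B-GPTQ-Int4")
    print("missing files:", missing or "none")
    print("layer axmodels found:", len(layers))
```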
Inference on an AX650 host, such as the M4N-Dock (爱芯派Pro) or the AX650N demo board
Run (CLI)
(base) root@ax650:~# axllm run AXERA-TECH/Qwen3-1.7B-GPTQ-Int4/
14:39:39.955 INF Init:890 | LLM init start
tokenizer_type = 1
96% | ############################## | 30 / 31 [3.85s<3.98s, 7.79 count/s] init post axmodel ok,remain_cmm(8230 MB)
14:39:43.809 INF Init:1045 | max_token_len : 2048
14:39:43.809 INF Init:1048 | kv_cache_size : 1024, kv_cache_num: 2048
14:39:43.809 INF init_groups_from_model:606 | prefill_token_num : 128
14:39:43.809 INF init_groups_from_model:820 | decode grp: 0, gid: 0, max_token_len : 2048
14:39:43.809 INF init_groups_from_model:824 | prefill grp: 0, gid: 1, history_cap: 0, total_cap: 128, symbolic_cap: 1
14:39:43.809 INF init_groups_from_model:824 | prefill grp: 1, gid: 2, history_cap: 128, total_cap: 256, symbolic_cap: 128
14:39:43.809 INF init_groups_from_model:824 | prefill grp: 2, gid: 3, history_cap: 256, total_cap: 384, symbolic_cap: 256
14:39:43.809 INF init_groups_from_model:824 | prefill grp: 3, gid: 4, history_cap: 384, total_cap: 512, symbolic_cap: 384
14:39:43.809 INF init_groups_from_model:824 | prefill grp: 4, gid: 5, history_cap: 512, total_cap: 640, symbolic_cap: 512
14:39:43.809 INF init_groups_from_model:824 | prefill grp: 5, gid: 6, history_cap: 640, total_cap: 768, symbolic_cap: 640
14:39:43.809 INF init_groups_from_model:824 | prefill grp: 6, gid: 7, history_cap: 768, total_cap: 896, symbolic_cap: 768
14:39:43.809 INF init_groups_from_model:824 | prefill grp: 7, gid: 8, history_cap: 896, total_cap: 1024, symbolic_cap: 896
14:39:43.809 INF init_groups_from_model:824 | prefill grp: 8, gid: 9, history_cap: 1024, total_cap: 1152, symbolic_cap: 1024
14:39:43.809 INF init_groups_from_model:831 | prefill_max_token_num : 1152
14:39:43.809 INF Init:27 | LLaMaEmbedSelector use mmap
100% | ################################ | 31 / 31 [3.85s<3.85s, 8.04 count/s] embed_selector init ok
14:39:43.810 INF load_config:282 | load config:
14:39:43.810 INF load_config:282 | {
14:39:43.810 INF load_config:282 | "enable_repetition_penalty": false,
14:39:43.810 INF load_config:282 | "enable_temperature": false,
14:39:43.810 INF load_config:282 | "enable_top_k_sampling": false,
14:39:43.810 INF load_config:282 | "enable_top_p_sampling": false,
14:39:43.810 INF load_config:282 | "penalty_window": 20,
14:39:43.810 INF load_config:282 | "repetition_penalty": 1.2,
14:39:43.810 INF load_config:282 | "temperature": 0.9,
14:39:43.810 INF load_config:282 | "top_k": 10,
14:39:43.810 INF load_config:282 | "top_p": 0.8
14:39:43.810 INF load_config:282 | }
14:39:43.810 INF Init:1139 | LLM init ok
Commands:
/q, /exit 退出
/reset 重置 kvcache
/dd 删除一轮对话
/pp 打印历史对话
Ctrl+C: 停止当前生成
----------------------------------------
prompt >> who are you
14:39:51.617 INF SetKVCache:1437 | decode_grpid:0 prefill_grpid:1 history_cap:0 total_cap:128 symbolic_cap:1 precompute_len:0 input_num_token:23 prefer_symbolic_group:0
14:39:51.617 INF SetKVCache:1458 | current prefill_max_token_num:1152
14:39:51.713 INF SetKVCache:1462 | first run
14:39:51.715 INF Run:1553 | input token num : 23, prefill_split_num : 1
14:39:51.715 INF Run:1640 | prefill chunk p=0 history_len=0 grpid=1 kv_cache_num=0 input_tokens=23
14:39:51.715 INF Run:1665 | prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=0
14:39:51.866 INF Run:1837 | ttft: 151.12 ms
<think>
Okay, the user asked, "Who are you?" I need to respond appropriately. Let me think.
First, I should acknowledge their question and clarify my role. I'm an AI assistant, so I should mention that. But I need to keep it friendly and not too technical. Also, I should mention that I can help with various tasks, like answering questions, writing, or providing information. But I need to make sure not to mention any specific functions that might be too detailed. Also, I should avoid using markdown and keep the response natural.
Wait, the user might be testing if I'm a real person or an AI. I should clarify that I'm an AI assistant, not a human. But I should also mention that I can assist with various tasks. Need to keep it positive and helpful. Also, avoid any mention of specific functions that might be too detailed. Let me structure the response: greet them, state I'm an AI assistant, mention I can help with tasks, and offer to assist. Keep it concise and friendly.
</think>
I'm an AI assistant here to help you with questions, tasks, and more! 😊 I can answer your questions, write content, and assist with various tasks. Just let me know what you need!
14:40:12.236 NTC Run:2102 | hit eos,decode avg 12.57 token/s
14:40:12.236 INF GetKVCache:1408 | precompute_len:280, remaining:872
prompt >> /q
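The numbers in the log above follow a regular pattern: each prefill group adds 128 tokens of history, the largest group caps prefill at 1152 tokens, and `remaining` equals that cap minus `precompute_len`. The sketch below reproduces these values. It is an interpretation of the log output, not the axllm implementation; the function names are hypothetical, and the ceiling-division rule for `prefill_split_num` is an assumption that happens to match the single logged case (23 input tokens, 1 chunk).

```python
import math


def prefill_groups(prefill_len=128, max_history=1024):
    """Rebuild the prefill group table printed by init_groups_from_model."""
    groups = []
    for grp, history_cap in enumerate(range(0, max_history + 1, prefill_len)):
        groups.append({
            "grp": grp,
            "gid": grp + 1,  # gid 0 is the decode group in the log
            "history_cap": history_cap,
            "total_cap": history_cap + prefill_len,
        })
    return groups


def prefill_split_num(input_tokens, prefill_len=128):
    """Assumed rule: the prompt is prefetched in chunks of prefill_len tokens."""
    return max(1, math.ceil(input_tokens / prefill_len))


def remaining_tokens(precompute_len, prefill_max_token_num=1152):
    """Tokens left before the prefill cap, as reported by GetKVCache."""
    return prefill_max_token_num - precompute_len


if __name__ == "__main__":
    for g in prefill_groups():
        print(g)
    print("chunks for 23 tokens:", prefill_split_num(23))
    print("remaining after 280 tokens:", remaining_tokens(280))
```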
Start the server (OpenAI-compatible)
(base) root@ax650:~# axllm serve AXERA-TECH/Qwen3-1.7B-GPTQ-Int4/
14:41:17.619 INF Init:890 | LLM init start
tokenizer_type = 1
96% | ############################## | 30 / 31 [2.60s<2.69s, 11.54 count/s] init post axmodel ok,remain_cmm(8230 MB)
14:41:20.219 INF Init:1045 | max_token_len : 2048
14:41:20.219 INF Init:1048 | kv_cache_size : 1024, kv_cache_num: 2048
14:41:20.219 INF init_groups_from_model:606 | prefill_token_num : 128
14:41:20.219 INF init_groups_from_model:820 | decode grp: 0, gid: 0, max_token_len : 2048
14:41:20.219 INF init_groups_from_model:824 | prefill grp: 0, gid: 1, history_cap: 0, total_cap: 128, symbolic_cap: 1
14:41:20.219 INF init_groups_from_model:824 | prefill grp: 1, gid: 2, history_cap: 128, total_cap: 256, symbolic_cap: 128
14:41:20.219 INF init_groups_from_model:824 | prefill grp: 2, gid: 3, history_cap: 256, total_cap: 384, symbolic_cap: 256
14:41:20.219 INF init_groups_from_model:824 | prefill grp: 3, gid: 4, history_cap: 384, total_cap: 512, symbolic_cap: 384
14:41:20.219 INF init_groups_from_model:824 | prefill grp: 4, gid: 5, history_cap: 512, total_cap: 640, symbolic_cap: 512
14:41:20.219 INF init_groups_from_model:824 | prefill grp: 5, gid: 6, history_cap: 640, total_cap: 768, symbolic_cap: 640
14:41:20.219 INF init_groups_from_model:824 | prefill grp: 6, gid: 7, history_cap: 768, total_cap: 896, symbolic_cap: 768
14:41:20.219 INF init_groups_from_model:824 | prefill grp: 7, gid: 8, history_cap: 896, total_cap: 1024, symbolic_cap: 896
14:41:20.219 INF init_groups_from_model:824 | prefill grp: 8, gid: 9, history_cap: 1024, total_cap: 1152, symbolic_cap: 1024
14:41:20.219 INF init_groups_from_model:831 | prefill_max_token_num : 1152
14:41:20.219 INF Init:27 | LLaMaEmbedSelector use mmap
100% | ################################ | 31 / 31 [2.60s<2.60s, 11.92 count/s] embed_selector init ok
14:41:20.220 INF load_config:282 | load config:
14:41:20.220 INF load_config:282 | {
14:41:20.220 INF load_config:282 | "enable_repetition_penalty": false,
14:41:20.220 INF load_config:282 | "enable_temperature": false,
14:41:20.220 INF load_config:282 | "enable_top_k_sampling": false,
14:41:20.220 INF load_config:282 | "enable_top_p_sampling": false,
14:41:20.220 INF load_config:282 | "penalty_window": 20,
14:41:20.220 INF load_config:282 | "repetition_penalty": 1.2,
14:41:20.220 INF load_config:282 | "temperature": 0.9,
14:41:20.220 INF load_config:282 | "top_k": 10,
14:41:20.220 INF load_config:282 | "top_p": 0.8
14:41:20.220 INF load_config:282 | }
14:41:20.220 INF Init:1139 | LLM init ok
Starting server on port 8000 with model 'AXERA-TECH/Qwen3-1.7B-GPTQ-Int4'...
API URLs:
GET http://127.0.0.1:8000/health
GET http://127.0.0.1:8000/v1/models
POST http://127.0.0.1:8000/v1/chat/completions
GET http://10.126.29.54:8000/health
GET http://10.126.29.54:8000/v1/models
POST http://10.126.29.54:8000/v1/chat/completions
GET http://172.17.0.1:8000/health
GET http://172.17.0.1:8000/v1/models
POST http://172.17.0.1:8000/v1/chat/completions
Aliases:
GET http://127.0.0.1:8000/models
POST http://127.0.0.1:8000/chat/completions
GET http://10.126.29.54:8000/models
POST http://10.126.29.54:8000/chat/completions
GET http://172.17.0.1:8000/models
POST http://172.17.0.1:8000/chat/completions
OpenAI API Server starting on http://0.0.0.0:8000
Max concurrency: 1
Models: AXERA-TECH/Qwen3-1.7B-GPTQ-Int4
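Once the server is up, the `GET` endpoints listed in the startup output can be probed with the Python standard library alone. This is a minimal sketch: the paths come from the log above, while the helper names `endpoint_urls` and `get_text` are hypothetical, and the response bodies are printed as raw text because their exact format is not documented here.

```python
import urllib.request


def endpoint_urls(host="127.0.0.1", port=8000):
    """Build the endpoint URLs that the server prints at startup."""
    base = f"http://{host}:{port}"
    return {
        "health": f"{base}/health",
        "models": f"{base}/v1/models",
        "chat": f"{base}/v1/chat/completions",
    }


def get_text(url, timeout=5.0):
    """GET a URL and return the response body decoded as UTF-8 text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    urls = endpoint_urls()
    for name in ("health", "models"):
        try:
            print(name, "->", get_text(urls[name]))
        except OSError as exc:  # covers URLError and connection failures
            print(name, "-> request failed:", exc)
```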
OpenAI API call example
from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3-1.7B-GPTQ-Int4"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

# The client requires a non-empty api_key; this example passes a placeholder.
client = OpenAI(api_key="not-needed", base_url=API_URL)
completion = client.chat.completions.create(
    model=MODEL,
    messages=messages,
)
print(completion.choices[0].message.content)
OpenAI streaming call example
from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3-1.7B-GPTQ-Int4"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

client = OpenAI(api_key="not-needed", base_url=API_URL)
stream = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    stream=True,
)

print("assistant:")
for ev in stream:
    # Each event carries an incremental delta; print only non-empty text.
    delta = getattr(ev.choices[0], "delta", None)
    if delta and getattr(delta, "content", None):
        print(delta.content, end="", flush=True)
print("")
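The CLI transcript earlier shows the model emitting its reasoning inside `<think>...</think>` before the final answer. If the server returns the same tags inline in the message content (an assumption, not confirmed by this README), a small helper like the following can separate the reasoning from the answer. The name `split_reasoning` is hypothetical.

```python
import re

# Non-greedy match so multiple <think> blocks are handled separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer) from text that may contain <think> blocks."""
    reasoning = "\n".join(part.strip() for part in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer


if __name__ == "__main__":
    sample = "<think>\nThe user greeted me.\n</think>\nHello! How can I help?"
    reasoning, answer = split_reasoning(sample)
    print("reasoning:", reasoning)
    print("answer:", answer)
```

For streamed responses, accumulate the `delta.content` pieces into one string first and call the helper once the stream ends, since a tag may be split across chunks.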