Running on 2x AI Pro R9700s via ROCm 7 and nightly vLLM with this.
Testing this as the startup command; so far so good. Still stress-testing to find failure points, and I still need to find the max context size.
Max viable context for a single user is ~160k on 2x R9700s.
The model maintained coherence through a single 160k-token prompt test, averaging ~600 tok/s prompt processing (PP) and ~24 tok/s token generation (TG) for the stress-test response.
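For a rough cross-check of those numbers, a single timed request against the OpenAI-compatible endpoint works (assuming jq and bc are installed, and using the port and served model name from the unit below); it lumps prompt processing and generation into one wall-clock figure, so treat it as a ballpark only:
START=$(date +%s.%N)
RESP=$(curl -s http://localhost:8079/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3-Next", "messages": [{"role": "user", "content": "Write a 500-word summary of the history of GPUs."}], "max_tokens": 1024}')
END=$(date +%s.%N)
echo "$RESP" | jq '.usage'                      # prompt_tokens / completion_tokens
echo "elapsed: $(echo "$END - $START" | bc) s"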
Updated command that has been completely stable and coherent under huge pressure:
[Unit]
Description=vLLM Docker Container - Qwen3-Next-80B-Int4-GPTQ (Production - Stable)
After=docker.service
Requires=docker.service
[Service]
Type=simple
Restart=always
RestartSec=10
ExecStartPre=/usr/bin/docker rm -f vllm-qwen3-next
ExecStart=/usr/bin/docker run --rm --name vllm-qwen3-next \
  --network=host \
  --group-add=video \
  --ipc=host \
  --shm-size=32gb \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --privileged \
  --device /dev/kfd \
  --device /dev/dri \
  -e HSA_OVERRIDE_GFX_VERSION=12.0.1 \
  -e VLLM_USE_V1=1 \
  -v /mnt/shared-drive/Models/jart25/Qwen3-Next-80B-A3B-Instruct-Int4-GPTQ:/app/models \
  my-vllm-rocm7-built:latest \
  vllm serve /app/models \
  --quantization gptq \
  --dtype float16 \
  --tensor-parallel-size 2 \
  --enforce-eager \
  --enable-chunked-prefill \
  --max-num-batched-tokens 2048 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --max-model-len 150000 \
  --max-num-seqs 2 \
  --gpu-memory-utilization 0.92 \
  --host 0.0.0.0 \
  --port 8079 \
  --served-model-name Qwen3-Next \
  --trust-remote-code \
  --disable-log-requests
ExecStop=/usr/bin/docker stop vllm-qwen3-next
[Install]
WantedBy=multi-user.target
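Assuming the unit is saved as /etc/systemd/system/vllm-qwen3-next.service (the filename is just an example), the usual systemd workflow applies:
sudo systemctl daemon-reload
sudo systemctl enable --now vllm-qwen3-next.service
journalctl -u vllm-qwen3-next.service -f   # follow the vLLM startup logs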
EDIT:
With the updated nightly published yesterday, CUDA graphs now work with TP 2 on gfx1201. Running my test suite for benchmarking LLM throughput, I now get the following:
All tests use ROCm 7, but only the Qwen3-Next-Graphs entry has CUDA graphs enabled via the most recent nightly.
Limiting power to 225 W per card, as in this test, retains about 98% of the PP throughput seen at the stock 300 W limit, with zero loss in TG.
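For reference, the per-card power cap can be set with rocm-smi (values in watts; exact flag names can differ between ROCm releases, so treat this as a sketch):
sudo rocm-smi -d 0 --setpoweroverdrive 225
sudo rocm-smi -d 1 --setpoweroverdrive 225
rocm-smi --showmaxpower   # show the maximum power each GPU will draw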
The model is awesome... once prefix caching is enabled, this is going to be an amazing local setup paired with the KG/RAG DB I keep in RAM for crazy-fast searches!
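When that lands, enabling it should just be a matter of adding vLLM's prefix-caching flag to the serve arguments above (untested here on gfx1201):
--enable-prefix-caching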
There's still another ~30% of speed missing due to no Triton or TunableOp capability on gfx1201 yet. I patched Ray to do it a few nights ago but kept running into invalid memory access errors after 60-70 minutes of the vLLM benchmark running.
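For anyone who wants to reproduce the benchmark side, recent vLLM builds ship a serving benchmark subcommand; flag names shift between versions and the tokenizer path below just mirrors the host path from the unit above, so this is only a sketch:
vllm bench serve \
  --base-url http://localhost:8079 \
  --model Qwen3-Next \
  --tokenizer /mnt/shared-drive/Models/jart25/Qwen3-Next-80B-A3B-Instruct-Int4-GPTQ \
  --dataset-name random \
  --random-input-len 8192 \
  --random-output-len 512 \
  --num-prompts 32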
Tool calls working correctly to custom MCP.
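A minimal way to verify tool calling end to end, independent of any MCP client, is to hit the OpenAI-compatible endpoint directly with a tools array (the get_weather function here is made up purely for illustration):
curl -s http://localhost:8079/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-Next",
    "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto"
  }' | jq '.choices[0].message.tool_calls'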
docker run -d \
  --name vllm1 \
  --restart unless-stopped \
  --shm-size '48gb' \
  --network host \
  --ipc host \
  --privileged \
  --cap-add SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --group-add video \
  -e HSA_OVERRIDE_GFX_VERSION=11.0.0 \
  -e VLLM_USE_V1=1 \
  -e VLLM_V1_USE_PREFILL_DECODE_ATTENTION=0 \
  --device /dev/kfd:/dev/kfd \
  --device /dev/dri:/dev/dri \
  -v ./models:/models \
  vllm-rocm-251007 \
  vllm serve \
  "/models/Qwen3-Next-80B-A3B-Instruct-w4g128" \
  --gpu-memory-utilization "0.95" \
  --max_model_len "262144" \
  -tp "4" \
  --served-model-name "QWEN3_80" \
  --port "80" \
  --disable-log-requests \
  --dtype=float16 \
  --tool-call-parser "hermes" \
  --chat-template "/chat-template-tools.jinja" \
  --override-generation-config '{"temperature": 0.7, "top_p": 0.8, "top_k": 20, "repetition_penalty": 1.05}' \
  --enable-auto-tool-choice \
  --compilation-config '{"full_cuda_graph": true}' \
  --no-enable-chunked-prefill
This is the command I use to run the model.
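A quick sanity check that the server is up and serving under the expected name (port 80 and QWEN3_80, per the command above):
curl -s http://localhost:80/v1/models | jq '.data[].id'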
Thanks jart, I'll see if the flags make a difference.
--chat-template "/chat-template-tools.jinja"
--compilation-config '{"full_cuda_graph": true}'
--no-enable-chunked-prefill
So far my startup command is performing flawlessly.
Appreciate you posting this; finding a quant to run on these cards was... challenging.
The chat template is from Unsloth:
{%- if tools %}
{{- '<|im_start|>system\n' }}
{%- if messages[0].role == 'system' %}
{{- messages[0].content + '\n\n' }}
{%- endif %}
{{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within XML tags:\n" }}
{%- for tool in tools %}
{{- "\n" }}
{{- tool | tojson }}
{%- endfor %}
{{- "\n\n\nFor each function call, return a json object with function name and arguments within XML tags:\n\n{"name": , "arguments": }\n<|im_end|>\n" }}
{%- else %}
{%- if messages[0].role == 'system' %}
{{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
{%- set index = (messages|length - 1) - loop.index0 %}
{%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
{%- set ns.multi_step_tool = false %}
{%- set ns.last_query_index = index %}
{%- endif %}
{%- endfor %}
{%- for message in messages %}
{%- if message.content is string %}
{%- set content = message.content %}
{%- else %}
{%- set content = '' %}
{%- endif %}
{%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
{{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
{%- elif message.role == "assistant" %}
{%- set reasoning_content = '' %}
{%- if message.reasoning_content is string %}
{%- set reasoning_content = message.reasoning_content %}
{%- else %}
{%- if '</think>' in content %}
{%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
{%- set content = content.split('</think>')[-1].lstrip('\n') %}
{%- endif %}
{%- endif %}
{%- if loop.index0 > ns.last_query_index %}
{%- if loop.last or (not loop.last and reasoning_content) %}
{{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + content }}
{%- endif %}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + content }}
{%- endif %}
{%- if message.tool_calls %}
{%- for tool_call in message.tool_calls %}
{%- if (loop.first and content) or (not loop.first) %}
{{- '\n' }}
{%- endif %}
{%- if tool_call.function %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{{- '\n{"name": "' }}
{{- tool_call.name }}
{{- '", "arguments": ' }}
{%- if tool_call.arguments is string %}
{{- tool_call.arguments }}
{%- else %}
{{- tool_call.arguments | tojson }}
{%- endif %}
{{- '}\n</tool_call>' }}
{%- endfor %}
{%- endif %}
{{- '<|im_end|>\n' }}
{%- elif message.role == "tool" %}
{%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
{{- '<|im_start|>user' }}
{%- endif %}
{{- '\n<tool_response>\n' }}
{{- content }}
{{- '\n</tool_response>' }}
{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
{{- '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}
{%- endif %}
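One note on wiring this up: the serve command above references /chat-template-tools.jinja inside the container, while the docker run only mounts ./models, so presumably the template file has to be mounted in as well; a guess at the extra flag (host path assumed):
  -v ./chat-template-tools.jinja:/chat-template-tools.jinja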
It was quite a challenge; I ran into the same problem running the MoEs (Mixtures of Experts): the limitations of ROCm without AITER.
ROCm without AITER barely has support for quantizations; this is the only quant that, along with IsoTropy, has stayed stable. It involved figuring out how the fused_moe kernels were executed, among other adventures, so I had to fix it, hahaha. Anything you need, I'm here for you :)
Thanks, and just FYI:
gfx1201 on ROCm 7 crashes when generating CUDA graphs; ho-hum, kind of expected.
Thus far this is hands down the absolute best model for my hardware: the right parameter count, fast enough, and tool calls without failures.
Amazing. I'll see how it performs as an agent over the weekend, but based on what I've seen thus far... it'll do great.
Please post the log and the full docker run command.
Removing this, as the problem was that CUDA graphs and the previous version of rocm/vllm I was using were fundamentally incompatible.

