Random crashes of the model.

#2
by kuliev-vitaly - opened

docker run --log-opt max-size=10m --log-opt max-file=1 --rm -it --gpus '"device=0,1"' -p 8000:8000 -e VLLM_SKIP_SPECIAL_TOKENS=true vllm/vllm-openai:v0.11.0 --model QuantTrio/Qwen3-VL-235B-A22B-Thinking-AWQ --gpu-memory-utilization 0.95 --max-model-len 100000 --trust-remote-code --swap-space 32 --tensor-parallel-size 2 --port 8000 --max-num-seqs 32 --tokenizer-mode auto --no-enable-prefix-caching --enforce-eager

I have a server with two NVIDIA A800 80 GB GPUs and start the Docker container on both of them. Model quality is great, and the model works fine through the API, but it crashes randomly. GPU OOM errors were fixed earlier: about 2-3 GB of GPU RAM remains available, and RAM and disk space are also available.
I have tried the following options (none of them solved the problem):
--no-enable-prefix-caching
--enforce-eager
--swap-space
--enable-expert-parallel

What is the correct way to start vLLM in Docker for this model?

(APIServer pid=1) INFO: 10.162.70.98:52782 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 10-07 04:22:38 [loggers.py:127] Engine 000: Avg prompt throughput: 301.9 tokens/s, Avg generation throughput: 133.8 tokens/s, Running: 20 reqs, Waiting: 0 reqs, GPU KV cache usage: 75.6%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 10-07 04:22:48 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 168.0 tokens/s, Running: 20 reqs, Waiting: 0 reqs, GPU KV cache usage: 77.1%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO: 10.162.70.98:36264 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(Worker_TP0 pid=309) INFO 10-07 04:22:53 [multiproc_executor.py:558] Parent process exited, terminating worker
(APIServer pid=1) ERROR 10-07 04:22:53 [core_client.py:564] Engine core proc EngineCore_DP0 died unexpectedly, shutting down client.
(Worker_TP0 pid=309) INFO 10-07 04:22:53 [multiproc_executor.py:599] WorkerProc shutting down.
(Worker_TP1 pid=310) INFO 10-07 04:22:53 [multiproc_executor.py:558] Parent process exited, terminating worker
(Worker_TP1 pid=310) INFO 10-07 04:22:53 [multiproc_executor.py:599] WorkerProc shutting down.
(APIServer pid=1) ERROR 10-07 04:22:54 [async_llm.py:480] AsyncLLM output_handler failed.
(APIServer pid=1) ERROR 10-07 04:22:54 [async_llm.py:480] Traceback (most recent call last):
(APIServer pid=1) ERROR 10-07 04:22:54 [async_llm.py:480] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 439, in output_handler
(APIServer pid=1) ERROR 10-07 04:22:54 [async_llm.py:480] outputs = await engine_core.get_output_async()
(APIServer pid=1) ERROR 10-07 04:22:54 [async_llm.py:480] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 10-07 04:22:54 [async_llm.py:480] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 846, in get_output_async
(APIServer pid=1) ERROR 10-07 04:22:54 [async_llm.py:480] raise self._format_exception(outputs) from None
(APIServer pid=1) ERROR 10-07 04:22:54 [async_llm.py:480] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=1) INFO: 10.162.70.98:54825 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=1) INFO: 10.162.70.98:58764 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=1) INFO: 10.162.70.98:44199 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=1) INFO: 10.162.70.98:39835 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
....
(APIServer pid=1) INFO: 10.162.70.98:30339 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=1) INFO: 10.162.70.98:4354 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=1) INFO: 10.162.70.98:47856 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=1) INFO: 10.162.70.98:22956 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=1) INFO: 10.162.70.98:58694 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=1) INFO: 10.162.70.98:39044 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=1) INFO: Shutting down
(APIServer pid=1) INFO: Waiting for application shutdown.
(APIServer pid=1) INFO: Application shutdown complete.

QuantTrio org
edited Oct 7

emmm,
(1) Have you tried lowering --gpu-memory-utilization from 0.95 to 0.90 or 0.85?
(2) I think 80 GB is more than enough to serve an A3B model with a 110k context. Could you try serving it with one card and check whether the issue persists?
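For reference, suggestion (1) applied to the command from the original post could look like this (a sketch; the lowered value is illustrative, and the rest of the flags are copied unchanged):

```shell
# Same invocation as in the original post, with --gpu-memory-utilization
# lowered from 0.95 to 0.90 (illustrative; try 0.85 if crashes persist)
docker run --log-opt max-size=10m --log-opt max-file=1 --rm -it \
  --gpus '"device=0,1"' -p 8000:8000 \
  -e VLLM_SKIP_SPECIAL_TOKENS=true \
  vllm/vllm-openai:v0.11.0 \
  --model QuantTrio/Qwen3-VL-235B-A22B-Thinking-AWQ \
  --gpu-memory-utilization 0.90 \
  --max-model-len 100000 --trust-remote-code --swap-space 32 \
  --tensor-parallel-size 2 --port 8000 --max-num-seqs 32 \
  --tokenizer-mode auto --no-enable-prefix-caching --enforce-eager
```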

But in general, I think this needs to be reported to vLLM directly (along with your system info and Python environment).

Yes, I tried lowering gpu-memory-utilization; the problem still exists. The current utilization gives about 100k of context on two 80 GB GPUs. The model weights are 125 GB, so one 80 GB GPU is not enough.
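For context, a rough per-GPU memory budget under tensor parallelism (back-of-the-envelope integer arithmetic; the 125 GB weight size is taken from the message above, the rest are assumed values):

```shell
# Rough per-GPU memory budget with --tensor-parallel-size 2 and
# --gpu-memory-utilization 0.95 (integer GB, so numbers are approximate)
WEIGHTS_GB=125   # total AWQ weight size reported above
TP=2             # tensor-parallel size
GPU_GB=80        # A800 memory per card
UTIL_PCT=95      # --gpu-memory-utilization 0.95, as a percentage

per_gpu_weights=$((WEIGHTS_GB / TP))      # weights sharded across cards
usable=$((GPU_GB * UTIL_PCT / 100))       # memory vLLM is allowed to use
kv_budget=$((usable - per_gpu_weights))   # left for KV cache, activations, etc.

echo "weights/GPU: ${per_gpu_weights} GB, usable: ${usable} GB, KV budget: ${kv_budget} GB"
```

This illustrates why one 80 GB card cannot hold the weights, and why two cards leave only a modest KV-cache budget, consistent with the roughly 100k-token context limit reported above.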

QuantTrio org
edited Oct 7

My bad, I thought this was the 30B A3B repo 🥲
Anyway, I haven't encountered this issue.
In fact, I felt that the nightly versions from around Sep 28 were better than the official v0.11.0,
so maybe you can do a quick check on those versions:
https://github.com/vllm-project/vllm/tags

maybe start with v0.11.0rc2
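Checking a specific tagged image could look like this (assuming the v0.11.0rc2 tag is published on Docker Hub; otherwise pick a nightly image built around that date, and reuse the full flag set from the original post):

```shell
# Pull and run a specific pre-release tag instead of :v0.11.0 or :nightly
docker pull vllm/vllm-openai:v0.11.0rc2
docker run --rm -it --gpus '"device=0,1"' -p 8000:8000 \
  vllm/vllm-openai:v0.11.0rc2 \
  --model QuantTrio/Qwen3-VL-235B-A22B-Thinking-AWQ \
  --tensor-parallel-size 2 --max-model-len 100000 --trust-remote-code
```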

QuantTrio org

But yeah, this should definitely be reported to the vLLM or Qwen team.

The vllm/vllm-openai:nightly Docker image has the same problem.


Previous builds do not support the qwen3_vl_moe architecture:
Value error, The checkpoint you are trying to load has model type qwen3_vl_moe but Transformers does not recognize this architecture.
