vllm crashing on 0.19.0

#38
by evilperson068 - opened
(EngineCore pid=67010)   File "/opt/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=67010)     return func(*args, **kwargs)
(EngineCore pid=67010)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=67010)   File "/opt/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py", line 332, in execute_model
(EngineCore pid=67010)     return self.worker.execute_model(scheduler_output)
(EngineCore pid=67010)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=67010)   File "/opt/miniconda3/envs/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=67010)     return func(*args, **kwargs)
(EngineCore pid=67010)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=67010)   File "/opt/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 803, in execute_model
(EngineCore pid=67010)     output = self.model_runner.execute_model(
(EngineCore pid=67010)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=67010)   File "/opt/miniconda3/envs/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=67010)     return func(*args, **kwargs)
(EngineCore pid=67010)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=67010)   File "/opt/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3992, in execute_model
(EngineCore pid=67010)     ) = self._preprocess(
(EngineCore pid=67010)         ^^^^^^^^^^^^^^^^^
(EngineCore pid=67010)   File "/opt/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3241, in _preprocess
(EngineCore pid=67010)     self.inputs_embeds.gpu[:num_scheduled_tokens].copy_(inputs_embeds_scheduled)
(EngineCore pid=67010) RuntimeError: The size of tensor a (4) must match the size of tensor b (3) at non-singleton dimension 0
(APIServer pid=66965) INFO:     192.168.1.77:59725 - "POST /v1/audio/transcriptions HTTP/1.1" 500 Internal Server Error
(APIServer pid=66965) INFO:     192.168.1.77:59724 - "POST /v1/audio/transcriptions HTTP/1.1" 500 Internal Server Error
(APIServer pid=66965) INFO:     192.168.1.77:59726 - "POST /v1/audio/transcriptions HTTP/1.1" 500 Internal Server Error
(APIServer pid=66965) INFO:     192.168.1.77:59727 - "POST /v1/audio/transcriptions HTTP/1.1" 500 Internal Server Error
[rank0]:[W408 21:32:09.594753642 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

GPU: RTX Pro 6000
Repro steps:

  1. Install vLLM 0.19.0.
  2. Submit a few requests to /v1/audio/transcriptions.
  3. The server crashes.
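For anyone trying to reproduce this, here is a rough sketch of a script that sends several transcription requests at once, matching the four near-simultaneous POSTs in the log. It is not the original poster's client. The base URL, model name, helper names, and the generated silent WAV clip are all assumptions for illustration; silent audio may not trigger the crash, so swap in real speech if needed.

```python
import io
import urllib.error
import urllib.request
import uuid
import wave
from concurrent.futures import ThreadPoolExecutor

# Assumptions: adjust to match your own `vllm serve` host, port and model name.
BASE_URL = "http://localhost:8000"
MODEL = "model"


def make_silent_wav(seconds: float = 1.0, rate: int = 16000) -> bytes:
    """Build a mono 16-bit PCM WAV clip of silence, entirely in memory."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


def build_multipart(fields: dict, file_name: str, file_bytes: bytes):
    """Encode form fields plus one file part as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{file_name}"\r\nContent-Type: audio/wav\r\n\r\n'.encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(wav: bytes, base_url: str = BASE_URL, model: str = MODEL) -> int:
    """POST one clip to /v1/audio/transcriptions and return the HTTP status."""
    body, content_type = build_multipart({"model": model}, "sample.wav", wav)
    req = urllib.request.Request(
        base_url + "/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=60) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # A 500 here corresponds to the "Internal Server Error" lines in the log.
        return err.code


def fire_concurrent(n: int = 4) -> list:
    """Send n requests in parallel and collect their status codes."""
    wav = make_silent_wav()
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(transcribe, [wav] * n))
```

With a server running, calling `fire_concurrent(4)` returns one status code per request. A list of 500s would line up with the log above, though whether the crash appears may depend on the audio content and how the requests get batched.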

Hi,

I am getting a similar error on vLLM version 0.19.1. Did you find a version on which it runs fine?

RuntimeError: output with shape [1, 5120] doesn't match the broadcast shape [0, 5120]
INFO: Shutting down
[rank0]:[W421 09:09:07.042940582 ProcessGroupNCCL.cpp:1575] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

I managed to get it running using:

VLLM_DISABLE_COMPILE_CACHE=1 vllm serve "mistralai/Voxtral-Mini-4B-Realtime-2602" --trust-remote-code --host 0.0.0.0 --port 7899 --quantization fp8 --gpu-memory-utilization=0.24 --served-model-name model --max-model-len 4096 --compilation_config '{"cudagraph_mode": "PIECEWISE"}'

Always try with the nightly version:

# vllm --version
0.1.dev1+g9942f5c50.precompiled
