RoPE scaling pre-applied?
Is RoPE scaling pre-applied to this model?
I have 2 x A2000 on which I run Qwen3-14B-AWQ without problems; however, even with a very minimal setup (reduced context, no concurrency, etc.) I cannot get this 8B AWQ to load.
Wondering what the difference is here?
No. I did nothing extra besides quantization via AutoAWQ.
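For reference, it was just the stock AutoAWQ flow, something along these lines (the paths and quant_config values here are illustrative, not necessarily the exact ones used for this upload):

```python
# Roughly the standard AutoAWQ quantization flow; paths are placeholders and
# the quant_config values are the common defaults.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "path/to/base-model"
quant_path = "path/to/output-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```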
So the model does not load at all? Your RTX A2000s have Ampere chips, right? I can try to get my hands on an A40 (also Ampere) and run it there. What are you using as the inference server? vLLM?
Yeah, it's weird: regardless of the settings and how much I ramp them down, I get out-of-memory errors when trying to load it, which doesn't make sense for an 8B model when the 14B model loads fine at higher settings.
2 x A2000 12GB Ampere
I'm running vLLM; I'll try to grab some logs later if that helps.
OK, even more weird: I just spun it up again to get logs and it worked first time. It must have been some weirdness at the time, with something stuck in GPU memory that wasn't showing in nvidia-smi, even though I had rebooted previously to make sure it was actually cleared!
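For what it's worth, next time I'll sanity-check what each GPU actually reports as free before launching; something along these lines, run inside the same container as vLLM (illustrative only):

```python
# Quick check of free vs. total memory as reported by the CUDA driver,
# independent of what nvidia-smi shows on the host.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1024**3:.2f} GiB free of {total / 1024**3:.2f} GiB")
```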
The 14B version of this model still fails in the same way, while the 14B Qwen model works fine. Hopefully the logs below shed some light, as I believe it should fit; reducing the settings does nothing to change this one, and it is the same issue I was getting with the 8B previously.
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 1.26it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.04it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.07it/s]
vllm | (VllmWorker rank=0 pid=65)
vllm | (VllmWorker rank=0 pid=65) INFO 07-19 06:46:18 [default_loader.py:272] Loading weights took 1.94 seconds
vllm | (VllmWorker rank=1 pid=66) INFO 07-19 06:46:18 [default_loader.py:272] Loading weights took 2.81 seconds
vllm | (VllmWorker rank=0 pid=65) INFO 07-19 06:46:19 [gpu_model_runner.py:1801] Model loading took 4.6793 GiB and 5.267483 seconds
vllm | (VllmWorker rank=1 pid=66) INFO 07-19 06:46:19 [gpu_model_runner.py:1801] Model loading took 4.6793 GiB and 5.695640 seconds
vllm | (VllmWorker rank=1 pid=66) INFO 07-19 06:46:35 [backends.py:508] Using cache directory: /root/.cache/vllm/torch_compile_cache/b83644cca3/rank_1_0/backbone for vLLM's torch.compile
vllm | (VllmWorker rank=1 pid=66) INFO 07-19 06:46:35 [backends.py:519] Dynamo bytecode transform time: 15.40 s
vllm | (VllmWorker rank=0 pid=65) INFO 07-19 06:46:35 [backends.py:508] Using cache directory: /root/.cache/vllm/torch_compile_cache/b83644cca3/rank_0_0/backbone for vLLM's torch.compile
vllm | (VllmWorker rank=0 pid=65) INFO 07-19 06:46:35 [backends.py:519] Dynamo bytecode transform time: 15.89 s
vllm | (VllmWorker rank=1 pid=66) INFO 07-19 06:46:40 [backends.py:181] Cache the graph of shape None for later use
vllm | (VllmWorker rank=0 pid=65) INFO 07-19 06:46:40 [backends.py:181] Cache the graph of shape None for later use
vllm | (VllmWorker rank=1 pid=66) INFO 07-19 06:47:33 [backends.py:193] Compiling a graph for general shape takes 57.29 s
vllm | (VllmWorker rank=0 pid=65) INFO 07-19 06:47:34 [backends.py:193] Compiling a graph for general shape takes 57.71 s
vllm | (VllmWorker rank=1 pid=66) INFO 07-19 06:48:35 [monitor.py:34] torch.compile takes 72.69 s in total
vllm | (VllmWorker rank=0 pid=65) INFO 07-19 06:48:35 [monitor.py:34] torch.compile takes 73.59 s in total
vllm | (VllmWorker rank=1 pid=66) INFO 07-19 06:48:37 [gpu_worker.py:232] Available KV cache memory: 5.63 GiB
vllm | (VllmWorker rank=0 pid=65) INFO 07-19 06:48:37 [gpu_worker.py:232] Available KV cache memory: 4.96 GiB
vllm | ERROR 07-19 06:48:37 [core.py:586] EngineCore failed to start.
vllm | ERROR 07-19 06:48:37 [core.py:586] Traceback (most recent call last):
vllm | ERROR 07-19 06:48:37 [core.py:586] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 577, in run_engine_core
vllm | ERROR 07-19 06:48:37 [core.py:586] engine_core = EngineCoreProc(*args, **kwargs)
vllm | ERROR 07-19 06:48:37 [core.py:586] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 404, in __init__
vllm | ERROR 07-19 06:48:37 [core.py:586] super().__init__(vllm_config, executor_class, log_stats,
vllm | ERROR 07-19 06:48:37 [core.py:586] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 82, in __init__
vllm | ERROR 07-19 06:48:37 [core.py:586] self._initialize_kv_caches(vllm_config)
vllm | ERROR 07-19 06:48:37 [core.py:586] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 146, in _initialize_kv_caches
vllm | ERROR 07-19 06:48:37 [core.py:586] kv_cache_configs = [
vllm | ERROR 07-19 06:48:37 [core.py:586] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 147, in <listcomp>
vllm | ERROR 07-19 06:48:37 [core.py:586] get_kv_cache_config(vllm_config, kv_cache_spec_one_worker,
vllm | ERROR 07-19 06:48:37 [core.py:586] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/core/kv_cache_utils.py", line 943, in get_kv_cache_config
vllm | ERROR 07-19 06:48:37 [core.py:586] check_enough_kv_cache_memory(vllm_config, kv_cache_spec, available_memory)
vllm | ERROR 07-19 06:48:37 [core.py:586] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/core/kv_cache_utils.py", line 572, in check_enough_kv_cache_memory
vllm | ERROR 07-19 06:48:37 [core.py:586] raise ValueError(
vllm | ERROR 07-19 06:48:37 [core.py:586] ValueError: To serve at least one request with the models's max seq len (57344), (5.25 GiB KV cache is needed, which is larger than the available KV cache memory (4.96 GiB). Based on the available memory, the estimated maximum model length is 54160. Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
vllm | ERROR 07-19 06:48:39 [multiproc_executor.py:135] Worker proc VllmWorker-0 died unexpectedly, shutting down executor.
vllm | Process EngineCore_0:
vllm | Traceback (most recent call last):
vllm | File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
vllm | self.run()
vllm | File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
vllm | self._target(*self._args, **self._kwargs)
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 590, in run_engine_core
vllm | raise e
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 577, in run_engine_core
vllm | engine_core = EngineCoreProc(*args, **kwargs)
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 404, in __init__
vllm | super().__init__(vllm_config, executor_class, log_stats,
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 82, in __init__
vllm | self._initialize_kv_caches(vllm_config)
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 146, in _initialize_kv_caches
vllm | kv_cache_configs = [
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 147, in <listcomp>
vllm | get_kv_cache_config(vllm_config, kv_cache_spec_one_worker,
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/v1/core/kv_cache_utils.py", line 943, in get_kv_cache_config
vllm | check_enough_kv_cache_memory(vllm_config, kv_cache_spec, available_memory)
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/v1/core/kv_cache_utils.py", line 572, in check_enough_kv_cache_memory
vllm | raise ValueError(
vllm | ValueError: To serve at least one request with the models's max seq len (57344), (5.25 GiB KV cache is needed, which is larger than the available KV cache memory (4.96 GiB). Based on the available memory, the estimated maximum model length is 54160. Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
vllm | Traceback (most recent call last):
vllm | File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
vllm | return _run_code(code, main_globals, None,
vllm | File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
vllm | exec(code, run_globals)
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 1495, in <module>
vllm | uvloop.run(run_server(args))
vllm | File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run
vllm | return loop.run_until_complete(wrapper())
vllm | File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
vllm | File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper
vllm | return await main
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 1431, in run_server
vllm | await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 1451, in run_server_worker
vllm | async with build_async_engine_client(args, client_config) as engine_client:
vllm | File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
vllm | return await anext(self.gen)
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 158, in build_async_engine_client
vllm | async with build_async_engine_client_from_engine_args(
vllm | File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
vllm | return await anext(self.gen)
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
vllm | async_llm = AsyncLLM.from_vllm_config(
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/async_llm.py", line 162, in from_vllm_config
vllm | return cls(
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/async_llm.py", line 124, in __init__
vllm | self.engine_core = EngineCoreClient.make_async_mp_client(
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core_client.py", line 96, in make_async_mp_client
vllm | return AsyncMPClient(*client_args)
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core_client.py", line 666, in __init__
vllm | super().__init__(
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core_client.py", line 403, in __init__
vllm | with launch_core_engines(vllm_config, executor_class,
vllm | File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
vllm | next(self.gen)
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/utils.py", line 434, in launch_core_engines
vllm | wait_for_engine_startup(
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/utils.py", line 484, in wait_for_engine_startup
vllm | raise RuntimeError("Engine core initialization failed. "
vllm | RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
vllm exited with code 1
Even massively reducing max_tokens and max_num_seqs and increasing gpu_memory_utilization to 0.98, it still fails; in fact it fails even if I reduce these down to around 4k.
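For context, these are the knobs I mean, roughly the equivalent of my server settings expressed through vLLM's offline API (the values here are just examples, not my exact config):

```python
# Illustrative only: the engine settings being reduced, expressed via the
# offline LLM API rather than the OpenAI-compatible server I actually run
# (which takes the equivalent --max-model-len / --max-num-seqs /
# --gpu-memory-utilization flags).
from vllm import LLM

llm = LLM(
    model="path/to/8B-awq-model",   # placeholder
    tensor_parallel_size=2,         # 2 x RTX A2000 12GB
    max_model_len=4096,             # "down to around 4k"
    max_num_seqs=1,
    gpu_memory_utilization=0.98,
)
```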
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 1.35it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.22it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.24it/s]
vllm | (VllmWorker rank=0 pid=65)
vllm | (VllmWorker rank=0 pid=65) INFO 07-19 06:52:23 [default_loader.py:272] Loading weights took 1.69 seconds
vllm | (VllmWorker rank=0 pid=65) INFO 07-19 06:52:24 [gpu_model_runner.py:1801] Model loading took 4.6793 GiB and 3.425591 seconds
vllm | (VllmWorker rank=1 pid=66) INFO 07-19 06:52:24 [default_loader.py:272] Loading weights took 2.57 seconds
vllm | (VllmWorker rank=1 pid=66) INFO 07-19 06:52:25 [gpu_model_runner.py:1801] Model loading took 4.6793 GiB and 4.789749 seconds
vllm | (VllmWorker rank=1 pid=66) INFO 07-19 06:52:41 [backends.py:508] Using cache directory: /root/.cache/vllm/torch_compile_cache/17f8b63333/rank_1_0/backbone for vLLM's torch.compile
vllm | (VllmWorker rank=1 pid=66) INFO 07-19 06:52:41 [backends.py:519] Dynamo bytecode transform time: 15.38 s
vllm | (VllmWorker rank=0 pid=65) INFO 07-19 06:52:41 [backends.py:508] Using cache directory: /root/.cache/vllm/torch_compile_cache/17f8b63333/rank_0_0/backbone for vLLM's torch.compile
vllm | (VllmWorker rank=0 pid=65) INFO 07-19 06:52:41 [backends.py:519] Dynamo bytecode transform time: 15.73 s
vllm | (VllmWorker rank=1 pid=66) INFO 07-19 06:52:46 [backends.py:181] Cache the graph of shape None for later use
vllm | (VllmWorker rank=0 pid=65) INFO 07-19 06:52:46 [backends.py:181] Cache the graph of shape None for later use
vllm | (VllmWorker rank=1 pid=66) INFO 07-19 06:53:40 [backends.py:193] Compiling a graph for general shape takes 57.81 s
vllm | (VllmWorker rank=0 pid=65) INFO 07-19 06:53:40 [backends.py:193] Compiling a graph for general shape takes 57.89 s
vllm | (VllmWorker rank=1 pid=66) INFO 07-19 06:54:40 [monitor.py:34] torch.compile takes 73.19 s in total
vllm | (VllmWorker rank=0 pid=65) INFO 07-19 06:54:40 [monitor.py:34] torch.compile takes 73.62 s in total
vllm | (VllmWorker rank=1 pid=66) INFO 07-19 06:54:41 [gpu_worker.py:232] Available KV cache memory: 6.57 GiB
vllm | (VllmWorker rank=0 pid=65) INFO 07-19 06:54:41 [gpu_worker.py:232] Available KV cache memory: 5.84 GiB
vllm | INFO 07-19 06:54:41 [kv_cache_utils.py:716] GPU KV cache size: 63,776 tokens
vllm | INFO 07-19 06:54:41 [kv_cache_utils.py:720] Maximum concurrency for 38,912 tokens per request: 1.64x
vllm | INFO 07-19 06:54:41 [kv_cache_utils.py:716] GPU KV cache size: 71,760 tokens
vllm | INFO 07-19 06:54:41 [kv_cache_utils.py:720] Maximum concurrency for 38,912 tokens per request: 1.84x
Capturing CUDA graph shapes: 7%|▋ | 5/67 [00:04<00:59, 1.03it/s]
vllm | (VllmWorker rank=0 pid=65) ERROR 07-19 06:54:46 [multiproc_executor.py:522] WorkerProc hit an exception.
vllm | (VllmWorker rank=0 pid=65) ERROR 07-19 06:54:46 [multiproc_executor.py:522] Traceback (most recent call last):
vllm | (VllmWorker rank=0 pid=65) ERROR 07-19 06:54:46 [multiproc_executor.py:522] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 517, in worker_busy_loop
vllm | (VllmWorker rank=0 pid=65) ERROR 07-19 06:54:46 [multiproc_executor.py:522] output = func(*args, **kwargs)
vllm | (VllmWorker rank=0 pid=65) ERROR 07-19 06:54:46 [multiproc_executor.py:522] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_worker.py", line 267, in compile_or_warm_up_model
vllm | (VllmWorker rank=0 pid=65) ERROR 07-19 06:54:46 [multiproc_executor.py:522] self.model_runner.capture_model()
vllm | (VllmWorker rank=0 pid=65) ERROR 07-19 06:54:46 [multiproc_executor.py:522] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2314, in capture_model
vllm | (VllmWorker rank=0 pid=65) ERROR 07-19 06:54:46 [multiproc_executor.py:522] self._dummy_run(num_tokens,
vllm | (VllmWorker rank=0 pid=65) ERROR 07-19 06:54:46 [multiproc_executor.py:522] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
vllm | (VllmWorker rank=0 pid=65) ERROR 07-19 06:54:46 [multiproc_executor.py:522] return func(*args, **kwargs)
vllm | (VllmWorker rank=0 pid=65) ERROR 07-19 06:54:46 [multiproc_executor.py:522] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2083, in _dummy_run
vllm | (VllmWorker rank=0 pid=65) ERROR 07-19 06:54:46 [multiproc_executor.py:522] return hidden_states, hidden_states[logit_indices]
vllm | (VllmWorker rank=0 pid=65) ERROR 07-19 06:54:46 [multiproc_executor.py:522] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 11.00 GiB of which 2.81 MiB is free. Process 10960 has 11.00 GiB memory in use. Of the allocated memory 10.61 GiB is allocated by PyTorch, with 62.00 MiB allocated in private pools (e.g., CUDA Graphs), and 100.59 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
vllm | ERROR 07-19 06:54:46 [core.py:586] EngineCore failed to start.
vllm | ERROR 07-19 06:54:46 [core.py:586] Traceback (most recent call last):
vllm | ERROR 07-19 06:54:46 [core.py:586] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 577, in run_engine_core
vllm | ERROR 07-19 06:54:46 [core.py:586] engine_core = EngineCoreProc(*args, **kwargs)
vllm | ERROR 07-19 06:54:46 [core.py:586] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 404, in __init__
vllm | ERROR 07-19 06:54:46 [core.py:586] super().__init__(vllm_config, executor_class, log_stats,
vllm | ERROR 07-19 06:54:46 [core.py:586] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 82, in __init__
vllm | ERROR 07-19 06:54:46 [core.py:586] self._initialize_kv_caches(vllm_config)
vllm | ERROR 07-19 06:54:46 [core.py:586] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 169, in _initialize_kv_caches
vllm | ERROR 07-19 06:54:46 [core.py:586] self.model_executor.initialize_from_config(kv_cache_configs)
vllm | ERROR 07-19 06:54:46 [core.py:586] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/abstract.py", line 66, in initialize_from_config
vllm | ERROR 07-19 06:54:46 [core.py:586] self.collective_rpc("compile_or_warm_up_model")
vllm | ERROR 07-19 06:54:46 [core.py:586] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 215, in collective_rpc
vllm | ERROR 07-19 06:54:46 [core.py:586] result = get_response(w, dequeue_timeout)
vllm | ERROR 07-19 06:54:46 [core.py:586] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 202, in get_response
vllm | ERROR 07-19 06:54:46 [core.py:586] raise RuntimeError(
vllm | ERROR 07-19 06:54:46 [core.py:586] RuntimeError: Worker failed with error 'CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 11.00 GiB of which 2.81 MiB is free. Process 10960 has 11.00 GiB memory in use. Of the allocated memory 10.61 GiB is allocated by PyTorch, with 62.00 MiB allocated in private pools (e.g., CUDA Graphs), and 100.59 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)', please check the stack trace above for the root cause
vllm | ERROR 07-19 06:54:48 [multiproc_executor.py:135] Worker proc VllmWorker-1 died unexpectedly, shutting down executor.
vllm | Process EngineCore_0:
vllm | Traceback (most recent call last):
vllm | File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
vllm | self.run()
vllm | File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
vllm | self._target(*self._args, **self._kwargs)
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 590, in run_engine_core
vllm | raise e
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 577, in run_engine_core
vllm | engine_core = EngineCoreProc(*args, **kwargs)
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 404, in __init__
vllm | super().__init__(vllm_config, executor_class, log_stats,
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 82, in __init__
vllm | self._initialize_kv_caches(vllm_config)
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 169, in _initialize_kv_caches
vllm | self.model_executor.initialize_from_config(kv_cache_configs)
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/abstract.py", line 66, in initialize_from_config
vllm | self.collective_rpc("compile_or_warm_up_model")
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 215, in collective_rpc
vllm | result = get_response(w, dequeue_timeout)
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 202, in get_response
vllm | raise RuntimeError(
vllm | RuntimeError: Worker failed with error 'CUDA out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 11.00 GiB of which 2.81 MiB is free. Process 10960 has 11.00 GiB memory in use. Of the allocated memory 10.61 GiB is allocated by PyTorch, with 62.00 MiB allocated in private pools (e.g., CUDA Graphs), and 100.59 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)', please check the stack trace above for the root cause
vllm | Traceback (most recent call last):
vllm | File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
vllm | return _run_code(code, main_globals, None,
vllm | File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
vllm | exec(code, run_globals)
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 1495, in <module>
vllm | uvloop.run(run_server(args))
vllm | File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run
vllm | return loop.run_until_complete(wrapper())
vllm | File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
vllm | File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper
vllm | return await main
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 1431, in run_server
vllm | await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 1451, in run_server_worker
vllm | async with build_async_engine_client(args, client_config) as engine_client:
vllm | File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
vllm | return await anext(self.gen)
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 158, in build_async_engine_client
vllm | async with build_async_engine_client_from_engine_args(
vllm | File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
vllm | return await anext(self.gen)
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
vllm | async_llm = AsyncLLM.from_vllm_config(
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/async_llm.py", line 162, in from_vllm_config
vllm | return cls(
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/async_llm.py", line 124, in __init__
vllm | self.engine_core = EngineCoreClient.make_async_mp_client(
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core_client.py", line 96, in make_async_mp_client
vllm | return AsyncMPClient(*client_args)
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core_client.py", line 666, in __init__
vllm | super().__init__(
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core_client.py", line 403, in __init__
vllm | with launch_core_engines(vllm_config, executor_class,
vllm | File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
vllm | next(self.gen)
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/utils.py", line 434, in launch_core_engines
vllm | wait_for_engine_startup(
vllm | File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/utils.py", line 484, in wait_for_engine_startup
vllm | raise RuntimeError("Engine core initialization failed. "
vllm | RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
vllm exited with code 1
Which version of vLLM do you use? v0.9.2? v0.9.2 worked for me for most models. With v0.10.0 I observe some OOMs for small models (Llama 3.1 8B, no quant, used for benchmarks) when trying to run it on an RTX 5090 to test the status of Blackwell support.
So I have come back to this after some effort, with vLLM having rebased all of their 0.9.x branches onto a version of CUDA that my system can't support.
This is working fine with 0.9.2, and I get 50 t/s on my 2 x RTX A2000s, which is pretty decent.
I'm currently trying to compile 0.10.0 and FlashInfer to see if I can squeeze some more out of it, but whatever the issue was previously seems to no longer exist. Not sure if it was a blip or something else...
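If I do get FlashInfer built, my understanding (not yet verified on my setup) is that switching to it is just an environment variable set before vLLM starts up:

```python
# Illustrative: select the FlashInfer attention backend. Assumes flashinfer
# is installed in the same environment as vLLM; the variable must be set
# before vLLM initializes (or exported in the container environment).
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM  # imported after setting the variable so it takes effect
```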