AttributeError: 'FusedMoE' object has no attribute 'moe'

#1 opened by kq

VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 && export CUDA_VISIBLE_DEVICES=0,1,2,3 && vllm serve /home/deaf/Qwen3-Next-80B-A3B-Thinking-int4-mixed-AutoRound --port 12303 --gpu-memory-utilization 0.87 --dtype float16 --tensor-parallel-size 4 --max-model-len 131072 --max-seq-len-to-capture 131072 --api-key token-deaf --reasoning-parser deepseek_r1 --enable-auto-tool-choice --tool-call-parser hermes --served-model-name qwen3-next-80b
INFO 09-14 13:16:42 [__init__.py:216] Automatically detected platform cuda.
(APIServer pid=39871) INFO 09-14 13:16:46 [api_server.py:1896] vLLM API server version 0.10.2
(APIServer pid=39871) INFO 09-14 13:16:46 [utils.py:328] non-default args: {'model_tag': '/home/deaf/Qwen3-Next-80B-A3B-Thinking-int4-mixed-AutoRound', 'port': 12303, 'api_key': ['token-deaf'], 'enable_auto_tool_choice': True, 'tool_call_parser': 'hermes', 'model': '/home/deaf/Qwen3-Next-80B-A3B-Thinking-int4-mixed-AutoRound', 'dtype': 'float16', 'max_model_len': 131072, 'max_seq_len_to_capture': 131072, 'served_model_name': ['qwen3-next-80b'], 'reasoning_parser': 'deepseek_r1', 'tensor_parallel_size': 4, 'gpu_memory_utilization': 0.87}
(APIServer pid=39871) INFO 09-14 13:16:57 [__init__.py:742] Resolved architecture: Qwen3NextForCausalLM
(APIServer pid=39871) torch_dtype is deprecated! Use dtype instead!
(APIServer pid=39871) WARNING 09-14 13:16:57 [__init__.py:2767] Casting torch.bfloat16 to torch.float16.
(APIServer pid=39871) INFO 09-14 13:16:57 [__init__.py:1815] Using max model len 131072
(APIServer pid=39871) WARNING 09-14 13:16:57 [_ipex_ops.py:16] Import error msg: No module named 'intel_extension_for_pytorch'
(APIServer pid=39871) WARNING 09-14 13:16:57 [__init__.py:1217] auto-round quantization is not fully optimized yet. The speed can be slower than non-quantized models.
(APIServer pid=39871) INFO 09-14 13:16:57 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=39871) INFO 09-14 13:16:57 [config.py:310] Hybrid or mamba-based model detected: disabling prefix caching since it is not yet supported.
(APIServer pid=39871) INFO 09-14 13:16:57 [config.py:321] Hybrid or mamba-based model detected: setting cudagraph mode to FULL_AND_PIECEWISE in order to optimize performance.
(APIServer pid=39871) INFO 09-14 13:16:59 [config.py:390] Setting attention block size to 272 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=39871) INFO 09-14 13:16:59 [config.py:411] Padding mamba page size by 1.49% to ensure that mamba page size and attention page size are exactly equal.
INFO 09-14 13:17:05 [__init__.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=40052) INFO 09-14 13:17:09 [core.py:654] Waiting for init message from front-end.
(EngineCore_DP0 pid=40052) INFO 09-14 13:17:10 [core.py:76] Initializing a V1 LLM engine (v0.10.2) with config: model='/home/deaf/Qwen3-Next-80B-A3B-Thinking-int4-mixed-AutoRound', speculative_config=None, tokenizer='/home/deaf/Qwen3-Next-80B-A3B-Thinking-int4-mixed-AutoRound', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=auto-round, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend='deepseek_r1'), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=qwen3-next-80b, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":[2,1],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
(EngineCore_DP0 pid=40052) WARNING 09-14 13:17:10 [multiproc_worker_utils.py:273] Reducing Torch parallelism from 18 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=40052) INFO 09-14 13:17:10 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 16777216, 10, 'psm_6600501d'), local_subscribe_addr='ipc:///tmp/d3863f7d-af81-4019-a0fb-e0056707c242', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-14 13:17:16 [__init__.py:216] Automatically detected platform cuda.
INFO 09-14 13:17:16 [__init__.py:216] Automatically detected platform cuda.
INFO 09-14 13:17:16 [__init__.py:216] Automatically detected platform cuda.
INFO 09-14 13:17:16 [__init__.py:216] Automatically detected platform cuda.
W0914 13:17:20.965000 40137 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0914 13:17:20.965000 40137 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W0914 13:17:20.993000 40136 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0914 13:17:20.993000 40136 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W0914 13:17:21.040000 40134 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0914 13:17:21.040000 40134 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W0914 13:17:21.058000 40135 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0914 13:17:21.058000 40135 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
INFO 09-14 13:17:22 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_e3bd759f'), local_subscribe_addr='ipc:///tmp/878c9eab-0854-4d4c-aa99-6331ac35fc70', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-14 13:17:22 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_4d4244f2'), local_subscribe_addr='ipc:///tmp/04e6e1bb-ee7a-46d7-8f90-5cc525f459d4', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-14 13:17:22 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_5a424709'), local_subscribe_addr='ipc:///tmp/1c74404b-1df7-46fe-8ed9-a169ab5d95e1', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-14 13:17:22 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_ee394153'), local_subscribe_addr='ipc:///tmp/856dd7e1-ae67-4557-84c2-f89711e705b9', remote_subscribe_addr=None, remote_addr_ipv6=False)
[W914 13:17:23.476871468 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W914 13:17:23.573267660 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W914 13:17:23.812498308 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W914 13:17:23.819925860 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
INFO 09-14 13:17:23 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-14 13:17:23 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 09-14 13:17:23 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-14 13:17:23 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 09-14 13:17:23 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-14 13:17:23 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-14 13:17:23 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 09-14 13:17:23 [pynccl.py:70] vLLM is using nccl==2.27.3
WARNING 09-14 13:17:24 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 09-14 13:17:24 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 09-14 13:17:24 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 09-14 13:17:24 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 09-14 13:17:24 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_5ce31d9a'), local_subscribe_addr='ipc:///tmp/ee1acf9b-3da4-4291-bc39-5b993bb6ac98', remote_subscribe_addr=None, remote_addr_ipv6=False)
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
INFO 09-14 13:17:24 [parallel_state.py:1165] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 09-14 13:17:24 [parallel_state.py:1165] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
INFO 09-14 13:17:24 [parallel_state.py:1165] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2, EP rank 2
INFO 09-14 13:17:24 [parallel_state.py:1165] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3, EP rank 3
INFO 09-14 13:17:24 [topk_topp_sampler.py:58] Using FlashInfer for top-p & top-k sampling.
INFO 09-14 13:17:24 [topk_topp_sampler.py:58] Using FlashInfer for top-p & top-k sampling.
INFO 09-14 13:17:24 [topk_topp_sampler.py:58] Using FlashInfer for top-p & top-k sampling.
INFO 09-14 13:17:24 [topk_topp_sampler.py:58] Using FlashInfer for top-p & top-k sampling.
(Worker_TP2 pid=40136) INFO 09-14 13:17:24 [gpu_model_runner.py:2338] Starting to load model /home/deaf/Qwen3-Next-80B-A3B-Thinking-int4-mixed-AutoRound...
(Worker_TP1 pid=40135) INFO 09-14 13:17:24 [gpu_model_runner.py:2338] Starting to load model /home/deaf/Qwen3-Next-80B-A3B-Thinking-int4-mixed-AutoRound...
(Worker_TP3 pid=40137) INFO 09-14 13:17:24 [gpu_model_runner.py:2338] Starting to load model /home/deaf/Qwen3-Next-80B-A3B-Thinking-int4-mixed-AutoRound...
(Worker_TP0 pid=40134) INFO 09-14 13:17:24 [gpu_model_runner.py:2338] Starting to load model /home/deaf/Qwen3-Next-80B-A3B-Thinking-int4-mixed-AutoRound...
(Worker_TP1 pid=40135) INFO 09-14 13:17:24 [gpu_model_runner.py:2370] Loading model from scratch...
(Worker_TP2 pid=40136) INFO 09-14 13:17:24 [gpu_model_runner.py:2370] Loading model from scratch...
(Worker_TP3 pid=40137) INFO 09-14 13:17:24 [gpu_model_runner.py:2370] Loading model from scratch...
(Worker_TP0 pid=40134) INFO 09-14 13:17:24 [gpu_model_runner.py:2370] Loading model from scratch...
(Worker_TP1 pid=40135) INFO 09-14 13:17:25 [gptq_marlin.py:269] Using BitBLASLinearKernel for GPTQMarlinLinearMethod
(Worker_TP1 pid=40135) torch_dtype is deprecated! Use dtype instead!
(Worker_TP2 pid=40136) INFO 09-14 13:17:25 [gptq_marlin.py:269] Using BitBLASLinearKernel for GPTQMarlinLinearMethod
(Worker_TP1 pid=40135) INFO 09-14 13:17:25 [gptq_marlin.py:269] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(Worker_TP2 pid=40136) torch_dtype is deprecated! Use dtype instead!
(Worker_TP2 pid=40136) INFO 09-14 13:17:25 [gptq_marlin.py:269] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(Worker_TP0 pid=40134) INFO 09-14 13:17:25 [gptq_marlin.py:269] Using BitBLASLinearKernel for GPTQMarlinLinearMethod
(Worker_TP0 pid=40134) torch_dtype is deprecated! Use dtype instead!
(Worker_TP3 pid=40137) INFO 09-14 13:17:25 [gptq_marlin.py:269] Using BitBLASLinearKernel for GPTQMarlinLinearMethod
(Worker_TP0 pid=40134) INFO 09-14 13:17:25 [gptq_marlin.py:269] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(Worker_TP3 pid=40137) torch_dtype is deprecated! Use dtype instead!
(Worker_TP3 pid=40137) INFO 09-14 13:17:25 [gptq_marlin.py:269] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] WorkerProc failed to start.
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] Traceback (most recent call last):
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 559, in worker_main
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] worker = WorkerProc(*args, **kwargs)
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 427, in __init__
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] self.worker.load_model()
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 213, in load_model
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2371, in load_model
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] self.model = model_loader.load_model(
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 45, in load_model
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] model = initialize_model(vllm_config=vllm_config,
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 64, in initialize_model
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] return model_class(vllm_config=vllm_config, prefix=prefix)
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 1079, in __init__
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] self.model = Qwen3NextModel(vllm_config=vllm_config,
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 199, in __init__
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 915, in __init__
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] self.start_layer, self.end_layer, self.layers = make_layers(
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 643, in make_layers
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 904, in get_layer
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] return Qwen3NextDecoderLayer(
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 782, in __init__
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] self.mlp = Qwen3NextSparseMoeBlock(
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 115, in __init__
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] self.experts = FusedMoE(num_experts=self.n_routed_experts,
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 909, in __init__
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] else quant_config.get_quant_method(self, prefix))
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/auto_round.py", line 386, in get_quant_method
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] return self.apply_gptq_quant_layer(layer, prefix)
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/auto_round.py", line 330, in apply_gptq_quant_layer
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] return GPTQMarlinMoEMethod(quant_args_marlin, layer.moe)
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1962, in __getattr__
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] raise AttributeError(
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] AttributeError: 'FusedMoE' object has no attribute 'moe'
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] WorkerProc failed to start.
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] Traceback (most recent call last):
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 559, in worker_main
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] worker = WorkerProc(*args, **kwargs)
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 427, in __init__
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] self.worker.load_model()
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 213, in load_model
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2371, in load_model
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] self.model = model_loader.load_model(
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 45, in load_model
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] model = initialize_model(vllm_config=vllm_config,
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 64, in initialize_model
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] return model_class(vllm_config=vllm_config, prefix=prefix)
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 1079, in __init__
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] self.model = Qwen3NextModel(vllm_config=vllm_config,
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 199, in __init__
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 915, in __init__
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] self.start_layer, self.end_layer, self.layers = make_layers(
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 643, in make_layers
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 904, in get_layer
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] return Qwen3NextDecoderLayer(
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 782, in __init__
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] self.mlp = Qwen3NextSparseMoeBlock(
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 115, in __init__
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] self.experts = FusedMoE(num_experts=self.n_routed_experts,
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 909, in __init__
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] else quant_config.get_quant_method(self, prefix))
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/auto_round.py", line 386, in get_quant_method
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] return self.apply_gptq_quant_layer(layer, prefix)
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/auto_round.py", line 330, in apply_gptq_quant_layer
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] return GPTQMarlinMoEMethod(quant_args_marlin, layer.moe)
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1962, in __getattr__
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] raise AttributeError(
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] AttributeError: 'FusedMoE' object has no attribute 'moe'
(Worker_TP2 pid=40136) INFO 09-14 13:17:26 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP1 pid=40135) INFO 09-14 13:17:26 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP3 pid=40137) INFO 09-14 13:17:26 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP0 pid=40134) INFO 09-14 13:17:26 [multiproc_executor.py:546] Parent process exited, terminating worker
[rank0]:[W914 13:17:27.065687476 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] EngineCore failed to start.
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] Traceback (most recent call last):
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 505, in __init__
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 82, in __init__
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] self._init_executor()
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 99, in _init_executor
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 497, in wait_for_ready
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] raise e from None
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(EngineCore_DP0 pid=40052) Process EngineCore_DP0:
(EngineCore_DP0 pid=40052) Traceback (most recent call last):
(EngineCore_DP0 pid=40052) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=40052) self.run()
(EngineCore_DP0 pid=40052) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=40052) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=40052) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 722, in run_engine_core
(EngineCore_DP0 pid=40052) raise e
(EngineCore_DP0 pid=40052) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_DP0 pid=40052) engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=40052) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=40052) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 505, in __init__
(EngineCore_DP0 pid=40052) super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=40052) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 82, in __init__
(EngineCore_DP0 pid=40052) self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=40052) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=40052) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=40052) self._init_executor()
(EngineCore_DP0 pid=40052) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 99, in _init_executor
(EngineCore_DP0 pid=40052) self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=40052) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=40052) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 497, in wait_for_ready
(EngineCore_DP0 pid=40052) raise e from None
(EngineCore_DP0 pid=40052) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(APIServer pid=39871) Traceback (most recent call last):
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/bin/vllm", line 8, in <module>
(APIServer pid=39871) sys.exit(main())
(APIServer pid=39871) ^^^^^^
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 54, in main
(APIServer pid=39871) args.dispatch_function(args)
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 50, in cmd
(APIServer pid=39871) uvloop.run(run_server(args))
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
(APIServer pid=39871) return __asyncio.run(
(APIServer pid=39871) ^^^^^^^^^^^^^^
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=39871) return runner.run(main)
(APIServer pid=39871) ^^^^^^^^^^^^^^^^
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=39871) return self._loop.run_until_complete(task)
(APIServer pid=39871) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=39871) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=39871) return await main
(APIServer pid=39871) ^^^^^^^^^^
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1941, in run_server
(APIServer pid=39871) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1961, in run_server_worker
(APIServer pid=39871) async with build_async_engine_client(
(APIServer pid=39871) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=39871) return await anext(self.gen)
(APIServer pid=39871) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 179, in build_async_engine_client
(APIServer pid=39871) async with build_async_engine_client_from_engine_args(
(APIServer pid=39871) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=39871) return await anext(self.gen)
(APIServer pid=39871) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 221, in build_async_engine_client_from_engine_args
(APIServer pid=39871) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=39871) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/utils/__init__.py", line 1589, in inner
(APIServer pid=39871) return fn(*args, **kwargs)
(APIServer pid=39871) ^^^^^^^^^^^^^^^^^^^
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 212, in from_vllm_config
(APIServer pid=39871) return cls(
(APIServer pid=39871) ^^^^
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 136, in __init__
(APIServer pid=39871) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=39871) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 102, in make_async_mp_client
(APIServer pid=39871) return AsyncMPClient(*client_args)
(APIServer pid=39871) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 769, in __init__
(APIServer pid=39871) super().__init__(
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 448, in __init__
(APIServer pid=39871) with launch_core_engines(vllm_config, executor_class,
(APIServer pid=39871) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=39871) next(self.gen)
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 729, in launch_core_engines
(APIServer pid=39871) wait_for_engine_startup(
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 782, in wait_for_engine_startup
(APIServer pid=39871) raise RuntimeError("Engine core initialization failed. "
(APIServer pid=39871) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/home/deaf/miniconda3/envs/vllm/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '

P.S.
(vllm) deaf@rtxserver:~$ vllm -v
INFO 09-14 13:21:47 [__init__.py:216] Automatically detected platform cuda.
0.10.2
(vllm) deaf@rtxserver:~$ uname -a
Linux rtxserver 6.8.0-79-generic #79-Ubuntu SMP PREEMPT_DYNAMIC Tue Aug 12 14:42:46 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

It seems something is wrong with the network definition. Am I missing something?
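
For reference, the traceback points at how the quantization method is resolved for the MoE layer rather than at the layer stack itself: FusedMoE.__init__ calls quant_config.get_quant_method(self, prefix), and the AutoRound path (auto_round.py, apply_gptq_quant_layer) immediately reads layer.moe, an attribute this FusedMoE object does not have at that point. Below is a minimal sketch (not vLLM code; the two classes are simplified stand-ins) that reproduces the same AttributeError, assuming that ordering is the cause:

import torch.nn as nn

class QuantConfig:
    def get_quant_method(self, layer, prefix):
        # auto_round.apply_gptq_quant_layer does roughly:
        #   GPTQMarlinMoEMethod(quant_args_marlin, layer.moe)
        # i.e. it expects the layer to already carry a `moe` config object.
        return layer.moe

class FusedMoE(nn.Module):
    def __init__(self, quant_config, prefix=""):
        super().__init__()
        # The quant method is resolved while __init__ is still running and no
        # `moe` attribute has been set yet, so nn.Module.__getattr__ raises.
        self.quant_method = quant_config.get_quant_method(self, prefix)

try:
    FusedMoE(QuantConfig())
except AttributeError as e:
    print(e)  # 'FusedMoE' object has no attribute 'moe'

If that reading is right, this is a mismatch between the FusedMoE layer and the auto-round quantization code in this vLLM build rather than a problem with the checkpoint itself, so waiting for a newer vLLM release (or a build of main where the two agree) seems reasonable.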

Intel org

Thank you. I will just wait for the next vLLM release and retest this.
