AttributeError: 'FusedMoE' object has no attribute 'moe'

#1 opened by kq

VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 && export CUDA_VISIBLE_DEVICES=0,1,2,3 && vllm serve /home/deaf/Qwen3-Next-80B-A3B-Thinking-int4-mixed-AutoRound --port 12303 --gpu-memory-utilization 0.87 --dtype float16 --tensor-parallel-size 4 --max-model-len 131072 --max-seq-len-to-capture 131072 --api-key token-deaf --reasoning-parser deepseek_r1 --enable-auto-tool-choice --tool-call-parser hermes --served-model-name qwen3-next-80b
INFO 09-14 13:16:42 [__init__.py:216] Automatically detected platform cuda.
(APIServer pid=39871) INFO 09-14 13:16:46 [api_server.py:1896] vLLM API server version 0.10.2
(APIServer pid=39871) INFO 09-14 13:16:46 [utils.py:328] non-default args: {'model_tag': '/home/deaf/Qwen3-Next-80B-A3B-Thinking-int4-mixed-AutoRound', 'port': 12303, 'api_key': ['token-deaf'], 'enable_auto_tool_choice': True, 'tool_call_parser': 'hermes', 'model': '/home/deaf/Qwen3-Next-80B-A3B-Thinking-int4-mixed-AutoRound', 'dtype': 'float16', 'max_model_len': 131072, 'max_seq_len_to_capture': 131072, 'served_model_name': ['qwen3-next-80b'], 'reasoning_parser': 'deepseek_r1', 'tensor_parallel_size': 4, 'gpu_memory_utilization': 0.87}
(APIServer pid=39871) INFO 09-14 13:16:57 [__init__.py:742] Resolved architecture: Qwen3NextForCausalLM
(APIServer pid=39871) torch_dtype is deprecated! Use dtype instead!
(APIServer pid=39871) WARNING 09-14 13:16:57 [__init__.py:2767] Casting torch.bfloat16 to torch.float16.
(APIServer pid=39871) INFO 09-14 13:16:57 [__init__.py:1815] Using max model len 131072
(APIServer pid=39871) WARNING 09-14 13:16:57 [_ipex_ops.py:16] Import error msg: No module named 'intel_extension_for_pytorch'
(APIServer pid=39871) WARNING 09-14 13:16:57 [__init__.py:1217] auto-round quantization is not fully optimized yet. The speed can be slower than non-quantized models.
(APIServer pid=39871) INFO 09-14 13:16:57 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=39871) INFO 09-14 13:16:57 [config.py:310] Hybrid or mamba-based model detected: disabling prefix caching since it is not yet supported.
(APIServer pid=39871) INFO 09-14 13:16:57 [config.py:321] Hybrid or mamba-based model detected: setting cudagraph mode to FULL_AND_PIECEWISE in order to optimize performance.
(APIServer pid=39871) INFO 09-14 13:16:59 [config.py:390] Setting attention block size to 272 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=39871) INFO 09-14 13:16:59 [config.py:411] Padding mamba page size by 1.49% to ensure that mamba page size and attention page size are exactly equal.
INFO 09-14 13:17:05 [__init__.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=40052) INFO 09-14 13:17:09 [core.py:654] Waiting for init message from front-end.
(EngineCore_DP0 pid=40052) INFO 09-14 13:17:10 [core.py:76] Initializing a V1 LLM engine (v0.10.2) with config: model='/home/deaf/Qwen3-Next-80B-A3B-Thinking-int4-mixed-AutoRound', speculative_config=None, tokenizer='/home/deaf/Qwen3-Next-80B-A3B-Thinking-int4-mixed-AutoRound', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=auto-round, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend='deepseek_r1'), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=qwen3-next-80b, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":[2,1],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
(EngineCore_DP0 pid=40052) WARNING 09-14 13:17:10 [multiproc_worker_utils.py:273] Reducing Torch parallelism from 18 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=40052) INFO 09-14 13:17:10 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 16777216, 10, 'psm_6600501d'), local_subscribe_addr='ipc:///tmp/d3863f7d-af81-4019-a0fb-e0056707c242', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-14 13:17:16 [__init__.py:216] Automatically detected platform cuda.
INFO 09-14 13:17:16 [__init__.py:216] Automatically detected platform cuda.
INFO 09-14 13:17:16 [__init__.py:216] Automatically detected platform cuda.
INFO 09-14 13:17:16 [__init__.py:216] Automatically detected platform cuda.
W0914 13:17:20.965000 40137 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0914 13:17:20.965000 40137 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W0914 13:17:20.993000 40136 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0914 13:17:20.993000 40136 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W0914 13:17:21.040000 40134 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0914 13:17:21.040000 40134 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W0914 13:17:21.058000 40135 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0914 13:17:21.058000 40135 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
INFO 09-14 13:17:22 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_e3bd759f'), local_subscribe_addr='ipc:///tmp/878c9eab-0854-4d4c-aa99-6331ac35fc70', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-14 13:17:22 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_4d4244f2'), local_subscribe_addr='ipc:///tmp/04e6e1bb-ee7a-46d7-8f90-5cc525f459d4', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-14 13:17:22 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_5a424709'), local_subscribe_addr='ipc:///tmp/1c74404b-1df7-46fe-8ed9-a169ab5d95e1', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-14 13:17:22 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_ee394153'), local_subscribe_addr='ipc:///tmp/856dd7e1-ae67-4557-84c2-f89711e705b9', remote_subscribe_addr=None, remote_addr_ipv6=False)
[W914 13:17:23.476871468 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W914 13:17:23.573267660 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W914 13:17:23.812498308 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W914 13:17:23.819925860 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
INFO 09-14 13:17:23 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-14 13:17:23 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 09-14 13:17:23 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-14 13:17:23 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 09-14 13:17:23 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-14 13:17:23 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-14 13:17:23 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 09-14 13:17:23 [pynccl.py:70] vLLM is using nccl==2.27.3
WARNING 09-14 13:17:24 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 09-14 13:17:24 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 09-14 13:17:24 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 09-14 13:17:24 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 09-14 13:17:24 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_5ce31d9a'), local_subscribe_addr='ipc:///tmp/ee1acf9b-3da4-4291-bc39-5b993bb6ac98', remote_subscribe_addr=None, remote_addr_ipv6=False)
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
INFO 09-14 13:17:24 [parallel_state.py:1165] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 09-14 13:17:24 [parallel_state.py:1165] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
INFO 09-14 13:17:24 [parallel_state.py:1165] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2, EP rank 2
INFO 09-14 13:17:24 [parallel_state.py:1165] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3, EP rank 3
INFO 09-14 13:17:24 [topk_topp_sampler.py:58] Using FlashInfer for top-p & top-k sampling.
INFO 09-14 13:17:24 [topk_topp_sampler.py:58] Using FlashInfer for top-p & top-k sampling.
INFO 09-14 13:17:24 [topk_topp_sampler.py:58] Using FlashInfer for top-p & top-k sampling.
INFO 09-14 13:17:24 [topk_topp_sampler.py:58] Using FlashInfer for top-p & top-k sampling.
(Worker_TP2 pid=40136) INFO 09-14 13:17:24 [gpu_model_runner.py:2338] Starting to load model /home/deaf/Qwen3-Next-80B-A3B-Thinking-int4-mixed-AutoRound...
(Worker_TP1 pid=40135) INFO 09-14 13:17:24 [gpu_model_runner.py:2338] Starting to load model /home/deaf/Qwen3-Next-80B-A3B-Thinking-int4-mixed-AutoRound...
(Worker_TP3 pid=40137) INFO 09-14 13:17:24 [gpu_model_runner.py:2338] Starting to load model /home/deaf/Qwen3-Next-80B-A3B-Thinking-int4-mixed-AutoRound...
(Worker_TP0 pid=40134) INFO 09-14 13:17:24 [gpu_model_runner.py:2338] Starting to load model /home/deaf/Qwen3-Next-80B-A3B-Thinking-int4-mixed-AutoRound...
(Worker_TP1 pid=40135) INFO 09-14 13:17:24 [gpu_model_runner.py:2370] Loading model from scratch...
(Worker_TP2 pid=40136) INFO 09-14 13:17:24 [gpu_model_runner.py:2370] Loading model from scratch...
(Worker_TP3 pid=40137) INFO 09-14 13:17:24 [gpu_model_runner.py:2370] Loading model from scratch...
(Worker_TP0 pid=40134) INFO 09-14 13:17:24 [gpu_model_runner.py:2370] Loading model from scratch...
(Worker_TP1 pid=40135) INFO 09-14 13:17:25 [gptq_marlin.py:269] Using BitBLASLinearKernel for GPTQMarlinLinearMethod
(Worker_TP1 pid=40135) torch_dtype is deprecated! Use dtype instead!
(Worker_TP2 pid=40136) INFO 09-14 13:17:25 [gptq_marlin.py:269] Using BitBLASLinearKernel for GPTQMarlinLinearMethod
(Worker_TP1 pid=40135) INFO 09-14 13:17:25 [gptq_marlin.py:269] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(Worker_TP2 pid=40136) torch_dtype is deprecated! Use dtype instead!
(Worker_TP2 pid=40136) INFO 09-14 13:17:25 [gptq_marlin.py:269] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(Worker_TP0 pid=40134) INFO 09-14 13:17:25 [gptq_marlin.py:269] Using BitBLASLinearKernel for GPTQMarlinLinearMethod
(Worker_TP0 pid=40134) torch_dtype is deprecated! Use dtype instead!
(Worker_TP3 pid=40137) INFO 09-14 13:17:25 [gptq_marlin.py:269] Using BitBLASLinearKernel for GPTQMarlinLinearMethod
(Worker_TP0 pid=40134) INFO 09-14 13:17:25 [gptq_marlin.py:269] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(Worker_TP3 pid=40137) torch_dtype is deprecated! Use dtype instead!
(Worker_TP3 pid=40137) INFO 09-14 13:17:25 [gptq_marlin.py:269] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] WorkerProc failed to start.
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] Traceback (most recent call last):
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 559, in worker_main
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] worker = WorkerProc(*args, **kwargs)
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 427, in __init__
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] self.worker.load_model()
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 213, in load_model
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2371, in load_model
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] self.model = model_loader.load_model(
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 45, in load_model
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] model = initialize_model(vllm_config=vllm_config,
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 64, in initialize_model
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] return model_class(vllm_config=vllm_config, prefix=prefix)
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 1079, in __init__
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] self.model = Qwen3NextModel(vllm_config=vllm_config,
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 199, in __init__
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 915, in __init__
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] self.start_layer, self.end_layer, self.layers = make_layers(
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 643, in make_layers
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 904, in get_layer
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] return Qwen3NextDecoderLayer(
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 782, in __init__
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] self.mlp = Qwen3NextSparseMoeBlock(
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 115, in __init__
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] self.experts = FusedMoE(num_experts=self.n_routed_experts,
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 909, in __init__
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] else quant_config.get_quant_method(self, prefix))
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/auto_round.py", line 386, in get_quant_method
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] return self.apply_gptq_quant_layer(layer, prefix)
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/auto_round.py", line 330, in apply_gptq_quant_layer
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] return GPTQMarlinMoEMethod(quant_args_marlin, layer.moe)
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1962, in __getattr__
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] raise AttributeError(
(Worker_TP1 pid=40135) ERROR 09-14 13:17:26 [multiproc_executor.py:585] AttributeError: 'FusedMoE' object has no attribute 'moe'
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] WorkerProc failed to start.
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] Traceback (most recent call last):
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 559, in worker_main
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] worker = WorkerProc(*args, **kwargs)
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 427, in __init__
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] self.worker.load_model()
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 213, in load_model
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2371, in load_model
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] self.model = model_loader.load_model(
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 45, in load_model
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] model = initialize_model(vllm_config=vllm_config,
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 64, in initialize_model
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] return model_class(vllm_config=vllm_config, prefix=prefix)
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 1079, in __init__
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] self.model = Qwen3NextModel(vllm_config=vllm_config,
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 199, in __init__
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 915, in __init__
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] self.start_layer, self.end_layer, self.layers = make_layers(
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 643, in make_layers
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 904, in get_layer
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] return Qwen3NextDecoderLayer(
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 782, in __init__
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] self.mlp = Qwen3NextSparseMoeBlock(
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 115, in __init__
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] self.experts = FusedMoE(num_experts=self.n_routed_experts,
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 909, in __init__
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] else quant_config.get_quant_method(self, prefix))
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/auto_round.py", line 386, in get_quant_method
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] return self.apply_gptq_quant_layer(layer, prefix)
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/auto_round.py", line 330, in apply_gptq_quant_layer
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] return GPTQMarlinMoEMethod(quant_args_marlin, layer.moe)
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] ^^^^^^^^^
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1962, in __getattr__
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] raise AttributeError(
(Worker_TP2 pid=40136) ERROR 09-14 13:17:26 [multiproc_executor.py:585] AttributeError: 'FusedMoE' object has no attribute 'moe'
(Worker_TP2 pid=40136) INFO 09-14 13:17:26 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP1 pid=40135) INFO 09-14 13:17:26 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP3 pid=40137) INFO 09-14 13:17:26 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP0 pid=40134) INFO 09-14 13:17:26 [multiproc_executor.py:546] Parent process exited, terminating worker
[rank0]:[W914 13:17:27.065687476 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] EngineCore failed to start.
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] Traceback (most recent call last):
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 505, in __init__
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 82, in __init__
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] self._init_executor()
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 99, in _init_executor
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 497, in wait_for_ready
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] raise e from None
(EngineCore_DP0 pid=40052) ERROR 09-14 13:17:28 [core.py:718] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(EngineCore_DP0 pid=40052) Process EngineCore_DP0:
(EngineCore_DP0 pid=40052) Traceback (most recent call last):
(EngineCore_DP0 pid=40052) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=40052) self.run()
(EngineCore_DP0 pid=40052) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=40052) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=40052) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 722, in run_engine_core
(EngineCore_DP0 pid=40052) raise e
(EngineCore_DP0 pid=40052) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_DP0 pid=40052) engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=40052) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=40052) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 505, in __init__
(EngineCore_DP0 pid=40052) super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=40052) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 82, in __init__
(EngineCore_DP0 pid=40052) self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=40052) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=40052) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=40052) self._init_executor()
(EngineCore_DP0 pid=40052) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 99, in _init_executor
(EngineCore_DP0 pid=40052) self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=40052) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=40052) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 497, in wait_for_ready
(EngineCore_DP0 pid=40052) raise e from None
(EngineCore_DP0 pid=40052) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(APIServer pid=39871) Traceback (most recent call last):
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/bin/vllm", line 8, in <module>
(APIServer pid=39871) sys.exit(main())
(APIServer pid=39871) ^^^^^^
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 54, in main
(APIServer pid=39871) args.dispatch_function(args)
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 50, in cmd
(APIServer pid=39871) uvloop.run(run_server(args))
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
(APIServer pid=39871) return __asyncio.run(
(APIServer pid=39871) ^^^^^^^^^^^^^^
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=39871) return runner.run(main)
(APIServer pid=39871) ^^^^^^^^^^^^^^^^
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=39871) return self._loop.run_until_complete(task)
(APIServer pid=39871) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=39871) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=39871) return await main
(APIServer pid=39871) ^^^^^^^^^^
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1941, in run_server
(APIServer pid=39871) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1961, in run_server_worker
(APIServer pid=39871) async with build_async_engine_client(
(APIServer pid=39871) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=39871) return await anext(self.gen)
(APIServer pid=39871) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 179, in build_async_engine_client
(APIServer pid=39871) async with build_async_engine_client_from_engine_args(
(APIServer pid=39871) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=39871) return await anext(self.gen)
(APIServer pid=39871) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 221, in build_async_engine_client_from_engine_args
(APIServer pid=39871) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=39871) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/utils/__init__.py", line 1589, in inner
(APIServer pid=39871) return fn(*args, **kwargs)
(APIServer pid=39871) ^^^^^^^^^^^^^^^^^^^
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 212, in from_vllm_config
(APIServer pid=39871) return cls(
(APIServer pid=39871) ^^^^
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 136, in __init__
(APIServer pid=39871) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=39871) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 102, in make_async_mp_client
(APIServer pid=39871) return AsyncMPClient(*client_args)
(APIServer pid=39871) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 769, in __init__
(APIServer pid=39871) super().__init__(
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 448, in __init__
(APIServer pid=39871) with launch_core_engines(vllm_config, executor_class,
(APIServer pid=39871) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=39871) next(self.gen)
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 729, in launch_core_engines
(APIServer pid=39871) wait_for_engine_startup(
(APIServer pid=39871) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 782, in wait_for_engine_startup
(APIServer pid=39871) raise RuntimeError("Engine core initialization failed. "
(APIServer pid=39871) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/home/deaf/miniconda3/envs/vllm/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '

P.S.
(vllm) deaf@rtxserver:~$ vllm -v
INFO 09-14 13:21:47 [__init__.py:216] Automatically detected platform cuda.
0.10.2
(vllm) deaf@rtxserver:~$ uname -a
Linux rtxserver 6.8.0-79-generic #79-Ubuntu SMP PREEMPT_DYNAMIC Tue Aug 12 14:42:46 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

It seems something is wrong with the network definition. Am I missing something?
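
For reference, the traceback points at how the quantization method is resolved for the MoE layer rather than at the layer stack itself: FusedMoE.__init__ calls quant_config.get_quant_method(self, prefix), and the AutoRound path (auto_round.py, apply_gptq_quant_layer) immediately reads layer.moe, an attribute this FusedMoE object does not have at that point. Below is a minimal sketch (not vLLM code; the two classes are simplified stand-ins) that reproduces the same AttributeError, assuming that ordering is the cause:

import torch.nn as nn

class QuantConfig:
    def get_quant_method(self, layer, prefix):
        # auto_round.apply_gptq_quant_layer does roughly:
        #   GPTQMarlinMoEMethod(quant_args_marlin, layer.moe)
        # i.e. it expects the layer to already carry a `moe` config object.
        return layer.moe

class FusedMoE(nn.Module):
    def __init__(self, quant_config, prefix=""):
        super().__init__()
        # The quant method is resolved while __init__ is still running and no
        # `moe` attribute has been set yet, so nn.Module.__getattr__ raises.
        self.quant_method = quant_config.get_quant_method(self, prefix)

try:
    FusedMoE(QuantConfig())
except AttributeError as e:
    print(e)  # 'FusedMoE' object has no attribute 'moe'

If that reading is right, this is a mismatch between the FusedMoE layer and the auto-round quantization code in this vLLM build rather than a problem with the checkpoint itself, so waiting for a newer vLLM release (or a build of main where the two agree) seems reasonable.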

Intel org

Thank you. I will just wait for the next vLLM release and retest this.
