It looks like it does not work well with vllm 0.10.1.1:
uv run vllm serve /data/base/llm-models/Seed-X-PPO-7B --served-model-name Seed-X-PPO-7B --host 0.0.0.0 --port 10004 --gpu-memory-utilization 0.6 --enable-log-requests
WARNING 08-24 10:01:27 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 1, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
INFO 08-24 10:01:29 [__init__.py:241] Automatically detected platform cuda.
(APIServer pid=3437053) INFO 08-24 10:01:31 [api_server.py:1805] vLLM API server version 0.10.1.1
(APIServer pid=3437053) INFO 08-24 10:01:31 [utils.py:326] non-default args: {'model_tag': '/data/base/llm-models/Seed-X-PPO-7B', 'host': '0.0.0.0', 'port': 10004, 'model': '/data/base/llm-models/Seed-X-PPO-7B', 'served_model_name': ['Seed-X-PPO-7B'], 'gpu_memory_utilization': 0.6, 'enable_log_requests': True}
(APIServer pid=3437053) INFO 08-24 10:01:37 [__init__.py:711] Resolved architecture: MistralForCausalLM
(APIServer pid=3437053) INFO 08-24 10:01:37 [__init__.py:1750] Using max model len 32768
(APIServer pid=3437053) INFO 08-24 10:01:37 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=3437053) Traceback (most recent call last):
(APIServer pid=3437053) File "/data/base/vllm-project/.venv/bin/vllm", line 10, in <module>
(APIServer pid=3437053) sys.exit(main())
(APIServer pid=3437053) ^^^^^^
(APIServer pid=3437053) File "/data/base/vllm-project/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 54, in main
(APIServer pid=3437053) args.dispatch_function(args)
(APIServer pid=3437053) File "/data/base/vllm-project/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 50, in cmd
(APIServer pid=3437053) uvloop.run(run_server(args))
(APIServer pid=3437053) File "/data/base/vllm-project/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
(APIServer pid=3437053) return __asyncio.run(
(APIServer pid=3437053) ^^^^^^^^^^^^^^
(APIServer pid=3437053) File "/miniconda/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=3437053) return runner.run(main)
(APIServer pid=3437053) ^^^^^^^^^^^^^^^^
(APIServer pid=3437053) File "/miniconda/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=3437053) return self._loop.run_until_complete(task)
(APIServer pid=3437053) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3437053) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=3437053) File "/data/base/vllm-project/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=3437053) return await main
(APIServer pid=3437053) ^^^^^^^^^^
(APIServer pid=3437053) File "/data/base/vllm-project/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1850, in run_server
(APIServer pid=3437053) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=3437053) File "/data/base/vllm-project/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1870, in run_server_worker
(APIServer pid=3437053) async with build_async_engine_client(
(APIServer pid=3437053) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3437053) File "/miniconda/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=3437053) return await anext(self.gen)
(APIServer pid=3437053) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3437053) File "/data/base/vllm-project/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client
(APIServer pid=3437053) async with build_async_engine_client_from_engine_args(
(APIServer pid=3437053) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3437053) File "/miniconda/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=3437053) return await anext(self.gen)
(APIServer pid=3437053) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3437053) File "/data/base/vllm-project/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 220, in build_async_engine_client_from_engine_args
(APIServer pid=3437053) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=3437053) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3437053) File "/data/base/vllm-project/.venv/lib/python3.12/site-packages/vllm/utils/__init__.py", line 1557, in inner
(APIServer pid=3437053) return fn(*args, **kwargs)
(APIServer pid=3437053) ^^^^^^^^^^^^^^^^^^^
(APIServer pid=3437053) File "/data/base/vllm-project/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 174, in from_vllm_config
(APIServer pid=3437053) return cls(
(APIServer pid=3437053) ^^^^
(APIServer pid=3437053) File "/data/base/vllm-project/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 103, in __init__
(APIServer pid=3437053) self.tokenizer = init_tokenizer_from_configs(
(APIServer pid=3437053) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3437053) File "/data/base/vllm-project/.venv/lib/python3.12/site-packages/vllm/transformers_utils/tokenizer_group.py", line 121, in init_tokenizer_from_configs
(APIServer pid=3437053) return TokenizerGroup(
(APIServer pid=3437053) ^^^^^^^^^^^^^^^
(APIServer pid=3437053) File "/data/base/vllm-project/.venv/lib/python3.12/site-packages/vllm/transformers_utils/tokenizer_group.py", line 26, in __init__
(APIServer pid=3437053) self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
(APIServer pid=3437053) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3437053) File "/data/base/vllm-project/.venv/lib/python3.12/site-packages/vllm/transformers_utils/tokenizer.py", line 260, in get_tokenizer
(APIServer pid=3437053) raise e
(APIServer pid=3437053) File "/data/base/vllm-project/.venv/lib/python3.12/site-packages/vllm/transformers_utils/tokenizer.py", line 239, in get_tokenizer
(APIServer pid=3437053) tokenizer = AutoTokenizer.from_pretrained(
(APIServer pid=3437053) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3437053) File "/data/base/vllm-project/.venv/lib/python3.12/site-packages/transformers/models/auto/tokenization_auto.py", line 1138, in from_pretrained
(APIServer pid=3437053) return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
(APIServer pid=3437053) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=3437053) File "/data/base/vllm-project/.venv/lib/python3.12/site-packages/transformers/tokenization_mistral_common.py", line 1762, in from_pretrained
(APIServer pid=3437053) raise ValueError(
(APIServer pid=3437053) ValueError: Kwargs ['max_loras', '_from_auto'] are not supported by `MistralCommonTokenizer.from_pretrained`.
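For what it's worth, the failure can be reproduced outside the server with a minimal sketch like the one below. The model path is the local one from the command above, and `max_loras=1` just stands in for the engine kwargs that vLLM forwards (the `_from_auto` entry in the error is added along the AutoTokenizer path):

```python
# Minimal reproduction sketch (assumes the same local model path as above).
from transformers import AutoTokenizer

# In this environment, AutoTokenizer resolves this checkpoint to
# MistralCommonTokenizer, whose from_pretrained() rejects unknown kwargs
# instead of ignoring them, so passing an engine kwarg such as `max_loras`
# raises the same ValueError shown in the traceback above.
AutoTokenizer.from_pretrained(
    "/data/base/llm-models/Seed-X-PPO-7B",
    max_loras=1,  # representative unsupported kwarg
)
```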
Hi, we recommend using vllm==0.8.0 and transformers==4.51.3.
At the moment, that version combination still does not work for me (I have run into a mess of PyTorch version compatibility issues).
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm==0.8.0 transformers==4.51.3 --torch-backend=cu126
Nevertheless, I appreciate your response, and I’ll look for another solution.
Hi, any news on whether this issue has been fixed?
vllm==0.10.0 and transformers==4.53.2 work on a 5090.
Yes, it works with uv pip install vllm==0.10 transformers==4.53.2 --torch-backend=cu126, but vllm==0.8 and transformers==4.51.3 do not work!
I found some details; if you are interested, see:
https://github.com/vllm-project/vllm/issues/21653#issuecomment-3434094953
So the problem can be worked around by setting tokenizer_class = "LlamaTokenizer" in a newly created tokenizer_config.json (a minimal sketch is below).
Of course, this workaround should be verified to make sure it does not cause performance degradation.
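A minimal sketch of that workaround, assuming a local copy of the model and that writing a tokenizer_config.json next to the weights is enough (only the tokenizer_class field comes from the linked comment; the path is the local one from the log above):

```python
# Sketch of the workaround: write a tokenizer_config.json into the model
# directory so AutoTokenizer resolves LlamaTokenizer instead of
# MistralCommonTokenizer.
import json
from pathlib import Path

model_dir = Path("/data/base/llm-models/Seed-X-PPO-7B")  # local model path
(model_dir / "tokenizer_config.json").write_text(
    json.dumps({"tokenizer_class": "LlamaTokenizer"}, indent=2)
)
```

After writing the file, restart the same vllm serve command as above.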