Reasoning not parsing correctly within vLLM

#4
by SuperbEmphasis - opened

I have verified mistral-common 1.8.6 is installed. I am running vLLM 0.12.0 in Kubernetes.

However, the response is going into the thinking/reasoning field.

(screenshot of the response)

I also set the flags from the repo exactly:

(APIServer pid=44) INFO 12-03 11:40:29 [entrypoints/utils.py:253] non-default args: {'model_tag': '/models/Ministral-3-14B-Reasoning-2512', 'enable_auto_tool_choice': True, 'tool_call_parser': 'mistral', 'model': '/models/Ministral-3-14B-Reasoning-2512', 'tokenizer_mode': 'mistral', 'trust_remote_code': True, 'max_model_len': 65536, 'served_model_name': ['Ministral-3-14B-Reasoning-2512'], 'config_format': 'mistral', 'load_format': 'mistral', 'reasoning_parser': 'mistral'}
Mistral AI_ org

Hi there, are you using the default system prompt we recommend for reasoning?

Hi pandora-s,

Can we use vllm serve to load the text in SYSTEM_PROMPT.txt, or do we have to send the text from SYSTEM_PROMPT.txt from the client every time?
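For concreteness, this is the kind of per-request call I mean: a minimal sketch assuming a local copy of SYSTEM_PROMPT.txt and a server started with the flags above (the file path, base URL and model name are placeholders, and the maintainer's note below about sending the prompt as chunks rather than one block of text applies here too):

```python
from openai import OpenAI

# Hypothetical local copy of the SYSTEM_PROMPT.txt shipped with the model repo.
with open("SYSTEM_PROMPT.txt", encoding="utf-8") as f:
    SYSTEM_PROMPT = f.read()

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Ministral-3-14B-Reasoning-2512",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},  # sent by the client on every request
        {"role": "user", "content": "How many r's are in 'strawberry'?"},
    ],
)
print(response.choices[0].message)
```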

@pandora-s
Yes, I have tried, although I find it odd that when using chat completions it isn't applying the chat template automatically with those flags set?

I grabbed the default system prompt from the Jinja template (thank you for supplying that, btw), and I tried to copy/paste exactly what is in the .txt file you provided, but the thinking still seems to be... off.

The model seems to be starting with a double quotation mark and still thinking anyway. The closing [/THINK] is also not being handled correctly...
(screenshot of the response)

But ultimately, it would make more sense to be able to host the model via vLLM in an environment where an end user might be using the API. It seems odd that I would need to tell them to use a specific prompt for it to work correctly? Thanks for the response!

Mistral AI_ org
edited 4 days ago

Hey 😊 You're not supposed to pass the system prompt as a block of text:
https://huggingface.co/mistralai/Ministral-3-14B-Reasoning-2512#usage-of-the-model
Could you take a look at the examples and try it out by sending think chunks, please?

Also, we do not use chat templates with vLLM; the chat template is there for the Transformers integration and for libraries that do not support our processing library, mistral-common.
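To illustrate the difference, here is a rough sketch of what that processing looks like when calling mistral-common directly instead of rendering a Jinja chat template; it assumes the mistral-common 1.8.x API, so the exact imports and method names should be checked against the mistral-common documentation:

```python
from mistral_common.protocol.instruct.messages import SystemMessage, UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

# Load the tokenizer definition shipped with the model repo
# (the same files vLLM reads when tokenizer_mode="mistral").
tokenizer = MistralTokenizer.from_hf_hub("mistralai/Ministral-3-14B-Reasoning-2512")

# mistral-common builds and tokenizes the prompt itself; no chat template is involved.
request = ChatCompletionRequest(
    messages=[
        SystemMessage(content="<recommended reasoning system prompt>"),
        UserMessage(content="How many r's are in 'strawberry'?"),
    ]
)
tokenized = tokenizer.encode_chat_completion(request)
print(tokenized.tokens[:20])  # token ids that would be fed to the model
print(tokenized.text)         # the rendered prompt string
```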

It seems odd that I would need to tell them to use a specific prompt for it to work correctly?

Our reasoning models are, for now, sensitive to the system prompt. Regarding your issue, is it possible for you to prepend the system prompt to your end users' messages on your side? That way you don't have to ask them to add it.
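If it helps, one way to do that without changing end-user clients is a thin wrapper in front of vLLM that injects the system prompt whenever the caller did not provide one. A minimal sketch, using a plain-string placeholder for the prompt (in practice it should be the chunked system prompt from the usage example linked above) and hypothetical host and model names:

```python
from openai import OpenAI

# Placeholder: in practice, use the recommended reasoning system prompt,
# sent in the chunked format from the usage example linked above.
RECOMMENDED_SYSTEM_PROMPT = "<recommended reasoning system prompt>"

client = OpenAI(base_url="http://vllm.internal:8000/v1", api_key="EMPTY")

def chat(user_messages: list[dict]):
    """Forward a request to vLLM, injecting the reasoning system prompt
    if the caller did not supply a system message."""
    messages = list(user_messages)
    if not messages or messages[0].get("role") != "system":
        messages.insert(0, {"role": "system", "content": RECOMMENDED_SYSTEM_PROMPT})
    response = client.chat.completions.create(
        model="Ministral-3-14B-Reasoning-2512",
        messages=messages,
    )
    return response.choices[0].message
```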

That's very cumbersome, tbh.
There should be a good user guide on how to set this up properly in vLLM.
I'm sure people won't even notice that reasoning is not set up properly and will just complain on X that the model is bad.

I'm still not sure how it's supposed to work.

Or maybe it only works in streaming mode for now?
When I disable streaming, "content" remains null, and both reasoning and reasoning_content contain the same text.

ChatCompletion(id='chatcmpl-99f7f2fc6a7ae15a', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=None, refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[], reasoning='Okay, I need to use the numbers 2, 5, 6, 3 exactly once, along with the operations +, -, Γ—, Γ·, and parentheses, to make the number 24. Let me start by thinking about possible combinations.\n\nFirst, I\'ll list the numbers: 2, 5, 6, 3.\n\nI need to combine these with operations to get 24. Maybe I can start by trying to multiply two numbers and then add or subtract the others.\n\nLet me try multiplying 6 and 4, but 4 isn\'t there. Maybe 6 Γ— 4 is not directly possible. Wait, perhaps 6 Γ— 3 = 18. Then I have 2 and 5 left. 18 + 5 + 2 = 25, which is not 24. Maybe 18 + 5 - 2 = 21, not 24. Hmm.\n\nWhat if I do 6 Γ— 3 = 18, then 18 + 5 = 23, and 23 + 2 = 25. Not 24.\n\nMaybe I should try dividing. Let\'s see, 24 is divisible by 6, so perhaps 24 / 6 = 4. But how to g...

Same issue.
Even using exactly the same code snippet from https://huggingface.co/mistralai/Ministral-3-14B-Reasoning-2512#usage-of-the-model, except setting streaming=False, results in an empty "content" field: stream.choices[0].message.content is None, everything goes into reasoning_content, and you can't separate the reasoning from the answer.
I checked with temp=0.0: if you just set streaming=False, the text after the reasoning block is simply not present in the response. The reasoning part is the same with streaming on and off.
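In case it's useful to others, here is how the two fields can be separated in streaming mode; a small sketch assuming vLLM's reasoning parser exposes reasoning_content on the streamed deltas (the field is read defensively because the OpenAI client does not declare it), with placeholder URL and model name:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="Ministral-3-14B-Reasoning-2512",
    messages=[{"role": "user", "content": "Use 2, 5, 6, 3 once each to make 24."}],
    stream=True,
)

reasoning_parts, answer_parts = [], []
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # Assumption: the reasoning parser routes the [THINK] text to
    # reasoning_content and the final answer to content.
    reasoning = getattr(delta, "reasoning_content", None)
    if reasoning:
        reasoning_parts.append(reasoning)
    if delta.content:
        answer_parts.append(delta.content)

print("REASONING:\n" + "".join(reasoning_parts))
print("\nANSWER:\n" + "".join(answer_parts))
```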
