SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward
The model was presented in the paper SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward.
Paper abstract
The abstract of the paper is the following:
Recent advances have shown success in eliciting strong reasoning abilities in multimodal large language models (MLLMs) through rule-based reinforcement learning (RL) with outcome rewards. However, this paradigm typically lacks supervision over the thinking process leading to the final answer. As a result, the model may learn sub-optimal reasoning strategies, which can hinder its generalization ability. In light of this, we propose SophiaVL-R1, as an attempt to add reward signals for the thinking process in this paradigm. To achieve this, we first train a thinking reward model that evaluates the quality of the entire thinking process. Given that the thinking reward may be unreliable for certain samples due to reward hacking, we propose the Trust-GRPO method, which assigns a trustworthiness weight to the thinking reward during training. This weight is computed based on the thinking reward comparison of responses leading to correct answers versus incorrect answers, helping to mitigate the impact of potentially unreliable thinking rewards. Moreover, we design an annealing training strategy that gradually reduces the thinking reward over time, allowing the model to rely more on the accurate rule-based outcome reward in later training stages. Experiments show that our SophiaVL-R1 surpasses a series of reasoning MLLMs on various benchmarks (e.g., MathVista, MMMU), demonstrating strong reasoning and generalization capabilities. Notably, our SophiaVL-R1-7B even outperforms LLaVA-OneVision-72B on most benchmarks, despite the latter having 10 times more parameters. All code, models, and datasets are made publicly available at https://github.com/kxfan2002/SophiaVL-R1.
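As a rough illustration of the reward design described above, the sketch below combines the rule-based outcome reward with a trust-weighted, annealed thinking reward. The helper names, the group-level heuristic for the trust weight, and the linear annealing schedule are illustrative assumptions only; the exact Trust-GRPO formulation is defined in the paper.

def estimate_trust_weight(thinking_rewards, is_correct):
    # Compare the average thinking reward of responses that reached the correct
    # answer against those that did not. If correct responses are not rewarded
    # more highly, the thinking reward model may be getting hacked on this sample,
    # so its influence is reduced. (Illustrative heuristic, not the paper's formula.)
    correct = [r for r, c in zip(thinking_rewards, is_correct) if c]
    incorrect = [r for r, c in zip(thinking_rewards, is_correct) if not c]
    if not correct or not incorrect:
        return 1.0
    gap = sum(correct) / len(correct) - sum(incorrect) / len(incorrect)
    return min(1.0, max(0.0, 0.5 + gap))  # clamp to [0, 1]

def combined_reward(outcome_reward, thinking_reward, trust_weight, step, total_steps):
    # Annealing: the thinking reward's contribution decays over training, so later
    # updates rely mostly on the accurate rule-based outcome reward.
    anneal = max(0.0, 1.0 - step / total_steps)
    return outcome_reward + anneal * trust_weight * thinking_reward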
Current model card
The README of the model repository currently looks like this:
Metadata
license: apache-2.0
Content
This is the repository for SophiaVL-R1-7B (https://arxiv.org/abs/2505.17018).
For training and evaluation, please refer to the code repository: SophiaVL-R1 (https://github.com/kxfan2002/SophiaVL-R1).
A simple inference example:
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
MODEL_PATH = "bunny127/SophiaVL-R1-7B"
# Example usage:
# {
# "problem_id": 1,
# "problem": "Subtract 0 cyan cubes. How many objects are left?",
# "data_type": "image",
# "problem_type": "numerical",
# "options": [],
# "process": "",
# "solution": "<answer>5</answer>",
# "path": "./Math/CLEVR-Math/images/CLEVR_train_036427.png",
# "data_source": "CLEVR-Math"
# },
image_path = "/path/to/dataset/Math/CLEVR-Math/images/CLEVR_train_036427.png"
prompt = "Subtract 0 cyan cubes. How many objects are left?"
question_type = "numerical"
# Note: attn_implementation="flash_attention_2" requires the flash-attn package; remove it to use the default attention.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(MODEL_PATH, torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2", device_map="auto")
processor = AutoProcessor.from_pretrained(MODEL_PATH)
SYS_PROMPT = """You FIRST think about the reasoning process as an internal monologue and then provide the final answer.
The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE enclosed within <answer> </answer> tags, for example <think>your_thinking_process</think><answer>your_final_answer</answer>. If you use formula, please use LaTeX format."""
QUESTION_TEMPLATE = (
    "{Question}\n"
    "Please think about this question as if you were a human pondering deeply. "
    "Engage in an internal dialogue using expressions such as 'let me think', 'wait', 'Hmm', 'oh, I see', 'let's break it down', etc, or other natural language thought expressions "
    "It's encouraged to include self-reflection or verification in the reasoning process. "
    "Provide your detailed reasoning between the <think> and </think> tags, and then give your final answer between the <answer> and </answer> tags."
)
TYPE_TEMPLATE = {
"multiple choice": " Please provide only the single option letter (e.g., A, B, C, D, etc.) within the <answer> </answer> tags.",
"numerical": " Please provide the numerical value (e.g., 42 or 3.14) within the <answer> </answer> tags.",
"OCR": " Please transcribe text from the image/video clearly and provide your text answer within the <answer> </answer> tags.",
"free-form": " Please provide your text answer within the <answer> </answer> tags."
}
def inference(image_path, question, problem_type="numerical", sys_prompt="You are a helpful assistant.", max_new_tokens=4096, return_input=False):
    image = Image.open(image_path)
    image_local_path = "file://" + image_path
    messages = [
        {"role": "system", "content": sys_prompt},
        {"role": "user", "content": [
            {"type": "text", "text": QUESTION_TEMPLATE.format(Question=question) + TYPE_TEMPLATE[problem_type]},
            {"image": image_local_path},
        ]},
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    print("text:", text)
    # image_inputs, video_inputs = process_vision_info([messages])
    inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt")
    inputs = inputs.to('cuda')
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated tokens are decoded
    generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
    output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    if return_input:
        return output_text[0], inputs
    else:
        return output_text[0]
response = inference(image_path, prompt, question_type, sys_prompt=SYS_PROMPT, max_new_tokens=2048)
print(response)
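Since the prompt instructs the model to wrap its final answer in <answer> </answer> tags, a small post-processing helper (not part of the released code, shown here only as an example) can pull out the answer string:

import re

def extract_answer(response):
    # Return the text between <answer> and </answer>; fall back to the raw
    # response if the tags are missing.
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return match.group(1).strip() if match else response.strip()

print(extract_answer(response))  # expected "5" for the CLEVR-Math example above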
Project page
The project page is available at: https://github.com/kxfan2002/SophiaVL-R1