Spatial-SSRL-7B

📖 Paper | 🏠 GitHub | 🤗 Spatial-SSRL-7B Model | 🤗 Spatial-SSRL-81k Dataset | 📰 Daily Paper

Spatial-SSRL-7B is a large vision-language model targeting spatial understanding, built on Qwen2.5-VL-7B. It is optimized with Spatial-SSRL, a lightweight self-supervised reinforcement learning paradigm that scales RLVR efficiently. The model demonstrates strong spatial intelligence while preserving the general visual capabilities of the base model.

📢 News

🌈 Overview

We are thrilled to introduce Spatial-SSRL, a novel self-supervised RL paradigm aimed at enhancing LVLM spatial understanding. By optimizing Qwen2.5-VL-7B with Spatial-SSRL, the model exhibits stronger spatial intelligence across seven spatial understanding benchmarks in both image and video settings.

(Figure: teaser)

Spatial-SSRL is a lightweight, tool-free framework that is naturally compatible with the RLVR training paradigm and easy to extend to a multitude of pretext tasks. Five tasks are currently formulated in the framework, requiring only ordinary RGB and RGB-D images. We welcome contributions of further effective pretext tasks to strengthen the capabilities of LVLMs!

(Figure: Spatial-SSRL pipeline)
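
To make the data-curation idea concrete, here is a minimal, purely illustrative sketch of how a pretext sample could be built from an ordinary RGB-D pair. The depth-comparison task, the function name make_depth_comparison_sample, the 20% depth-gap threshold, and the question wording are our own illustrative assumptions, not the exact task definitions used in Spatial-SSRL.

# Illustrative sketch only: a hypothetical depth-comparison pretext sample curated from a
# raw RGB-D pair. The answer is read directly from the depth map, so the supervisory
# signal is intrinsic to the data -- no human labels and no external tools are involved.
import random

import numpy as np
from PIL import Image


def make_depth_comparison_sample(rgb_path: str, depth_path: str) -> dict:
    """Sample two pixels and ask which of them is closer to the camera."""
    rgb = Image.open(rgb_path).convert("RGB")
    depth = np.array(Image.open(depth_path), dtype=np.float32)  # single-channel depth map
    h, w = depth.shape[:2]

    # Pick two random pixels whose depths clearly differ, to avoid ambiguous questions
    # (assumes the scene has sufficient depth variation).
    while True:
        (x1, y1), (x2, y2) = [(random.randrange(w), random.randrange(h)) for _ in range(2)]
        d1, d2 = depth[y1, x1], depth[y2, x2]
        if d1 > 0 and d2 > 0 and abs(d1 - d2) / max(d1, d2) > 0.2:
            break

    question = (
        f"Consider the two marked points at ({x1}, {y1}) and ({x2}, {y2}). "
        "Which point is closer to the camera? A. the first point B. the second point"
    )
    answer = "A" if d1 < d2 else "B"  # a smaller depth value means the point is closer
    return {"image": rgb, "question": question, "answer": answer}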

💡 Highlights

  • 🔥 Highly Scalable: Spatial-SSRL uses ordinary raw RGB and RGB-D images instead of richly annotated public datasets or manual labels for data curation, making it highly scalable.
  • 🔥 Cost-effective: The entire pipeline needs neither human labels nor API calls to general LVLMs, which makes Spatial-SSRL cost-effective.
  • 🔥 Lightweight: Prior approaches to spatial understanding rely heavily on annotations from external tools, which introduce errors into the training data and add cost. In contrast, Spatial-SSRL is completely tool-free and can easily be extended to more self-supervised tasks.
  • 🔥 Naturally Verifiable: Intrinsic supervisory signals determined by the pretext objectives are naturally verifiable, aligning Spatial-SSRL well with the RLVR paradigm (see the reward sketch below).

(Figure: teaser)
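
Because each pretext objective determines its own ground truth, checking a rollout reduces to a simple string comparison. Below is a minimal sketch of such a rule-based verifiable reward, assuming the \boxed{...} answer format used in the usage snippet later in this card; the exact reward formulation in Spatial-SSRL (e.g., any additional format terms) may differ.

# Minimal sketch of a rule-based verifiable reward in the RLVR style. It assumes the
# final answer is wrapped in \boxed{...}; the actual Spatial-SSRL reward may differ.
import re


def verifiable_reward(response: str, ground_truth: str) -> float:
    """Return 1.0 if the rollout's boxed answer matches the intrinsic label, else 0.0."""
    match = re.search(r"\\boxed\{(.+?)\}", response, flags=re.DOTALL)
    if match is None:
        return 0.0  # unparsable output earns no reward
    predicted = match.group(1).strip()
    return 1.0 if predicted.lower() == ground_truth.strip().lower() else 0.0


# A correct rollout earns 1.0; a malformed or wrong one earns 0.0
print(verifiable_reward("<think>The kite floats above the roof.</think> \\boxed{A}", "A"))  # 1.0
print(verifiable_reward("The answer is A.", "A"))                                           # 0.0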

📊 Results

We train Qwen2.5-VL-3B and Qwen2.5-VL-7B with our Spatial-SSRL paradigm; the experimental results across seven spatial understanding benchmarks are shown below.

(Figure: results across seven spatial understanding benchmarks)

๐Ÿ› ๏ธ Usage

To directly experience Spatial-SSRL-7B, you can try it out on 🤗 Spatial-SSRL Space!

Here we provide a code snippet for you to start a simple trial of Spatial-SSRL-7B on your own device. You can download the model from 🤗 Spatial-SSRL-7B Model before your trial!

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "internlm/Spatial-SSRL-7B"  # change to your own local path if you have already downloaded the model
img_path = "examples/eg1.jpg"
question = "Consider the real-world 3D locations of the objects. Which object has a higher location? A. yellow bear kite B. building"
# We recommend appending the format prompt so that inference stays consistent with training
format_prompt = "\n You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in \\boxed{}."

# Load the model and processor; device_map="auto" places the weights on available GPUs
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)
# Build a single-turn chat message containing the image and the question
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": img_path,
            },
            {"type": "text", "text": question + format_prompt},
        ],
    }
]

# Apply the chat template and preprocess the vision inputs
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)  # move the inputs onto the same device as the model

# Greedy decoding; then strip the prompt tokens so only the new response is decoded
generated_ids = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Model Response:", output_text)
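
Since the format prompt asks the model to put its final answer inside \boxed{...}, the predicted choice can be recovered programmatically. The helper below is an illustrative addition (not part of the official snippet) and assumes the model followed the requested output format.

# Illustrative helper: pull the final answer out of the \boxed{...} span produced
# under the format prompt above; returns None if the model ignored the format.
import re
from typing import Optional


def extract_boxed_answer(response: str) -> Optional[str]:
    match = re.search(r"\\boxed\{(.+?)\}", response, flags=re.DOTALL)
    return match.group(1).strip() if match else None


print("Final answer:", extract_boxed_answer(output_text[0]))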

Cases

(Figures: qualitative cases)

โœ’๏ธCitation

If you find our model useful, please kindly cite:

@article{liu2025spatialssrl,
  title={Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning}, 
  author={Liu, Yuhong and Zhang, Beichen and Zang, Yuhang and Cao, Yuhang and Xing, Long and Dong, Xiaoyi and Duan, Haodong and Lin, Dahua and Wang, Jiaqi},
  journal={arXiv preprint arXiv:2510.27606},
  year={2025}
}

📄 License


Usage and License Notices: The data and code are intended and licensed for research use only.
