Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
[📂 GitHub] [📜 Sa2VA paper] [🚀 Quick Start]
Introduction
Sa2VA is a multimodal large language model (MLLM) capable of question answering, visual prompt understanding, and dense object segmentation at both the image and video level. On question-answering benchmarks, it performs comparably to the SOTA MLLMs Qwen2.5-VL and InternVL3, while also offering the visual prompt understanding and dense object segmentation capabilities that those models lack. Sa2VA achieves SOTA performance on both image and video grounding and segmentation benchmarks.
Sa2VA Family
The Sa2VA series is built on Qwen2.5/3-VL and InternVL2.5/3. The following table lists the Sa2VA models built on Qwen2.5/3-VL and InternVL3.
| Model Name | Base MLLM | Language Part | HF Link | 
|---|---|---|---|
| Sa2VA-InternVL3-2B | InternVL3-2B | Qwen2.5-1.5B | 🤗 link | 
| Sa2VA-InternVL3-8B | InternVL3-8B | Qwen2.5-7B | 🤗 link | 
| Sa2VA-InternVL3-14B | InternVL3-14B | Qwen2.5-14B | 🤗 link | 
| Sa2VA-Qwen2_5-VL-3B | Qwen2.5-VL-3B-Instruct | Qwen2.5-3B | 🤗 link | 
| Sa2VA-Qwen2_5-VL-7B | Qwen2.5-VL-7B-Instruct | Qwen2.5-7B | 🤗 link | 
| Sa2VA-Qwen3-VL-4B | Qwen3-VL-4B-Instruct | Qwen3-4B | 🤗 link | 
Sa2VA Performance
| Model Name | MME | MMBench | RefCOCO | RefCOCO+ | RefCOCOg | MeVIS (val_u) | DAVIS | 
|---|---|---|---|---|---|---|---|
| Sa2VA-InternVL3-2B | 1631/559 | 79.8 | 81.4 | 75.7 | 80.3 | 53.9 | 74.5 | 
| Sa2VA-InternVL3-8B | 1743/633 | 83.0 | 83.3 | 78.9 | 81.8 | 56.4 | 76.3 | 
| Sa2VA-InternVL3-14B | 1746/724 | 84.3 | 83.6 | 79.9 | 83.6 | 59.2 | 76.6 | 
| Sa2VA-Qwen2_5-VL-3B | 1533/572 | 78.4 | 79.6 | 74.0 | 77.1 | 51.6 | 73.4 | 
| Sa2VA-Qwen2_5-VL-7B | 1552/676 | 84.5 | 82.4 | 77.5 | 81.5 | 56.4 | 79.4 | 
| Sa2VA-Qwen3-VL-4B | 1660/655 | 86.3 | 81.7 | 77.4 | 80.0 | 57.1 | 75.9 | 
Quick Start
We provide example code for running Sa2VA with transformers.
import torch
from transformers import AutoProcessor, AutoModel
from PIL import Image
import numpy as np
import os
# load the model and processor
path = "ByteDance/Sa2VA-Qwen3-VL-4B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()
processor = AutoProcessor.from_pretrained(path, trust_remote_code=True, use_fast=False)
# for image chat
image_path = "/PATH/TO/IMAGE"
text_prompts = "<image>Please describe the image."
image = Image.open(image_path).convert('RGB')
input_dict = {
    'image': image,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': None,
    'processor': processor,
    }
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"] # the text format answer
# for image chat with segmentation output
image_path = "/PATH/TO/IMAGE"
text_prompts = "<image>Could you please give me a brief description of the image? Please respond with interleaved segmentation masks for the corresponding parts of the answer."
image = Image.open(image_path).convert('RGB')
input_dict = {
    'image': image,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': None,
    'processor': processor,
    }
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"] # the text format answer
masks = return_dict['prediction_masks']  # segmentation masks, list(np.array(1, h, w), ...)
    
# for chat with visual prompt (mask format) input
mask_prompts = np.load('/PATH/TO/pred_masks.npy') # np.array(n_prompts, h, w)
image_path = "/PATH/TO/IMAGE"
text_prompts = "<image>Can you provide me with a detailed description of the region in the picture marked by region1."
image = Image.open(image_path).convert('RGB')
input_dict = {
    'image': image,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': mask_prompts,
    'processor': processor,
    }
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"] # the text format answer
# for video chat
video_folder = "/PATH/TO/VIDEO_FOLDER"
# sort so the frames are in temporal order (assumes zero-padded file names)
images_paths = sorted(os.listdir(video_folder))
images_paths = [os.path.join(video_folder, image_name) for image_name in images_paths]
if len(images_paths) > 5:  # uniformly sample 5 frames, always keeping the first and last
    indices = np.linspace(0, len(images_paths) - 1, 5).round().astype(int)
    images_paths = [images_paths[i] for i in indices]
text_prompts = "<image>Please describe the video."
input_dict = {
    'video': images_paths,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': None,
    'processor': processor,
}
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"] # the text format answer
# for video chat with segmentation mask output
video_folder = "/PATH/TO/VIDEO_FOLDER"
# sort so the frames are in temporal order (assumes zero-padded file names)
images_paths = sorted(os.listdir(video_folder))
images_paths = [os.path.join(video_folder, image_name) for image_name in images_paths]
text_prompts = "<image>Please segment the person."
input_dict = {
    'video': images_paths,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': None,
    'processor': processor,
}
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"] # the text format answer
masks = return_dict['prediction_masks']  # segmentation masks, list(np.array(n_frames, h, w), ...)
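The segmentation examples above return masks as a list with one array per segmented object, shaped `(1, h, w)` for images and `(n_frames, h, w)` for videos. As a minimal sketch of how such a mask might be inspected, the snippet below tints the masked pixels of a frame. The helper `overlay_mask` is hypothetical and not part of Sa2VA, the frame and mask are synthetic stand-ins so the snippet runs without the model, and it assumes the masks are binary (or castable to boolean) and match the frame resolution.

```python
import numpy as np
from PIL import Image


def overlay_mask(frame, mask, color=(255, 0, 0), alpha=0.5):
    """Blend a binary (h, w) mask onto a PIL image and return a new RGB image.

    Hypothetical helper for illustration; not part of the Sa2VA API.
    """
    rgb = np.array(frame.convert("RGB"), dtype=np.float32)  # writable copy
    mask = np.asarray(mask).astype(bool)
    if mask.shape != rgb.shape[:2]:
        raise ValueError(
            f"mask shape {mask.shape} does not match image shape {rgb.shape[:2]}"
        )
    tint = np.array(color, dtype=np.float32)
    # Alpha-blend the tint colour into the masked pixels only.
    rgb[mask] = (1.0 - alpha) * rgb[mask] + alpha * tint
    return Image.fromarray(rgb.round().astype(np.uint8))


# Synthetic stand-ins for one frame and for one entry of
# return_dict['prediction_masks'] (assumed layout: n_frames, h, w).
frame = Image.new("RGB", (8, 6), (0, 0, 0))      # width=8, height=6
object_masks = np.zeros((1, 6, 8), dtype=bool)
object_masks[0, 2:4, 3:6] = True

blended = overlay_mask(frame, object_masks[0])
# blended.save("overlay_frame0.png")  # uncomment to write the result to disk
```

With real outputs, one would presumably loop over `masks` and over the frame axis of each array, opening the matching entry of `images_paths` for each frame.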
Citation
If you find this project useful in your research, please consider citing:
@article{sa2va,
  title={Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos},
  author={Yuan, Haobo and Li, Xiangtai and Zhang, Tao and Huang, Zilong and Xu, Shilin and Ji, Shunping and Tong, Yunhai and Qi, Lu and Feng, Jiashi and Yang, Ming-Hsuan},
  journal={arXiv preprint},
  year={2025}
}
