Sa2VA-i: Improving Sa2VA Results with Consistent Training and Inference

This model was presented in the paper Sa2VA-i: Improving Sa2VA Results with Consistent Training and Inference.

The code is available at: https://github.com/kumuji/sa2va-i

3rd Place Report of LSVOS 2025 MeViS Track

Alexey Nekrasov¹ · Ali Athar · Daan de Geus² · Alexander Hermans¹ · Bastian Leibe¹

¹RWTH Aachen University · ²Eindhoven University of Technology

*Teaser figure.*

Abstract

Sa2VA is a recent model for language-guided dense grounding in images and video that achieves state-of-the-art results on multiple segmentation benchmarks and that has become widely popular. However, we found that Sa2VA does not perform according to its full potential for referring video object segmentation tasks. We identify inconsistencies between training and inference procedures as the key factor holding it back. To mitigate this issue, we propose an improved version of Sa2VA, Sa2VA-i, that rectifies these issues and improves the results. In fact, Sa2VA-i sets a new state of the art for multiple video benchmarks and achieves improvements of up to +11.6 J&F on MeViS, +1.4 on Ref-YT-VOS, +3.3 on Ref-DAVIS and +4.1 on ReVOS using the same Sa2VA checkpoints. With our fixes, the Sa2VA-i-1B model even performs on par with the original Sa2VA-26B model on the MeViS benchmark. We hope that this work will show the importance of seemingly trivial implementation details and that it will provide valuable insights for the referring video segmentation field. We provide the code and updated models at https://github.com/kumuji/sa2va-i

🚀 Overview

Sa2VA-i is an improved version of the popular Sa2VA model for language-guided dense grounding in images and video. While Sa2VA achieves state-of-the-art results on multiple segmentation benchmarks, we identified critical inconsistencies between its training and inference procedures that kept it from reaching its full potential on referring video object segmentation tasks.

Key improvements in Sa2VA-i:

  • Consistent training and inference - eliminates the incompatibility between the finetuned mask decoder and the frozen memory components of SAM2
  • Improved frame sampling - uniform sampling instead of first-frame sampling
  • Better mask propagation - uses the original SAM2 weights for propagation while keeping the finetuned decoder for the initial predictions
  • Significant performance gains - up to +11.6 J&F on MeViS, +1.4 on Ref-YT-VOS, +3.3 on Ref-DAVIS

📊 Performance Highlights

| Model | MeViS (J&F) | Ref-YT-VOS (J&F) | Ref-DAVIS17 (J&F) |
|-------|-------------|------------------|-------------------|
| Sa2VA-1B | 47.0 | 68.0 | 69.5 |
| Sa2VA-i-1B | 52.6 | 70.3 | 73.6 |
| Sa2VA-4B | 46.4 | 71.3 | 73.7 |
| Sa2VA-i-4B | 56.6 | 73.2 | 78.6 |
| Sa2VA-8B | 51.5 | 72.3 | 75.9 |
| Sa2VA-i-8B | 59.5 | 73.9 | 79.1 |
| Sa2VA-26B | 52.1 | 75.1 | 78.6 |
| Sa2VA-i-26B | 63.2 | 76.5 | 81.2 |

Note: Sa2VA-i-1B performs on par with the original Sa2VA-26B on the MeViS benchmark!

🏆 Competition Results

3rd Place in LSVOS 2025 MeViS Track (RVOS) with 64.1 J&F

🤗 Model Zoo

Sa2VA-i provides improved inference procedures for existing Sa2VA models. Available models:

| Model | HuggingFace Repository |
|-------|------------------------|
| Sa2VA-i-1B | [kumuji/Sa2VA-i-1B](https://huggingface.co/kumuji/Sa2VA-i-1B) |
| Sa2VA-i-4B | [kumuji/Sa2VA-i-4B](https://huggingface.co/kumuji/Sa2VA-i-4B) |
| Sa2VA-i-8B | [kumuji/Sa2VA-i-8B](https://huggingface.co/kumuji/Sa2VA-i-8B) |
| Sa2VA-i-26B | [kumuji/Sa2VA-i-26B](https://huggingface.co/kumuji/Sa2VA-i-26B) |

🎯 Quick Start

For installation and basic usage, please refer to the original Sa2VA repository. Sa2VA-i is a drop-in replacement for inference.
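
As a rough illustration of what drop-in use looks like, the sketch below follows the loading pattern from the original Sa2VA model cards (`AutoModel` with `trust_remote_code` and a `predict_forward` call); the prompt, image path, and exact input/output keys are illustrative assumptions and may differ between checkpoint versions:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

path = "kumuji/Sa2VA-i-4B"  # any checkpoint from the model zoo above

# Sa2VA(-i) ships custom modeling code, so trust_remote_code is required.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# Single-image referring segmentation; "<image>" marks where the image is inserted.
image = Image.open("example.jpg").convert("RGB")
input_dict = {
    "image": image,
    "text": "<image>Please segment the person on the left.",
    "past_text": "",
    "mask_prompts": None,
    "tokenizer": tokenizer,
}
result = model.predict_forward(**input_dict)
print(result["prediction"])  # text answer; predicted masks are returned alongside it
```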

🔧 Key Improvements

1. Consistent Training-Inference

Eliminates the incompatibility between the finetuned mask decoder and the frozen memory components of SAM2 by using the same procedure during both training and inference.

2. Improved Frame Sampling

Replaces first-frame sampling with uniform sampling for better coverage of video content.
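
A minimal sketch of what uniform sampling means here (the function name and frame budget are ours for illustration, not the actual Sa2VA-i code):

```python
import numpy as np

def sample_frames_uniform(num_video_frames: int, num_sampled: int = 5) -> list[int]:
    """Pick `num_sampled` frame indices spread evenly over the whole video,
    instead of taking only the first frames (illustrative sketch)."""
    if num_video_frames <= num_sampled:
        return list(range(num_video_frames))
    indices = np.linspace(0, num_video_frames - 1, num_sampled).round().astype(int)
    return indices.tolist()

# Example: a 100-frame clip with a 5-frame budget covers the whole video,
# e.g. [0, 25, 50, 74, 99], rather than only the first 5 frames.
print(sample_frames_uniform(100))
```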

3. Original SAM2 Mask Propagation

Uses the original SAM2 weights for mask propagation while keeping the finetuned decoder for the initial mask predictions.
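
Conceptually this looks roughly like the sketch below, written against the public SAM2 video-predictor API (`init_state`, `add_new_mask`, `propagate_in_video`); the config/checkpoint paths and the dummy initial mask are placeholders, and the real Sa2VA-i integration is wired into the model rather than going through this standalone API:

```python
import torch
from sam2.build_sam import build_sam2_video_predictor  # official SAM2 package

# Sketch only: the finetuned Sa2VA-i decoder yields the mask on a key frame,
# and the *original* SAM2 weights propagate that mask through the video.
predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml",   # placeholder config path
    "checkpoints/sam2.1_hiera_large.pt",    # placeholder: original SAM2 weights
)

key_frame_idx = 0                                           # frame the MLLM grounded on
initial_mask = torch.zeros((720, 1280), dtype=torch.bool)   # stand-in for the finetuned decoder's output

with torch.inference_mode():
    state = predictor.init_state(video_path="video_frames/")  # directory of JPEG frames
    predictor.add_new_mask(state, frame_idx=key_frame_idx, obj_id=0, mask=initial_mask)

    per_frame_masks = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        per_frame_masks[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()
```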

Acknowledgement

We thank the Sa2VA authors for their contribution.

📚 Citation

If you use Sa2VA-i in your research, please cite:

@article{sa2va2025improved,
  title={Sa2VA-i: Improving Sa2VA Results with Consistent Training and Inference},
  author={Nekrasov, Alexey and Athar, Ali and de Geus, Daan and Hermans, Alexander and Leibe, Bastian},
  journal={arXiv preprint arXiv:2509.19082},
  year={2025}
}

Shout-out to the original Sa2VA paper!

@article{sa2va,
  title={Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos},
  author={Yuan, Haobo and Li, Xiangtai and Zhang, Tao and Huang, Zilong and Xu, Shilin and Ji, Shunping and Tong, Yunhai and Qi, Lu and Feng, Jiashi and Yang, Ming-Hsuan},
  journal={arXiv preprint},
  year={2025}
}