REVERSE-Qwen2.5-VL-3B


Model Summary

REVERSE-Qwen2.5-VL-3B is a novel open-source vision-language model (VLM) that performs both next-token prediction and self-verification / self-correction during generation. Built on top of Qwen2.5-VL-3B-Instruct, it is fine-tuned on a 100k subset of the REVERSE Visual Instruct 1.3M dataset and equipped with a retrospective resampling mechanism that allows it to detect and correct hallucinations as it generates. The model was trained in early May 2025.
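
The sketch below illustrates the general idea of detect-and-resample decoding with a rejection threshold τ (the same threshold reported in the tables below). It is a conceptual outline only: the helper names (`sample_next_token`, `hallucination_score`) and the accept/rollback logic are hypothetical stand-ins, not the model's actual inference code, which lives in the GitHub repository.

```python
# Conceptual sketch only (NOT the official REVERSE implementation): a generic
# detect-and-backtrack decoding loop. The helpers `sample_next_token` and
# `hallucination_score` are hypothetical placeholders for the model's learned
# verification signal; see the GitHub repository for the real inference code.
from typing import Callable, List

def retrospective_decode(
    sample_next_token: Callable[[List[int]], int],      # draws the next token id
    hallucination_score: Callable[[List[int]], float],  # confidence that the span is ungrounded
    eos_id: int,
    tau: float = 0.01,        # rejection threshold (plays the role of τ in the tables below)
    max_len: int = 256,
    max_retries: int = 3,
) -> List[int]:
    tokens: List[int] = []
    checkpoint = 0            # start of the span currently under verification
    retries = 0
    while len(tokens) < max_len:
        tok = sample_next_token(tokens)
        tokens.append(tok)
        if tok == eos_id:
            break
        if hallucination_score(tokens[checkpoint:]) > tau and retries < max_retries:
            # Retrospective step: discard the flagged span and resample it.
            del tokens[checkpoint:]
            retries += 1
        else:
            # Accept the span so far and move the checkpoint forward.
            checkpoint = len(tokens)
            retries = 0
    return tokens
```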

Performance

REVERSE achieves state-of-the-art hallucination reduction across diverse captioning and open-ended visual question answering benchmarks. Because Qwen2.5-VL's instruction-tuning data is not publicly available, we ensure an apples-to-apples comparison by fine-tuning the released Qwen2.5-VL-3B-Instruct model on the same 100k subset twice: once with the LLaVA-FT setup and once with our REVERSE recipe. This lets us measure the impact of our method against the LLaVA-FT baseline under consistent conditions.

| Benchmark | Metric | Qwen2.5-VL-FT | REVERSE (τ=0.01) |
|---|---|---|---|
| CHAIR-MSCOCO | CHAIRi (↓) | 12.2 | 10.5 |
| | CHAIRs (↓) | 45.8 | 39.4 |
| AMBER-G | CHAIR (↓) | 7.7 | 7.5 |
| | Coverage (↑) | 51.7 | 51.5 |
| MMHal-Bench | Score (↑) | 2.89 | 3.15 |
| | Hallucination Rate (↓) | 0.43 | 0.29 |
| HaloQuest | Avg. Accuracy (↑) | 33.5 | 45.1 |
| | False Premise Acc. (↑) | 25.4 | 42.9 |
| | Visual Challenging Acc. (↑) | 51.6 | 41.8 |
| | Insufficient Context Acc. (↑) | 26.4 | 55.5 |

It also performs competitively on discriminative tasks compared with the base VLM.

| Benchmark | Metric | Qwen2.5-VL-FT | REVERSE (τ=0.5) |
|---|---|---|---|
| AMBER-D | F1 Score (↑) | 85.0 | 85.7 |
| POPE | F1 Score (↑) | 87.1 | 86.5 |
| MME-Hall | Score (↑) | 550.4 | 589.5 |

Usage

Please refer to the installation guide on GitHub to get started:
👉 Installation Guide
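
Below is a minimal loading sketch, assuming the checkpoint works with the standard Qwen2.5-VL classes in a recent transformers release. Note that plain `generate()` only exercises the base checkpoint; the retrospective-resampling decoding described above requires the project's own inference code from the GitHub repository.

```python
# Minimal loading sketch (assumption: the checkpoint is compatible with the
# standard Qwen2.5-VL classes in a recent transformers release). The
# hallucination-detection / retrospective-resampling decoding is implemented
# in the project's codebase; follow the GitHub installation guide for the
# full pipeline.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "tsunghanwu/reverse_qwen25_vl"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # replace with your own image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ],
}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens.
print(processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```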

Intended Use

Primary Use Cases:

  • Reducing hallucination in image captioning and VQA tasks
  • Benchmarking hallucination-aware generation
  • Research on grounded vision-language generation and self-correction

Target Users:
Researchers, developers, and students working in computer vision, NLP, and multimodal AI.
