
MULTI-TAP: Multi-Objective Task-Aware Predictor for Image-Text Alignment

Hello, we are a team of researchers at KAIST AI working on multimodal evaluation and accessible AI systems. In this project, we introduce MULTI-TAP, a plug-and-play predictor for image-text alignment that supports both single- and multi-objective scoring. We also release EYE4ALL, a human-annotated dataset for evaluating multimodal responses that incorporates the perspectives of blind and low-vision (BLV) individuals.


Abstract

Evaluating image-text alignment while reflecting human preferences across multiple aspects is a significant issue for the development of reliable vision-language applications. It becomes especially crucial in real-world scenarios where multiple valid descriptions exist depending on context or user needs. However, research progress is hindered by the lack of comprehensive benchmarks and by existing evaluation predictors that lack at least one of these key properties: (1) alignment with human judgments, (2) long-sequence processing, (3) inference efficiency, and (4) applicability to multi-objective scoring. To address these challenges, we propose a plug-and-play architecture for building a robust predictor, MULTI-TAP (Multi-Objective Task-Aware Predictor), capable of both multi- and single-objective scoring. MULTI-TAP can produce a single overall score using a reward head built on top of a large vision-language model (LVLM). We show that MULTI-TAP is robust across different LVLM architectures, achieving significantly higher performance than existing metrics (e.g., +42.3 Kendall's tau-c compared to IXCREW-S on FlickrExp) and performing on par with the GPT-4o-based predictor G-VEval at a smaller size (7-8B). By training a lightweight ridge regression layer on the frozen hidden states of a pre-trained LVLM, MULTI-TAP can also produce fine-grained scores for multiple human-interpretable objectives. MULTI-TAP surpasses VisionREWARD, a high-performing multi-objective reward model, in both performance and efficiency on multi-objective benchmarks and on our newly released text-image-to-text dataset, EYE4ALL. Our new dataset, consisting of chosen/rejected human preferences (EYE4ALLPref) and human-annotated fine-grained scores across seven dimensions (EYE4ALLMulti), can serve as a foundation for developing more accessible AI systems by capturing the underlying preferences of users, including blind and low-vision (BLV) individuals. Our contributions can guide future research on developing human-aligned predictors.

MULTI-TAP Framework

  • Single-objective predictor: Produces an overall alignment score by training a lightweight reward head on top of LVLM hidden states.
  • Multi-objective predictor: Adds a ridge regression layer for generating fine-grained scores across human-interpretable dimensions.
  • Backbone agnostic: Compatible with Qwen2-VL, InternLM-XComposer, and LLaMA-3.2 LVLMs.
  • Efficient: Reduces inference time by up to 14× compared to generative reward models.

*The two-stage MULTI-TAP framework: Stage 1 (single-objective reward modeling), Stage 2 (multi-objective scoring).*
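
To make the two-stage design above concrete, here is a minimal sketch using PyTorch and scikit-learn. It assumes pooled hidden states from a frozen LVLM backbone; the hidden size, the random placeholder features, and the stand-in annotations are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the two-stage idea (not the official implementation).
# Stage 1: a small reward head maps frozen LVLM hidden states to one overall score.
# Stage 2: ridge regression over the same frozen features yields per-objective scores.
import torch
import torch.nn as nn
from sklearn.linear_model import Ridge

HIDDEN_DIM = 3584       # assumed hidden size (e.g., a Qwen2-VL-7B-sized backbone)
NUM_OBJECTIVES = 7      # the seven EYE4ALLMulti dimensions

class RewardHead(nn.Module):
    """Stage 1: pooled image-text hidden state -> single alignment score."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # hidden_state: (batch, hidden_dim), pooled from the frozen LVLM
        return self.proj(hidden_state).squeeze(-1)

# Placeholder features standing in for frozen LVLM hidden states.
features = torch.randn(128, HIDDEN_DIM)

# Stage 1: overall alignment scores (the head would normally be trained on preference data).
reward_head = RewardHead(HIDDEN_DIM)
overall_scores = reward_head(features)                    # shape: (128,)

# Stage 2: lightweight ridge regression producing fine-grained, per-objective scores.
stand_in_annotations = torch.rand(128, NUM_OBJECTIVES)    # placeholder human scores
ridge = Ridge(alpha=1.0)
ridge.fit(features.numpy(), stand_in_annotations.numpy())
fine_grained_scores = ridge.predict(features[:4].numpy()) # shape: (4, 7)
print(overall_scores.shape, fine_grained_scores.shape)
```

Both stages reuse the same frozen backbone features, which is what keeps multi-objective scoring lightweight relative to generative reward models.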

EYE4ALL Dataset

  • EYE4ALLPref: Pairwise human preferences for LVLM responses.
  • EYE4ALLMulti: Scalar human scores across 7 dimensions:
    1. Direction Accuracy
    2. Depth Accuracy
    3. Safety
    4. Sufficiency
    5. Conciseness
    6. Hallucination
    7. Overall Quality

Example annotations from EYE4ALL. Human evaluators rated LVLM outputs on multiple fine-grained criteria.
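
For concreteness, a single EYE4ALLMulti-style record might look like the following; the field names, the score scale, and the example question/response are hypothetical placeholders for illustration, not the released schema.

```python
# Hypothetical EYE4ALLMulti-style record; field names and the score scale are
# assumptions for illustration, not the released schema.
example_record = {
    "image": "path/to/street_scene.jpg",                  # assumed field name
    "question": "Which way should I walk to reach the crosswalk?",
    "response": "Walk straight for about two meters, then bear slightly left.",
    "scores": {                                           # seven annotated dimensions
        "direction_accuracy": 4,
        "depth_accuracy": 3,
        "safety": 5,
        "sufficiency": 4,
        "conciseness": 5,
        "hallucination": 5,   # higher = fewer hallucinations (assumed convention)
        "overall_quality": 4,
    },
}
```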

Results

MULTI-TAP consistently achieves stronger correlations with human judgments than prior predictors.
It performs on par with the GPT-4o-based G-VEval while being significantly more efficient, and it surpasses VisionREWARD on multi-objective scoring benchmarks.
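
For reference, the rank correlations reported above (e.g., Kendall's tau-c) can be computed as follows; the arrays here are made-up placeholders, not numbers from the paper.

```python
# Sketch of computing Kendall's tau-c between predictor scores and human ratings.
# The values below are placeholders, not results from the paper.
from scipy.stats import kendalltau

human_ratings = [5, 3, 4, 2, 1, 4, 5, 2]                             # placeholder human judgments
predictor_scores = [0.92, 0.41, 0.77, 0.30, 0.12, 0.68, 0.88, 0.35]  # placeholder model scores

tau_c, p_value = kendalltau(human_ratings, predictor_scores, variant="c")
print(f"Kendall's tau-c = {tau_c:.3f} (p = {p_value:.3f})")
```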

Performance comparison: MULTI-TAP vs. existing predictors on single- and multi-objective benchmarks.

BibTeX

If you find our work helpful, please cite us:

<!-- @inproceedings{multi_tap,
  title={MULTI-TAP: Multi-Objective Task-Aware Predictor for Image-Text Alignment},
  author={Anonymous},
  year={2026},
  url={https://arxiv.org/abs/TODO}
} -->