## Links for Reference

- Homepage: https://cupid.kixlab.org
- Repository: https://github.com/kixlab/CUPID
- Benchmark Dataset: https://huggingface.co/datasets/kixlab/CUPID
- Paper: https://arxiv.org/abs/2508.01674
- Point of Contact: [email protected]
## TL;DR

PrefMatcher-7B instantiates the Preference Match metric proposed in the CUPID benchmark (COLM 2025). Given a preference description and an evaluation checklist, the model judges whether each checklist item matches, or is covered by, the preference. It was trained from Qwen2.5-7B-Instruct as the base model and provides a high-fidelity, cost-efficient judge for automatic evaluation on the CUPID benchmark.
## Model Details

PrefMatcher-7B was fine-tuned with QLoRA for 1 epoch on 4k data samples (i.e., preference-checklist matches). PrefMatcher achieves a Krippendorff's alpha of 0.748 with human annotations. The data samples were created through the synthesis pipeline for the CUPID benchmark and then labeled for matches by GPT-4o. The model was trained with the torchtune library.
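
For context on the agreement number, Krippendorff's alpha can be computed with the `krippendorff` package. The sketch below is illustrative only (hypothetical binary labels, nominal measurement level); it is not the paper's evaluation code:

```python
# Sketch: measuring judge-human agreement with Krippendorff's alpha.
# The labels below are hypothetical; 1 = covered, 0 = not covered.
import krippendorff

human_labels = [1, 0, 1, 1, 0, 1]  # hypothetical human annotations
model_labels = [1, 0, 1, 0, 0, 1]  # hypothetical PrefMatcher judgments

alpha = krippendorff.alpha(
    reliability_data=[human_labels, model_labels],
    level_of_measurement="nominal",
)
print(f"Krippendorff's alpha: {alpha:.3f}")
```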
### Model Description

- Model type: Language model
- Language(s) (NLP): English
- License: Apache 2.0
## Usage

Here is example code to use the model with vLLM to predict whether each item in an evaluation checklist is covered by a preference.
```python
from vllm import LLM, SamplingParams

model_name = "kixlab/prefmatcher-7b"

# Load the model
llm = LLM(
    model=model_name,
    load_format="safetensors",
    kv_cache_dtype="auto",
    max_model_len=512,  # prompt and generated tokens share this budget
)

# Prepare example input
preference = "Analysis should focus exclusively on visible surface defects and their direct correlation to specific printer settings."
checklist = [
    "Does the training document provide a detailed framework?",
    "Does the training document provide a systematic framework?",
    "Does the framework link external and internal test cube measurements to specific diagnostics?",
    "Does the framework link external and internal test cube measurements to specific quality improvement actions?",
]

# Format the checklist as a numbered list
checklist_str = "\n".join([f"{i+1}. {item}" for i, item in enumerate(checklist)])

messages = [
    {
        "role": "system",
        "content": "You are an analytical and insightful assistant that can determine the similarity between **evaluation checklists** and **evaluation criteria**. A criterion describes an aspect of AI outputs that should be evaluated. A checklist contains questions that are used to evaluate more specific or fine-grained aspects of the AI outputs. You will be provided with pairs of checklists and criteria. For each pair, you should determine whether each entry in the checklist is **covered** by the criterion. **Covered** means that the criterion and the checklist entry will evaluate the same or similar aspects of an AI output, even if they use different wording or phrasing.",
    },
    {
        "role": "user",
        "content": f"#### Criterion\n\n{preference}\n\n#### Checklist\n\n{checklist_str}",
    },
]

sampling_params = SamplingParams(
    max_tokens=512,
    temperature=0.7,
)

# Generate the output
outputs = llm.chat(messages, sampling_params=sampling_params, use_tqdm=False)

# Print the output
print(outputs[0].outputs[0].text)
```
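
Evaluating a CUPID instance involves many preference-checklist pairs, and recent vLLM versions let `llm.chat` take a list of conversations so that pairs can be scored in one batched call. A minimal sketch, assuming a `SYSTEM_PROMPT` string holding the system message from the example above and a hypothetical `pairs` list of `(preference, checklist_str)` tuples:

```python
# Sketch: batching multiple preference-checklist pairs in one call.
# SYSTEM_PROMPT and pairs are assumptions, not part of the example above.
conversations = [
    [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"#### Criterion\n\n{pref}\n\n#### Checklist\n\n{checks}"},
    ]
    for pref, checks in pairs
]
outputs = llm.chat(conversations, sampling_params=sampling_params, use_tqdm=True)
for out in outputs:
    print(out.outputs[0].text)
```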
## Training Details

### Training hyperparameters

The following hyperparameters were used for training (a rough Hugging Face PEFT equivalent is sketched after the list):
- learning_rate: 3e-4
- train_batch_size: 4
- gradient_accumulation_steps: 8
- weight_decay: 1e-2
- optimizer: AdamW
- lr_scheduler_type: Cosine with warmup
- num_warmup_steps: 100
- lora_rank: 64
- lora_alpha: 128
- lora_dropout: 0.0
- lora_attn_modules: ['q_proj', 'v_proj', 'output_proj']
- apply_lora_to_mlp: True
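
Training used torchtune, whose module-naming conventions appear above. For readers reproducing the setup with Hugging Face PEFT instead, a roughly equivalent `LoraConfig` might look like the sketch below; the mapping (`output_proj` → `o_proj`, `apply_lora_to_mlp: True` → MLP projections) is an assumption, not the actual training configuration:

```python
# Sketch: an approximately equivalent LoRA setup in Hugging Face PEFT.
# This is an illustrative mapping, not the configuration actually used.
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,               # lora_rank
    lora_alpha=128,
    lora_dropout=0.0,
    target_modules=[
        "q_proj", "v_proj", "o_proj",         # attention (output_proj -> o_proj)
        "gate_proj", "up_proj", "down_proj",  # MLP (apply_lora_to_mlp: True)
    ],
    task_type="CAUSAL_LM",
)
```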
## Citation

If you find our work useful, please consider citing our paper!

BibTeX:

```bibtex
@article{kim2025cupid,
  title   = {CUPID: Evaluating Personalized and Contextualized Alignment of LLMs from Interactions},
  author  = {Kim, Tae Soo and Lee, Yoonjoo and Park, Yoonah and Kim, Jiho and Kim, Young-Ho and Kim, Juho},
  journal = {arXiv preprint arXiv:2508.01674},
  year    = {2025},
}
```