## Links for Reference

- Homepage: https://cupid.kixlab.org
- Repository: https://github.com/kixlab/CUPID
- Benchmark Dataset: https://huggingface.co/datasets/kixlab/CUPID
- Paper: https://arxiv.org/abs/2508.01674
- Point of Contact: [email protected]
## TL;DR

PrefMatcher-7B instantiates the Preference Match metric proposed in the CUPID benchmark (COLM 2025). Given a preference description and an evaluation checklist, the model judges whether each checklist item matches, or is covered by, the preference. It was trained from Qwen2.5-7B-Instruct as the base model and provides a high-fidelity, cost-efficient judge for automatic evaluation on the CUPID benchmark.
## Model Details

PrefMatcher-7B was fine-tuned with QLoRA for 1 epoch on 4k data samples (i.e., preference-checklist matches). PrefMatcher achieves a Krippendorff's alpha of 0.748 with human annotations. The data samples were created through the synthesis pipeline for the CUPID benchmark and then labeled for matches by GPT-4o. The model was trained with the torchtune library.
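
For context on the agreement number, Krippendorff's alpha can be computed with the `krippendorff` package. The sketch below is illustrative only (hypothetical binary labels, nominal measurement level); it is not the paper's evaluation code:

```python
# Sketch: measuring judge-human agreement with Krippendorff's alpha.
# The labels below are hypothetical; 1 = covered, 0 = not covered.
import krippendorff

human_labels = [1, 0, 1, 1, 0, 1]  # hypothetical human annotations
model_labels = [1, 0, 1, 0, 0, 1]  # hypothetical PrefMatcher judgments

alpha = krippendorff.alpha(
    reliability_data=[human_labels, model_labels],
    level_of_measurement="nominal",
)
print(f"Krippendorff's alpha: {alpha:.3f}")
```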
### Model Description

- Model type: Language model
- Language(s) (NLP): English
- License: Apache 2.0
## Usage

Here is example code to use the model with vLLM to predict whether each item in an evaluation checklist is covered by a preference.
```python
from vllm import LLM, SamplingParams

model_name = "kixlab/prefmatcher-7b"

# Load the model
llm = LLM(
    model=model_name,
    load_format="safetensors",
    kv_cache_dtype="auto",
    max_model_len=512,  # prompt and generated tokens share this budget
)

# Prepare example input
preference = "Analysis should focus exclusively on visible surface defects and their direct correlation to specific printer settings."
checklist = [
    "Does the training document provide a detailed framework?",
    "Does the training document provide a systematic framework?",
    "Does the framework link external and internal test cube measurements to specific diagnostics?",
    "Does the framework link external and internal test cube measurements to specific quality improvement actions?",
]

# Format the checklist as a numbered list
checklist_str = "\n".join([f"{i+1}. {item}" for i, item in enumerate(checklist)])

messages = [
    {
        "role": "system",
        "content": "You are an analytical and insightful assistant that can determine the similarity between **evaluation checklists** and **evaluation criteria**. A criterion describes an aspect of AI outputs that should be evaluated. A checklist contains questions that are used to evaluate more specific or fine-grained aspects of the AI outputs. You will be provided with pairs of checklists and criteria. For each pair, you should determine whether each entry in the checklist is **covered** by the criterion. **Covered** means that the criterion and the checklist entry will evaluate the same or similar aspects of an AI output, even if they use different wording or phrasing.",
    },
    {
        "role": "user",
        "content": f"#### Criterion\n\n{preference}\n\n#### Checklist\n\n{checklist_str}",
    },
]

sampling_params = SamplingParams(
    max_tokens=512,
    temperature=0.7,
)

# Generate the output
outputs = llm.chat(messages, sampling_params=sampling_params, use_tqdm=False)

# Print the output
print(outputs[0].outputs[0].text)
```
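
Evaluating a CUPID instance involves many preference-checklist pairs, and recent vLLM versions let `llm.chat` take a list of conversations so that pairs can be scored in one batched call. A minimal sketch, assuming a `SYSTEM_PROMPT` string holding the system message from the example above and a hypothetical `pairs` list of `(preference, checklist_str)` tuples:

```python
# Sketch: batching multiple preference-checklist pairs in one call.
# SYSTEM_PROMPT and pairs are assumptions, not part of the example above.
conversations = [
    [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"#### Criterion\n\n{pref}\n\n#### Checklist\n\n{checks}"},
    ]
    for pref, checks in pairs
]
outputs = llm.chat(conversations, sampling_params=sampling_params, use_tqdm=True)
for out in outputs:
    print(out.outputs[0].text)
```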
## Training Details

### Training hyperparameters

The following hyperparameters were used for training (a rough Hugging Face PEFT equivalent is sketched after the list):
- learning_rate: 3e-4
- train_batch_size: 4
- gradient_accumulation_steps: 8
- weight_decay: 1e-2
- optimizer: AdamW
- lr_scheduler_type: Cosine with warmup
- num_warmup_steps: 100
- lora_rank: 64
- lora_alpha: 128
- lora_dropout: 0.0
- lora_attn_modules: ['q_proj', 'v_proj', 'output_proj']
- apply_lora_to_mlp: True
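
Training used torchtune, whose module-naming conventions appear above. For readers reproducing the setup with Hugging Face PEFT instead, a roughly equivalent `LoraConfig` might look like the sketch below; the mapping (`output_proj` → `o_proj`, `apply_lora_to_mlp: True` → MLP projections) is an assumption, not the actual training configuration:

```python
# Sketch: an approximately equivalent LoRA setup in Hugging Face PEFT.
# This is an illustrative mapping, not the configuration actually used.
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,               # lora_rank
    lora_alpha=128,
    lora_dropout=0.0,
    target_modules=[
        "q_proj", "v_proj", "o_proj",         # attention (output_proj -> o_proj)
        "gate_proj", "up_proj", "down_proj",  # MLP (apply_lora_to_mlp: True)
    ],
    task_type="CAUSAL_LM",
)
```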
## Citation

If you find our work useful, please consider citing our paper!

BibTeX:

```bibtex
@article{kim2025cupid,
  title   = {CUPID: Evaluating Personalized and Contextualized Alignment of LLMs from Interactions},
  author  = {Kim, Tae Soo and Lee, Yoonjoo and Park, Yoonah and Kim, Jiho and Kim, Young-Ho and Kim, Juho},
  journal = {arXiv preprint arXiv:2508.01674},
  year    = {2025},
}
```