Results

  • Table 1. Results on the eval set

    Verifier Model                 Rubric Precision  Rubric Recall  Rubric F1  Sample Precision  Sample Recall  Sample F1  Avg. F1
    Qwen3-1.7B                     0.41              0.49           0.34       0.48              0.40           0.32       0.33
    Qwen2.5-3B                     0.42              0.47           0.43       0.49              0.46           0.43       0.43
    Qwen3-4B                       0.56              0.62           0.57       0.61              0.58           0.58       0.58
    Qwen3-8B                       0.54              0.66           0.55       0.62              0.61           0.57       0.56
    LLaMA-3.1-8B                   0.45              0.54           0.42       0.34              0.41           0.32       0.37
    Qwen3-30B-A3B                  0.56              0.66           0.56       0.63              0.62           0.62       0.58
    Qwen2.5-32B-Instruct           0.60              0.67           0.60       0.67              0.68           0.64       0.62
    Search-Gen-V-1.7B (SFT)        0.63              0.62           0.62       0.66              0.66           0.66       0.64
    Search-Gen-V-4B (SFT)          0.70              0.66           0.68       0.72              0.72           0.71       0.70
    Search-Gen-V-4B (SFT+RL)       0.71              0.68           0.70       0.74              0.74           0.73       0.72
    Qwen3-235B-A22B-Instruct-2507  0.72              0.73           0.73       0.76              0.76           0.76       0.74

  • Table 2. Accuracy comparison on verifying rubrics in long-form answers from DeepResearch Bench

    Verifier Model   Precision  Recall  F1
    Qwen3-4B         0.42       0.56    0.42
    Search-Gen-V-4B  0.59       0.57    0.57
    Qwen3-235B-A22B  0.57       0.67    0.61

  • Table 3. Results on the short-form workload, HotpotQA

    Verifier Model        Precision  Recall  F1
    EM                    0.84       0.80    0.82
    Qwen3-4B              0.83       0.70    0.71
    Search-Gen-V-4B       0.86       0.76    0.77
    Qwen3-235B-A22B       0.87       0.78    0.80
    EM + Qwen3-4B         0.94       0.92    0.93
    EM + Search-Gen-V-4B  0.95       0.93    0.94
    EM + Qwen3-235B-A22B  0.96       0.94    0.95
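
The Avg. F1 column in Table 1 is not defined in this section. One plausible reading, assumed here and not confirmed by the source, is that it is the unweighted mean of Rubric F1 and Sample F1. The sketch below checks that reading against the rows of Table 1; the names `TABLE1`, `mean_f1`, and `max_gap` are illustrative and do not come from the original.

```python
# Rows copied from Table 1: (model, Rubric F1, Sample F1, reported Avg. F1).
TABLE1 = [
    ("Qwen3-1.7B", 0.34, 0.32, 0.33),
    ("Qwen2.5-3B", 0.43, 0.43, 0.43),
    ("Qwen3-4B", 0.57, 0.58, 0.58),
    ("Qwen3-8B", 0.55, 0.57, 0.56),
    ("LLaMA-3.1-8B", 0.42, 0.32, 0.37),
    ("Qwen3-30B-A3B", 0.56, 0.62, 0.58),
    ("Qwen2.5-32B-Instruct", 0.60, 0.64, 0.62),
    ("Search-Gen-V-1.7B (SFT)", 0.62, 0.66, 0.64),
    ("Search-Gen-V-4B (SFT)", 0.68, 0.71, 0.70),
    ("Search-Gen-V-4B (SFT+RL)", 0.70, 0.73, 0.72),
    ("Qwen3-235B-A22B-Instruct-2507", 0.73, 0.76, 0.74),
]


def mean_f1(rubric_f1, sample_f1):
    """Unweighted mean of the two F1 scores (ASSUMED definition of Avg. F1)."""
    return (rubric_f1 + sample_f1) / 2


def max_gap(rows):
    """Largest absolute gap between the recomputed mean and the reported Avg. F1."""
    return max(abs(mean_f1(r, s) - avg) for _, r, s, avg in rows)


if __name__ == "__main__":
    for model, r, s, avg in TABLE1:
        print(f"{model:32s} mean={mean_f1(r, s):.3f} reported={avg:.2f}")
    print(f"largest gap: {max_gap(TABLE1):.3f}")
```

Under this assumption, every recomputed mean lies within about 0.01 of the reported value, which is compatible with the table having been rounded to two decimals from unrounded scores.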

Related links


Citation

@article{ma2025searchgenv,
  title={AN EFFICIENT RUBRIC-BASED GENERATIVE VERIFIER FOR SEARCH-AUGMENTED LLMS},
  author={Ma, Linyue and Xu, Yilong and Long, Xiang and Zheng, Zhi},
  journal={arXiv preprint arXiv:2510.14660},
  year={2025},
  url={https://arxiv.org/abs/2510.14660}
}