Model Card for RLPR-Qwen2.5-7B-Base

GitHub | Paper

RLPR-Qwen2.5-7B-Base is trained from Qwen2.5-7B-Base with the RLPR framework. RLPR removes the reliance on external verifiers, which keeps training simple and lets it generalize to more domains.

Model Details

Key Features

  • 💡 Verifier-Free Reasoning Enhancement: RLPR pioneers reinforcement learning for reasoning tasks by using the LLM's intrinsic generation probability as a direct reward signal. This removes the need for external verifiers and specialized fine-tuning, so the method applies broadly and handles complex, diverse answers.
  • 🛠️ Innovative Reward & Training Framework:
    • Features a robust Probability-based Reward (PR) that uses the average decoding probabilities of reference answers to give higher-quality, debiased reward signals, outperforming naive sequence likelihood.
    • Implements a standard-deviation filtering mechanism that dynamically filters prompts to stabilize training and significantly boost final performance.
  • 🚀 Strong Performance in General & Mathematical Reasoning: Demonstrates substantial reasoning improvements across diverse benchmarks (e.g., 56.0 on MMLU-Pro and 55.4 on TheoremQA with Qwen2.5-7B). RLPR surpasses strong models that rely on external verifiers (such as General Reasoner-7B).
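
The reward and filtering ideas in the list above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the filtering threshold, and the moving-average update of that threshold are assumptions made for illustration, and the per-token probabilities are passed in as plain lists rather than read from a model.

```python
import math
import statistics


def probability_reward(ref_token_probs):
    """Average per-token probability of the reference answer (PR-style reward)."""
    if not ref_token_probs:
        raise ValueError("reference answer must contain at least one token")
    return sum(ref_token_probs) / len(ref_token_probs)


def sequence_likelihood(ref_token_probs):
    """Naive baseline: product of the per-token probabilities."""
    return math.prod(ref_token_probs)


def filter_prompts_by_std(rewards_per_prompt, threshold):
    """Keep prompts whose sampled-response rewards vary by at least `threshold`.

    A prompt whose responses all receive nearly the same reward carries little
    learning signal. The threshold value is an assumption, not from the card.
    """
    return {
        prompt_id: rewards
        for prompt_id, rewards in rewards_per_prompt.items()
        if statistics.pstdev(rewards) >= threshold
    }


def update_threshold(threshold, batch_stds, momentum=0.9):
    """Hypothetical dynamic threshold: moving average of observed reward stds."""
    return momentum * threshold + (1 - momentum) * statistics.mean(batch_stds)


# One low-probability token collapses the product but only dents the mean.
probs = [0.9, 0.8, 0.01, 0.9]
print(probability_reward(probs), sequence_likelihood(probs))

rewards = {"easy": [0.9, 0.9, 0.9], "mixed": [0.1, 0.9, 0.5]}
print(list(filter_prompts_by_std(rewards, threshold=0.05)))
```

In this toy example the averaged reward stays informative when a single reference token is unlikely, while the product-based likelihood shrinks toward zero. That is one plausible reading of why an average is preferred over naive sequence likelihood.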


Model Description

Usage

Usage adapted from Qwen2.5-7B-Instruct:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openbmb/RLPR-Qwen2.5-7B-Base"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "How much energy is produced when the sun converts one kg of hydrogen into helium?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Citation

If you find our model/code/paper helpful, please consider citing our paper 📝:

@article{yu2025rlpr,
  title={RLPR: Extrapolating RLVR to General Domain without Verifiers},
  author={Yu, Tianyu and Ji, Bo and Wang, Shouli and Yao, Shu and Wang, Zefan and Cui, Ganqu and Yuan, Lifan and Ding, Ning and Yao, Yuan and Liu, Zhiyuan and Sun, Maosong and Chua, Tat-Seng},
  journal={arXiv preprint arXiv:2506.xxxxx},
  year={2025}
}

Model size: 8B params (Safetensors), tensor type: BF16