Model Card for RLPR-Qwen2.5-7B-Base

GitHub | Paper

RLPR-Qwen2.5-7B-Base is trained from Qwen2.5-7B-Base with the RLPR framework, which eliminates reliance on external verifiers and is simple and generalizable for more domains.

Model Details

Key Features

💡 Verifier-Free Reasoning Enhancement: RLPR pioneers reinforcement learning for reasoning tasks by leveraging the LLM's intrinsic generation probability as a direct reward signal. This eliminates the need for external verifiers and specialized fine-tuning, offering broad applicability and effectively handling complex, diverse answers.
🛠️ Innovative Reward & Training Framework:
- Features a robust Probability-based Reward (PR) using average decoding probabilities of reference answers for higher quality, debiased reward signals, outperforming naive sequence likelihood.
- Implements an standard deviation filtering mechanism that dynamically filters prompts to stabilize training and significantly boost final performance.
🚀 Strong Performance in General & Mathematical Reasoning: Demonstrates substantial reasoning improvements across diverse benchmarks (e.g., 56.0 on MMLU-Pro, 55.4 on TheoremQA with Qwen2.5-7B). RLPR surpasses strong models reliant on external verifiers (like General Reasoner-7B).

Model Description

Trained from model: Qwen2.5-7B
Trained on data: RLPR-Train

Usage

Usage adopted from Qwen2.5-7B-Instruct

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openbmb/RLPR-Qwen2.5-7B-Base"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "How much energy is produced when the sun converts one kg of hydrogen into helium?."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

Citation

If you find our model/code/paper helpful, please consider citing our papers 📝:

@article{yu2025rlpr,
  title={RLPR: Extrapolating RLVR to General Domain without Verifiers},
  author={Yu, Tianyu and Ji, Bo and Wang, Shouli and Yao, Shu and Wang, Zefan and Cui, Ganqu and Yuan, Lifan and Ding, Ning and Yao, Yuan and Liu, Zhiyuan and Sun, Maosong and Chua, Tat-Seng},
  journal={arXiv preprint arXiv:2506.xxxxx},
  year={2025}
}

Downloads last month: 2

Safetensors

Model size

8B params

Tensor type

BF16

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for RLAIF-V/RLPR-Qwen2.5-7B-Base

Quantizations

2 models