PromptCoT 2.0 – Problem Generation Model
This repository hosts the Problem Generation Model (PGM) used in PromptCoT 2.0, a framework for scalable prompt synthesis that advances LLM reasoning in mathematics and programming.
Overview
This checkpoint is the Problem Generation Model (PGM) of PromptCoT 2.0.
- Input: a set of domain concepts (math or programming) and an optional difficulty tag.
- Output: a rationale (the structured "thinking process" that connects the concepts) followed by a fully formed problem (Olympiad-level math or coding task).
How it fits into PromptCoT 2.0:
PromptCoT 2.0 jointly trains two models via an EM optimization loop:
- Rationale Generator (E-step): infers rationales given concepts and problems, updated via reinforcement learning with reward signals.
- Problem Generation Model (PGM) (M-step): learns to produce rationale–problem pairs conditioned only on concepts.
At inference time, the PGM is all you need: provide concepts and it will generate (rationale → problem) in one pass, without any handcrafted templates or domain-specific heuristics.
Model Details
- Model type: Causal language model for problem generation.
- Training data: Concept–rationale–problem triples synthesized and refined via PromptCoT 2.0.
- Domains: Mathematics (Olympiad-level) and Programming (competitive programming).
- Initialization: Warm-started from Qwen2.5-32B-Base with cold-start annotations (concepts & rationales) generated by instruction-tuned models.
Usage
You can load this model with Hugging Face transformers:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "xl-zhao/PromptCoT-2.0-Prompt-Generation-Model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Concepts and difficulty tag to condition on
concept_text = "graph traversal, recursion, dynamic programming"
level = "codeforces"

prompt = f"""Given the foundational programming concepts and specified difficulty level, identify connections among these concepts and develop an olympiad-level coding problem that integrates them with appropriate complexity.
Foundational Programming Concepts:
{concept_text}
Difficulty Level: {level}"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# do_sample=True is required for the temperature setting to take effect
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
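For synthesizing many problems at once, a batched inference engine such as vLLM may be more practical than `transformers.generate`. The snippet below is an optional sketch (vLLM is not a dependency of this repository); it reuses the `prompt` string built above and the same sampling settings.

```python
# Optional: batched generation with vLLM (not required by this repository).
from vllm import LLM, SamplingParams

llm = LLM(model="xl-zhao/PromptCoT-2.0-Prompt-Generation-Model")
sampling_params = SamplingParams(temperature=0.7, max_tokens=4096)

# Batch as many concept prompts as needed; a single prompt is shown here.
for output in llm.generate([prompt], sampling_params):
    print(output.outputs[0].text)
```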
Prompt Templates
Use the following templates for inference. Replace `{concept_text}` and `{level}`. A small helper for filling them programmatically is sketched at the end of this section.
Programming (code)

```
Given the foundational programming concepts and specified difficulty level, identify connections among these concepts and develop an olympiad-level coding problem that integrates them with appropriate complexity.
Foundational Programming Concepts:
{concept_text}
Difficulty Level: {level}
```
Mathematics (math)

```
Given the foundational mathematical concepts and specified difficulty level, identify connections among these concepts and develop an olympiad-level mathematical problem that integrates them with appropriate complexity.
Foundational Mathematical Concepts:
{concept_text}
Difficulty Level: {level}
```
Expected format: The output will first include a Rationale (multi-step explanation of how the concepts are combined) and then a precise Problem statement.
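If you generate prompts programmatically, a small helper like the one below can fill these templates from a list of concepts. The function name `build_prompt` and the exact whitespace layout are illustrative assumptions, not part of the released code; the wording follows the templates above.

```python
# Illustrative helper for filling the PromptCoT 2.0 templates above.
# The function name and layout are assumptions, not part of the released code.
TEMPLATES = {
    "code": (
        "Given the foundational programming concepts and specified difficulty level, "
        "identify connections among these concepts and develop an olympiad-level coding "
        "problem that integrates them with appropriate complexity.\n"
        "Foundational Programming Concepts:\n{concept_text}\n"
        "Difficulty Level: {level}"
    ),
    "math": (
        "Given the foundational mathematical concepts and specified difficulty level, "
        "identify connections among these concepts and develop an olympiad-level "
        "mathematical problem that integrates them with appropriate complexity.\n"
        "Foundational Mathematical Concepts:\n{concept_text}\n"
        "Difficulty Level: {level}"
    ),
}

def build_prompt(concepts, level, domain="math"):
    """Fill the template for the given domain with comma-separated concepts."""
    return TEMPLATES[domain].format(concept_text=", ".join(concepts), level=level)

# Example usage
print(build_prompt(["graph traversal", "dynamic programming"], "codeforces", domain="code"))
```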
Applications
The PGM is the core component powering the creation of:
- Self-Play datasets (math/code problems paired with verifiable answers or unit tests; a toy verification sketch follows this list).
- SFT datasets (problems with complete reasoning traces distilled from teacher models).
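As a purely illustrative example of what "verifiable" means for code problems, the sketch below checks a candidate solution against a list of unit tests. This is a toy stand-in, not the verification pipeline used to build the PromptCoT 2.0 datasets.

```python
# Toy illustration of unit-test verification for a generated coding problem.
# This is NOT the actual PromptCoT 2.0 verification pipeline.
def passes_unit_tests(solution_fn, tests):
    """Return True if solution_fn maps every test input to its expected output."""
    return all(solution_fn(*args) == expected for args, expected in tests)

# Example: a trivial "sum two integers" problem with its tests
tests = [((1, 2), 3), ((-4, 10), 6)]
print(passes_unit_tests(lambda a, b: a + b, tests))  # True
```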
Results
PromptCoT 2.0 demonstrates that rationale-driven prompt synthesis yields harder and more diverse problems than existing datasets.
Self-Play (30B-A3B):
Achieves strong gains in both mathematics and programming.
- Math: 92.1 on AIME24, 89.8 on AIME25, 76.7 on HMMT Feb25.
- Code: 74.2 on LiveCodeBench v5, 71.0 on v6, and 2079 Elo on Codeforces.
Overall, performance is competitive with Gemini 2.5 Pro / OpenAI o3 and surpasses strong open-source baselines.
SFT (7B, 100% synthetic):
Demonstrates that fully synthetic data can rival or outperform human-written datasets.
- Math: 73.1 on AIME24, 65.6 on AIME25, 46.5 on HMMT Feb25.
- Code: 53.4 on LiveCodeBench v5, 48.9 on v6, and 1815 Elo on Codeforces.
These results exceed human-written baselines such as OpenMathReasoning and OpenCodeReasoning, highlighting the scalability of synthetic data.
Resources
- Paper: arXiv:2509.19894 (https://arxiv.org/abs/2509.19894)
- HF Collection
- PromptCoT 2.0 SFT Data (4.8M prompts)
- PromptCoT 2.0 SFT Model (7B)
- Self-Play Models (4B, 30B-A3B)
Citation
If you find this model or the PromptCoT 2.0 framework useful, please cite:
```bibtex
@article{zhao2025promptcot2,
  title   = {PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning},
  author  = {Zhao, Xueliang and Wu, Wei and Guan, Jian and Gong, Zhuocheng and Kong, Lingpeng},
  journal = {arXiv preprint arXiv:2509.19894},
  year    = {2025},
  url     = {https://arxiv.org/abs/2509.19894}
}
```