PromptCoT 2.0 – Problem Generation Model
This repository hosts the Problem Generation Model (PGM) used in PromptCoT 2.0, a framework for scalable prompt synthesis that advances LLM reasoning in mathematics and programming.
Overview
This checkpoint is the Problem Generation Model (PGM) of PromptCoT 2.0.
- Input: a set of domain concepts (math or programming) and an optional difficulty tag.
- Output: a rationale (the structured "thinking process" that connects the concepts) followed by a fully formed problem (Olympiad-level math or coding task).
How it fits into PromptCoT 2.0:
PromptCoT 2.0 jointly trains two models via an EM optimization loop:
- Rationale Generator (E-step): infers rationales given concepts and problems, updated via reinforcement learning with reward signals.
- Problem Generation Model (PGM) (M-step): learns to produce rationale–problem pairs conditioned only on concepts.
At inference time, the PGM is all you need: provide concepts and it will generate (rationale → problem) in one pass, without any handcrafted templates or domain-specific heuristics.
Model Details
- Model type: Causal language model for problem generation.
- Training data: Concept–rationale–problem triples synthesized and refined via PromptCoT 2.0.
- Domains: Mathematics (Olympiad-level) and Programming (competitive programming).
- Initialization: Warm-started from Qwen2.5-32B-Base with cold-start annotations (concepts & rationales) generated by instruction-tuned models.
Usage
You can load this model with Hugging Face transformers:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "xl-zhao/PromptCoT-2.0-Prompt-Generation-Model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Concepts and difficulty tag to condition on
concept_text = "graph traversal, recursion, dynamic programming"
level = "codeforces"

prompt = f"""Given the foundational programming concepts and specified difficulty level, identify connections among these concepts and develop an olympiad-level coding problem that integrates them with appropriate complexity.
Foundational Programming Concepts:
{concept_text}
Difficulty Level: {level}"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# do_sample=True is required for the temperature setting to take effect
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
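For synthesizing many problems at once, a batched inference engine such as vLLM may be more practical than `transformers.generate`. The snippet below is an optional sketch (vLLM is not a dependency of this repository); it reuses the `prompt` string built above and the same sampling settings.

```python
# Optional: batched generation with vLLM (not required by this repository).
from vllm import LLM, SamplingParams

llm = LLM(model="xl-zhao/PromptCoT-2.0-Prompt-Generation-Model")
sampling_params = SamplingParams(temperature=0.7, max_tokens=4096)

# Batch as many concept prompts as needed; a single prompt is shown here.
for output in llm.generate([prompt], sampling_params):
    print(output.outputs[0].text)
```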
Prompt Templates
Use the following templates for inference. Replace `{concept_text}` and `{level}`. A small helper for filling them programmatically is sketched at the end of this section.
Programming (code)

```
Given the foundational programming concepts and specified difficulty level, identify connections among these concepts and develop an olympiad-level coding problem that integrates them with appropriate complexity.
Foundational Programming Concepts:
{concept_text}
Difficulty Level: {level}
```
Mathematics (math)

```
Given the foundational mathematical concepts and specified difficulty level, identify connections among these concepts and develop an olympiad-level mathematical problem that integrates them with appropriate complexity.
Foundational Mathematical Concepts:
{concept_text}
Difficulty Level: {level}
```
Expected format: The output will first include a Rationale (multi-step explanation of how the concepts are combined) and then a precise Problem statement.
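If you generate prompts programmatically, a small helper like the one below can fill these templates from a list of concepts. The function name `build_prompt` and the exact whitespace layout are illustrative assumptions, not part of the released code; the wording follows the templates above.

```python
# Illustrative helper for filling the PromptCoT 2.0 templates above.
# The function name and layout are assumptions, not part of the released code.
TEMPLATES = {
    "code": (
        "Given the foundational programming concepts and specified difficulty level, "
        "identify connections among these concepts and develop an olympiad-level coding "
        "problem that integrates them with appropriate complexity.\n"
        "Foundational Programming Concepts:\n{concept_text}\n"
        "Difficulty Level: {level}"
    ),
    "math": (
        "Given the foundational mathematical concepts and specified difficulty level, "
        "identify connections among these concepts and develop an olympiad-level "
        "mathematical problem that integrates them with appropriate complexity.\n"
        "Foundational Mathematical Concepts:\n{concept_text}\n"
        "Difficulty Level: {level}"
    ),
}

def build_prompt(concepts, level, domain="math"):
    """Fill the template for the given domain with comma-separated concepts."""
    return TEMPLATES[domain].format(concept_text=", ".join(concepts), level=level)

# Example usage
print(build_prompt(["graph traversal", "dynamic programming"], "codeforces", domain="code"))
```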
Applications
The PGM is the core component powering the creation of:
- Self-Play datasets (math/code problems paired with verifiable answers or unit tests; a toy verification sketch follows this list).
- SFT datasets (problems with complete reasoning traces distilled from teacher models).
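As a purely illustrative example of what "verifiable" means for code problems, the sketch below checks a candidate solution against a list of unit tests. This is a toy stand-in, not the verification pipeline used to build the PromptCoT 2.0 datasets.

```python
# Toy illustration of unit-test verification for a generated coding problem.
# This is NOT the actual PromptCoT 2.0 verification pipeline.
def passes_unit_tests(solution_fn, tests):
    """Return True if solution_fn maps every test input to its expected output."""
    return all(solution_fn(*args) == expected for args, expected in tests)

# Example: a trivial "sum two integers" problem with its tests
tests = [((1, 2), 3), ((-4, 10), 6)]
print(passes_unit_tests(lambda a, b: a + b, tests))  # True
```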
Results
PromptCoT 2.0 demonstrates that rationale-driven prompt synthesis yields harder and more diverse problems than existing datasets.
Self-Play (30B-A3B):
Achieves strong gains in both mathematics and programming.
- Math: 92.1 on AIME24, 89.8 on AIME25, 76.7 on HMMT Feb25.
- Code: 74.2 on LiveCodeBench v5, 71.0 on v6, and 2079 Elo on Codeforces.
Overall, performance is competitive with Gemini 2.5 Pro / OpenAI o3 and surpasses strong open-source baselines.
SFT (7B, 100% synthetic):
Demonstrates that fully synthetic data can rival or outperform human-written datasets.
- Math: 73.1 on AIME24, 65.6 on AIME25, 46.5 on HMMT Feb25.
- Code: 53.4 on LiveCodeBench v5, 48.9 on v6, and 1815 Elo on Codeforces.
These results exceed human-written baselines such as OpenMathReasoning and OpenCodeReasoning, highlighting the scalability of synthetic data.
Resources
- Paper: arXiv:2509.19894 (https://arxiv.org/abs/2509.19894)
- HF Collection
- PromptCoT 2.0 SFT Data (4.8M prompts)
- PromptCoT 2.0 SFT Model (7B)
- Self-Play Models (4B, 30B-A3B)
Citation
If you find this model or the PromptCoT 2.0 framework useful, please cite:
```bibtex
@article{zhao2025promptcot2,
  title   = {PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning},
  author  = {Zhao, Xueliang and Wu, Wei and Guan, Jian and Gong, Zhuocheng and Kong, Lingpeng},
  journal = {arXiv preprint arXiv:2509.19894},
  year    = {2025},
  url     = {https://arxiv.org/abs/2509.19894}
}
```