PromptCoT 2.0: Problem Generation Model

This repository hosts the Problem Generation Model (PGM) used in PromptCoT 2.0, a framework for scalable prompt synthesis that advances LLM reasoning in mathematics and programming.


✨ Overview

This checkpoint is the Problem Generation Model (PGM) of PromptCoT 2.0.

  • Input: a set of domain concepts (math or programming) and an optional difficulty tag.
  • Output: a rationale (the structured "thinking process" that connects the concepts) followed by a fully formed problem (an Olympiad-level math or coding task).

How it fits into PromptCoT 2.0:
PromptCoT 2.0 jointly trains two models via an EM optimization loop:

  • Rationale Generator (E-step): infers rationales given concepts and problems, updated via reinforcement learning with reward signals.
  • Problem Generation Model (PGM) (M-step): learns to produce rationaleโ€“problem pairs conditioned only on concepts.

At inference time, the PGM is all you need: provide concepts and it will generate (rationale → problem) in one pass, without any handcrafted templates or domain-specific heuristics.
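
For intuition, here is a minimal schematic of that alternation. The class and method names are placeholders for illustration only; they are not the released training code.

class RationaleGenerator:
    def infer(self, concepts, problem):
        # E-step: propose a rationale connecting the concepts to the problem.
        return f"rationale linking {concepts} to the problem"

    def update_with_rewards(self, triples):
        # In PromptCoT 2.0 this update uses reinforcement learning with reward
        # signals; here it is a no-op placeholder.
        pass

class ProblemGenerationModel:
    def fit(self, triples):
        # M-step: learn to produce (rationale, problem) conditioned only on concepts.
        pass

def em_round(pairs, rationale_gen, pgm):
    """One EM round over (concepts, problem) pairs."""
    triples = []
    for concepts, problem in pairs:
        rationale = rationale_gen.infer(concepts, problem)  # E-step inference
        triples.append((concepts, rationale, problem))
    rationale_gen.update_with_rewards(triples)  # E-step model update
    pgm.fit(triples)                            # M-step model update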


📦 Model Details

  • Model type: Causal language model for problem generation.
  • Training data: Concept–rationale–problem triples synthesized and refined via PromptCoT 2.0.
  • Domains: Mathematics (Olympiad-level) and Programming (competitive programming).
  • Initialization: Warm-started from Qwen2.5-32B-Base with cold-start annotations (concepts & rationales) generated by instruction-tuned models.

🚀 Usage

You can load this model with Hugging Face transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "xl-zhao/PromptCoT-2.0-Prompt-Generation-Model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

concept_text = "graph traversal, recursion, dynamic programming"
level = "codeforces"

prompt = f"""Given the foundational programming concepts and specified difficulty level, identify connections among these concepts and develop an olympiad-level coding problem that integrates them with appropriate complexity.

Foundational Programming Concepts:
{concept_text}

Difficulty Level: {level}"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Sampling is enabled so that `temperature` takes effect; `max_new_tokens`
# bounds the length of the generated rationale and problem.
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
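
To print only the newly generated continuation (the rationale followed by the problem) instead of the full decoded sequence, you can slice off the prompt tokens before decoding:

# Keep only the tokens generated after the prompt.
generated_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(generated_tokens, skip_special_tokens=True))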

📑 Prompt Templates

Use the following templates for inference, replacing {concept_text} with your concept list and {level} with a difficulty tag.

Programming (code)

Given the foundational programming concepts and specified difficulty level, identify connections among these concepts and develop an olympiad-level coding problem that integrates them with appropriate complexity.

Foundational Programming Concepts:
{concept_text}

Difficulty Level: {level}

Mathematics (math)

Given the foundational mathematical concepts and specified difficulty level, identify connections among these concepts and develop an olympiad-level mathematical problem that integrates them with appropriate complexity.

Foundational Mathematical Concepts:
{concept_text}

Difficulty Level: {level}

Expected format: The output will first include a Rationale (multi-step explanation of how the concepts are combined) and then a precise Problem statement.
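
If it helps, here is a small, hypothetical helper (not part of this repository) that fills either template programmatically:

CODE_TEMPLATE = """Given the foundational programming concepts and specified difficulty level, identify connections among these concepts and develop an olympiad-level coding problem that integrates them with appropriate complexity.

Foundational Programming Concepts:
{concept_text}

Difficulty Level: {level}"""

MATH_TEMPLATE = """Given the foundational mathematical concepts and specified difficulty level, identify connections among these concepts and develop an olympiad-level mathematical problem that integrates them with appropriate complexity.

Foundational Mathematical Concepts:
{concept_text}

Difficulty Level: {level}"""

def build_prompt(domain: str, concept_text: str, level: str) -> str:
    """Fill the matching template; `domain` is either "code" or "math"."""
    template = CODE_TEMPLATE if domain == "code" else MATH_TEMPLATE
    return template.format(concept_text=concept_text, level=level)

prompt = build_prompt("code", "graph traversal, recursion, dynamic programming", "codeforces")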


🔮 Applications

The PGM is the core component powering the creation of:

  • Self-Play datasets (math/code problems paired with verifiable answers or unit tests).
  • SFT datasets (problems with complete reasoning traces distilled from teacher models).

📊 Results

PromptCoT 2.0 demonstrates that rationale-driven prompt synthesis yields harder and more diverse problems than existing datasets.

  • Self-Play (30B-A3B):
    Achieves strong gains in both mathematics and programming.

    • Math: 92.1 on AIME24, 89.8 on AIME25, 76.7 on HMMT Feb25.
    • Code: 74.2 on LiveCodeBench v5, 71.0 on v6, and 2079 Elo on Codeforces.
      Overall, performance is competitive with Gemini 2.5 Pro / OpenAI o3 and surpasses strong open-source baselines.
  • SFT (7B, 100% synthetic):
    Demonstrates that fully synthetic data can rival or outperform human-written datasets.

    • Math: 73.1 on AIME24, 65.6 on AIME25, 46.5 on HMMT Feb25.
    • Code: 53.4 on LiveCodeBench v5, 48.9 on v6, and 1815 Elo on Codeforces.
      These results exceed human-written baselines such as OpenMathReasoning and OpenCodeReasoning, highlighting the scalability of synthetic data.

📜 Citation

If you find this model or the PromptCoT 2.0 framework useful, please cite:

@article{zhao2025promptcot2,
  title     = {PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning},
  author    = {Zhao, Xueliang and Wu, Wei and Guan, Jian and Gong, Zhuocheng and Kong, Lingpeng},
  journal   = {arXiv preprint arXiv:2509.19894},
  year      = {2025},
  url       = {https://arxiv.org/abs/2509.19894}
}