---
language:
- en
library_name: transformers
tags:
- reasoning
- reinforcement-learning
- rlvr
- mcts
- math
- iclr-2026
license: apache-2.0
datasets:
- DeepMath-103K
model-index:
- name: DeepSearch-1.5B
  results:
  - task:
      name: Mathematical Reasoning
      type: text-generation
    dataset:
      name: AIME 2024
      type: text
    metrics:
    - type: pass@1
      value: 53.65
  - task:
      name: Mathematical Reasoning
      type: text-generation
    dataset:
      name: AIME 2025
      type: text
    metrics:
    - type: pass@1
      value: 35.42
  - task:
      name: Mathematical Reasoning
      type: text-generation
    dataset:
      name: AMC 2023
      type: text
    metrics:
    - type: pass@1
      value: 90.39
  - task:
      name: Mathematical Reasoning
      type: text-generation
    dataset:
      name: MATH500
      type: text
    metrics:
    - type: pass@1
      value: 92.53
  - task:
      name: Mathematical Reasoning
      type: text-generation
    dataset:
      name: Minerva
      type: text
    metrics:
    - type: pass@1
      value: 40.00
  - task:
      name: Mathematical Reasoning
      type: text-generation
    dataset:
      name: Olympiad
      type: text
    metrics:
    - type: pass@1
      value: 65.72
---
<div align="center">
<span style="font-family: default; font-size: 1.5em;">🚀 DeepSearch-1.5B</span>
</div>

**DeepSearch-1.5B🌟** is a 1.5B-parameter reasoning model trained with **Reinforcement Learning with Verifiable Rewards (RLVR)** and enhanced by **Monte Carlo Tree Search (MCTS)**.  
Unlike prior approaches that restrict structured search to inference, DeepSearch integrates MCTS *into training*, enabling systematic exploration, fine-grained credit assignment, and efficient replay buffering.

This model achieves **state-of-the-art accuracy among 1.5B reasoning models** while being **72× more compute-efficient** than extended RL training baselines.

![Illustration of the DeepSearch algorithm](./deepsearch.png)


---

## Model Details

- **Developed by**: Fang Wu\*, Weihao Xuan\*, Heli Qi\*, Ximing Lu, Aaron Tu, Li Erran Li, Yejin Choi  
- **Institutional affiliations**: Stanford University, University of Tokyo, RIKEN AIP, University of Washington, UC Berkeley, Amazon AWS, Columbia University  
- **Paper**: DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search
- **Base Model**: Nemotron-Research-Reasoning-Qwen-1.5B v2
- **Parameters**: 1.5B  
- **Framework**: veRL
- **License**: Apache-2.0  

---

## Quickstart

### Environment
```bash
pip install vllm # vllm>=v0.8.5.post1 should work
pip install transformers # transformers>=4.52.4 should work
```


### Using vLLM to generate
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer


def convert_question_to_messages(question: str) -> list:
    """Build a single-turn chat that asks for step-by-step reasoning and a boxed answer."""
    return [
        {"role": "user",
         "content": question + " Let's think step by step and output the final answer within \\boxed{}."}
    ]


model_id = "ethan1115/DeepSearch-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Sampling settings for this example; max_tokens leaves room for long reasoning traces.
sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    max_tokens=32768,
)

# Load the model on a single GPU.
model = LLM(
    model=model_id,
    tensor_parallel_size=1,
)

# Render the chat template to a plain string; vLLM tokenizes it during generation.
prompt = tokenizer.apply_chat_template(
    convert_question_to_messages("Find the sum of all integer bases $b>9$ for which $17_{b}$ is a divisor of $97_{b}$."),
    add_generation_prompt=True,
    tokenize=False,
)

# One prompt in, one request output back; take the text of its first completion.
outputs = model.generate({"prompt": prompt}, sampling_params=sampling_params, use_tqdm=False)
response = outputs[0].outputs[0].text
print(response)
```
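
Because the prompt asks for the final answer inside `\boxed{}`, a small parser is usually needed to pull that answer out of the generated text. The helper below is a minimal sketch, not part of the released code: the name `extract_last_boxed` and the `sample` string are hypothetical, and it assumes the last `\boxed{...}` in the response is the final answer.

```python
from typing import Optional


def extract_last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    content_start = start + len(marker)
    depth = 1  # already inside the opening brace of the marker
    for i in range(content_start, len(text)):
        ch = text[i]
        if ch == "{":
            depth += 1  # nested LaTeX group, e.g. \frac{1}{2}
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[content_start:i]
    return None  # unbalanced braces, e.g. the generation was cut off


# Hypothetical response text, used only to illustrate the helper.
sample = "The valid bases are 21 and 49, so the sum is \\boxed{70}."
print(extract_last_boxed(sample))  # 70
```

With the quickstart above, the same call would be `extract_last_boxed(response)`. Brace counting is used instead of a regular expression so that nested groups such as `\boxed{\frac{1}{2}}` are returned whole.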

## Performance

| Benchmark | Nemotron-RR-Qwen-1.5B v2 | DeepSearch-1.5B |
|-----------|--------------------------|-----------------|
| AIME 2024 | 51.77 | **53.65** |
| AIME 2025 | 32.92 | **35.42** |
| AMC 2023  | 88.83 | **90.39** |
| MATH500   | 92.24 | **92.53** |
| Minerva   | 39.75 | **40.00** |
| Olympiad  | 64.69 | **65.72** |
| **Average** | 61.70 | **62.95** |

DeepSearch improves average accuracy by **1.25 points** over Nemotron-RR-Qwen-1.5B v2, the best prior 1.5B model, while using **5.7× fewer GPU hours** than extended RL training.
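
The averages and the 1.25-point gain can be recomputed from the per-benchmark rows. The snippet below is a small check that assumes an unweighted mean over the six benchmarks; the variable names are illustrative only.

```python
# Scores copied from the table above: (Nemotron-RR-Qwen-1.5B v2, DeepSearch-1.5B).
scores = {
    "AIME 2024": (51.77, 53.65),
    "AIME 2025": (32.92, 35.42),
    "AMC 2023": (88.83, 90.39),
    "MATH500": (92.24, 92.53),
    "Minerva": (39.75, 40.00),
    "Olympiad": (64.69, 65.72),
}


def mean(values):
    """Unweighted arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


baseline_avg = mean(base for base, _ in scores.values())
deepsearch_avg = mean(ours for _, ours in scores.values())
gain = deepsearch_avg - baseline_avg

print(f"baseline={baseline_avg:.2f} deepsearch={deepsearch_avg:.2f} gain={gain:+.2f}")
# baseline=61.70 deepsearch=62.95 gain=+1.25
```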


## Training

- **Dataset**: DeepMath-103K (rigorously decontaminated)  
- **Training steps**: 100  
- **Search strategy**:  
  - Global Frontier Selection  
  - Entropy-based guidance  
  - Replay buffer with solution caching  
- **Hardware**: 16× NVIDIA H100 (96GB)  
- **Compute**: ~330 GPU hours  
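
As a rough sanity check on the compute budget, the figures above can be turned into a wall-clock estimate and an implied baseline cost. This is back-of-the-envelope arithmetic only: it assumes all 16 GPUs are busy for the entire run and takes the "5.7× fewer GPU hours" ratio from the Performance section at face value.

```python
NUM_GPUS = 16    # from the hardware line above
GPU_HOURS = 330  # approximate total from the compute line above
RATIO = 5.7      # "5.7x fewer GPU hours", from the Performance section

# Assumption: every GPU is in use for the full duration of training.
wall_clock_hours = GPU_HOURS / NUM_GPUS

# GPU hours the comparison baseline would need if the stated ratio is exact.
implied_baseline_gpu_hours = GPU_HOURS * RATIO

print(f"wall clock: ~{wall_clock_hours:.1f} h")
print(f"implied baseline: ~{implied_baseline_gpu_hours:.0f} GPU hours")
```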

---

## Ethical Considerations

- Positive: Reduces training costs and carbon footprint.  
- Risks: Systematic exploration methods could be adapted to sensitive domains (e.g., code synthesis).  
- Transparency: Full implementation and training details are released for reproducibility.  

---

## Citation

```bibtex
@inproceedings{wu2026deepsearch,
  title={DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search},
  author={Fang Wu and Weihao Xuan and Heli Qi and Ximing Lu and Aaron Tu and Li Erran Li and Yejin Choi},
  booktitle={arXiv},
  year={2026}
}
```