---
license: other
license_link: LICENSE
library_name: transformers
pipeline_tag: text-generation
datasets:
- amd/SAND-Post-Training-Dataset
language:
- en
base_model:
- Qwen/Qwen2.5-32B-Instruct
---

# SAND-Reasoning: Best-in-Class Large Reasoning Models Built with Only Synthetic Data on AMD GPUs

<div align="center">

| [**📄 Technical Report**](https://arxiv.org/pdf/2507.20527) | [**💾 Synthetic Datasets**](https://huggingface.co/datasets/amd/SAND-Post-Training-Dataset) | [**💻 GitHub Repository**](https://github.com/AMD-AGI/sand-pipeline) | [**📝 Blog Post**](https://rocm.blogs.amd.com/artificial-intelligence/sand-math/README.html) |
| :---: | :---: | :---: | :---: |

</div>

---

## Model Summary

We introduce **SAND-Math-Qwen2.5-32B** and **SAND-MathScience-DeepSeek-Qwen32B**, reasoning models built entirely with a synthetic data pipeline running on the **AMD ROCm™ stack** and **AMD Instinct™ MI325 GPUs**.

By prioritizing data difficulty over sheer quantity, we demonstrate that high-difficulty synthetic data can elevate prior-generation models to match or exceed current-generation models. `SAND-Math-Qwen2.5-32B` is fine-tuned from **Qwen2.5-32B-Instruct** on just **14k synthetic math samples**, achieving strong reasoning with minimal data and outperforming other data-distillation and post-training approaches. `SAND-MathScience-DeepSeek-Qwen32B` is fine-tuned from **DeepSeek-R1-Distill-Qwen-32B** on a compact dataset of **27k samples** (15k math + 12k science), achieving a generational leap in performance that rivals **Qwen3-32B**.

We are releasing the models, datasets, and code to empower the community to build their own state-of-the-art reasoning models on AMD hardware.

## 📊 Benchmark Results

We conducted extensive experiments to validate that our pipeline yields superior results compared to models trained on significantly larger datasets.

### 1. Bridging the Generational Gap
Fine-tuning the Qwen2.5-based **DeepSeek-R1-Distill-Qwen-32B** on our mixed Math/Science dataset allows it to rival and even surpass the next-generation **Qwen3-32B** on key benchmarks.

| Model | AIME24 | AIME25 | MATH500 | GPQA |
| :--- | :---: | :---: | :---: | :---: |
| DeepSeek-R1-Distill-Qwen-32B (Base) | 72.6 | 54.9 | 94.3 | 62.1 |
| EXAONE Deep 32B | 72.1 | 65.8 | 95.8 | 66.1 |
| Qwen3-32B (Thinking mode) | 81.4 | 72.9 | **97.0** | 68.4 |
| **SAND-MathScience-DeepSeek-Qwen32B (Ours)** | **83.85** | **78.33** | 93.85 | **68.72** |

### 2. Efficiency: Unlocking Reasoning with Less Data
Using only **14k synthetic math samples** and standard SFT (no RL), our approach outperforms models trained on datasets 5x to 50x larger.

| Model | Data Size | AIME24 | AIME25 | MATH500 | GPQA |
| :--- | :--- | :---: | :---: | :---: | :---: |
| Qwen2.5-32B-Instruct (Base) | - | 16.7 | 13.3 | 83.4 | 53.5 |
| DeepSeek-R1-Distill-Qwen-32B | 800k | 72.6 | 54.9 | 94.3 | 62.1 |
| Light-R1-32B | 79k | 73.0 | 64.3 | 93.3 | 60.6 |
| OpenThinker-32B | 114k | 66.0 | 53.3 | 89.4 | 57.6 |
| **SAND-Math-Qwen2.5-32B (Ours)** | **14k** | **74.01** | **68.18** | **92.05** | **60.8** |

---

## ⚙️ The Synthetic Data Pipeline

Our results are powered by a multi-stage automated pipeline running on AMD hardware that prioritizes **difficulty and novelty** over volume. Unlike datasets that recycle easy problems, our pipeline leverages a teacher model (`gpt-oss-120b`) to generate, validate, and systematically "hike" the difficulty of reasoning problems.

![Pipeline Overview](PipelineSimple.png)

### Pipeline Stages

1. **Stage 1: QA Generation & Consistency** 🛠️
   - Generates novel problems from scratch
   - Enforces correctness by requiring the teacher to generate multiple independent solution paths
   - Keeps only questions whose independently derived answers all agree (a minimal sketch of this filter follows this list)

2. **Stage 2: De-duplication & Decontamination** 🧹
   - Removes internal duplicates via embedding similarity (see the second sketch below)
   - **Crucial step:** scans against known test sets (AIME, MATH, GPQA) to ensure zero contamination

3. **Stage 3: Difficulty Hiking** 🏔️
   - Rewrites moderately challenging questions with the teacher model, introducing deeper reasoning chains, added constraints, or cross-domain logic to systematically elevate complexity
   - A configurable step, used primarily when the initial generation yields too few high-difficulty samples (see the prompt sketch below)
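
To make Stage 1 concrete, here is a minimal sketch of the consistency filter, assuming a caller-supplied `solve` function that queries the teacher model for one solution; the actual pipeline implementation may differ.

```python
from collections import Counter
from typing import Callable, Optional

def consistency_filter(
    question: str,
    solve: Callable[[str], str],  # queries the teacher model, returns a final answer
    n_paths: int = 4,
) -> Optional[str]:
    """Keep a question only if all independently sampled answers agree."""
    answers = [solve(question) for _ in range(n_paths)]
    # Unanimous agreement across solution paths -> treat the answer as verified.
    if len(Counter(answers)) == 1:
        return answers[0]
    return None  # discard: the teacher disagrees with itself
```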
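
Stage 2 can be sketched similarly. Below is one plausible embedding-similarity decontamination filter using `sentence-transformers`; the embedding model and threshold here are our assumptions, not the pipeline's published settings.

```python
from sentence_transformers import SentenceTransformer

# Assumed embedding model and threshold; the production pipeline may use different ones.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def decontaminate(candidates: list[str], test_set: list[str], threshold: float = 0.9) -> list[str]:
    """Drop any candidate question that is too similar to a benchmark question."""
    cand_emb = encoder.encode(candidates, normalize_embeddings=True)
    test_emb = encoder.encode(test_set, normalize_embeddings=True)
    # With normalized embeddings, cosine similarity is a plain dot product.
    sims = cand_emb @ test_emb.T
    keep = sims.max(axis=1) < threshold
    return [q for q, keep_q in zip(candidates, keep) if keep_q]
```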
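
Stage 3 is driven by prompting the teacher model. A hedged sketch of what a difficulty-hiking instruction could look like is shown below; the exact production prompt is not reproduced here.

```python
# Hypothetical difficulty-hiking prompt; the production prompt may differ.
HIKE_PROMPT = """You are given a math problem of moderate difficulty.
Rewrite it into a strictly harder problem by doing at least one of the following:
- deepen the reasoning chain required,
- add a non-trivial constraint,
- combine it with a concept from another domain.
Keep the problem well-posed and solvable, and state only the new problem.

Problem:
{question}
"""

def hike_difficulty(question: str, teacher) -> str:
    """Ask the teacher model for a harder rewrite; `teacher` is any client
    exposing a hypothetical complete(prompt) -> str method."""
    return teacher.complete(HIKE_PROMPT.format(question=question))
```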

---

## 🚀 Quick Start

### Python Inference (Transformers)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "amd/SAND-Math-Qwen2.5-32B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example prompt; the trailing directive follows the Usage Recommendations below.
prompt = (
    "Find the number of pairs of positive integers $(m, n)$ such that "
    "$m^2 + n < 22$ and $n^2 + m < 22$. "
    "Please reason step by step, and put your final answer within \\boxed{}."
)
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=4096,  # raise toward 32768 for harder problems that need long CoT
    temperature=0.7,      # recommended temperature; avoid greedy decoding
    do_sample=True
)
# Keep only the newly generated tokens, stripping the prompt.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print("Response:", response)
```
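
The decoded response typically opens with a long chain-of-thought; if you force the `<think>\n` prefix described under Usage Recommendations, the final answer should appear after the closing `</think>` tag.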

### Serving (vLLM & SGLang)

You can easily serve this model as an OpenAI-compatible API endpoint.

**Using SGLang:**
```bash
python -m sglang.launch_server --model-path amd/SAND-Math-Qwen2.5-32B --max-model-len 32768
```

**Using vLLM:**
```bash
vllm serve amd/SAND-Math-Qwen2.5-32B --max-model-len 32768
```
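
Once a server is running, any OpenAI-compatible client can query it. A minimal sketch, assuming vLLM's default port 8000 (SGLang defaults to 30000; adjust `base_url` accordingly):

```python
from openai import OpenAI

# Points at the local vLLM server started above; no real API key is required.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="amd/SAND-Math-Qwen2.5-32B",
    messages=[{
        "role": "user",
        "content": "Please reason step by step, and put your final answer "
                   "within \\boxed{}. What is $17^2 - 13^2$?",
    }],
    temperature=0.7,   # recommended sampling temperature
    max_tokens=30000,  # generous CoT budget; prompt + output must fit in --max-model-len
)
print(completion.choices[0].message.content)
```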

---

## 💡 Usage Recommendations

To replicate our benchmark results and achieve the best reasoning performance, we strongly recommend the following configuration:

* **Temperature:** Set `temperature=0.7`. **Do not use greedy decoding**; it can degrade performance and cause repetitive loops.
* **Prompting:** For mathematical problems, include a directive to enforce structure:
  > "Please reason step by step, and put your final answer within \boxed{}."
* **Context Length:** Allow an output length of **32,768 tokens** so the model has sufficient space for long Chain-of-Thought (CoT) generation.
* **Thinking Token:** Force the model to begin its response with the `<think>\n` token to reliably trigger reasoning mode (see the sketch after this list).
* **Evaluation:** When benchmarking, sample multiple generations per problem and average the results for stability, rather than relying on a single pass.
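
Putting these together: one simple way to force the thinking prefix is to append it to the templated prompt, continuing the Quick Start snippet above.

```python
# Build the prompt, then append the thinking token so generation
# starts inside the reasoning block.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
text += "<think>\n"
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
```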

---

## 📜 License

This project is licensed under the **Open RAIL-MSD** license, an open, royalty-free license that permits commercial use, modification, and distribution of the dataset, models, and source code.

The license includes standard use-based restrictions to prevent harmful applications (e.g., illegal activities, generation of harmful content, high-risk applications). These restrictions are designed to promote responsible AI development while keeping the license permissive for legitimate use cases.

For the full license terms and conditions, please see the [LICENSE](https://github.com/AMD-AGI/sand-pipeline/blob/main/LICENSE.txt) file.

---

## Citation

If you use this model, dataset, or pipeline in your research, please cite our work:

```bibtex
@misc{manem2025sandmathusingllmsgenerate,
      title={SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers},
      author={Chaitanya Manem and Pratik Prabhanjan Brahma and Prakamya Mishra and Zicheng Liu and Emad Barsoum},
      year={2025},
      eprint={2507.20527},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.20527},
}
```