RLVR Training of Apertus-8B with GRPO on the GSM8K Dataset
Results
Validation accuracy improved from 46.41% to 66.23%.
Compute
Training was performed on a GPU node with 4× NVIDIA H100 (95 GB) GPUs and took approximately 5 hours.
Hyperparameters
| Rollouts | Value |
|---|---|
| num_unique_prompts_rollout | 32 |
| num_samples_per_prompt_rollout | 8 |
| temperature | 0.8 |

| Optimization | Value |
|---|---|
| learning_rate | 3.0e-7 |
| beta | 0.01 |
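
For context on how these settings interact: in GRPO, each batch contains num_unique_prompts_rollout prompts, each sampled num_samples_per_prompt_rollout times at the given temperature, and each completion's advantage is computed relative to the other completions of the same prompt, while beta weights a KL penalty toward the reference policy. The sketch below illustrates only the group-relative advantage step; it is a simplified illustration under these assumptions, not the Open Instruct implementation, and the helper name is hypothetical.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray,
                              num_samples_per_prompt: int = 8,
                              eps: float = 1e-6) -> np.ndarray:
    """Hypothetical helper: normalize each reward against its own prompt group.

    `rewards` is ordered so that consecutive blocks of `num_samples_per_prompt`
    entries are completions of the same prompt.
    """
    groups = rewards.reshape(-1, num_samples_per_prompt)
    mean = groups.mean(axis=1, keepdims=True)
    std = groups.std(axis=1, keepdims=True)
    return ((groups - mean) / (std + eps)).reshape(-1)

# 32 unique prompts x 8 samples per prompt, matching the rollout settings above;
# rewards are binary because GSM8K answers are verified by exact match.
rewards = np.random.randint(0, 2, size=32 * 8).astype(np.float32)
advantages = group_relative_advantages(rewards, num_samples_per_prompt=8)

# beta = 0.01 weights the KL penalty toward the reference policy in the GRPO loss;
# it does not enter the advantage computation shown here.
```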
Notes
- Note: the format reward was not applied, because neither the instruct model nor the base model was able to get a correct answer when it was enforced; as a result, the model does not use `<think> </think>` tags (an illustrative sketch of a correctness-only reward follows this list).
- Amusing observation: the model appears to have memorized parts of the dataset. In one attempt it answered the question, but because the prompt format was unfamiliar it went on to recite another question from the same dataset; in another attempt it output HTML code, presumably from the page where it originally saw the question.
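
For illustration, an RLVR-style correctness reward for GSM8K can be as simple as extracting the final number from the completion and comparing it with the number after "####" in the reference answer. The sketch below is a minimal, assumed implementation; the extraction heuristics and function names are illustrative and not the exact reward used by Open Instruct.

```python
import re

def extract_final_number(text: str) -> str | None:
    """Return the last number in the text; GSM8K references put it after '####'."""
    if "####" in text:
        text = text.split("####")[-1]
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return numbers[-1].replace(",", "").rstrip(".") if numbers else None

def gsm8k_reward(completion: str, reference_answer: str) -> float:
    """Binary verifiable reward: 1.0 if the final numbers match, else 0.0."""
    pred = extract_final_number(completion)
    gold = extract_final_number(reference_answer)
    return float(pred is not None and pred == gold)

# Example: the completion's final number (72) matches the reference answer.
print(gsm8k_reward("Natalia sold 48 + 24 = 72 clips in total. The answer is 72.", "#### 72"))  # 1.0
```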
Acknowledgements
This work builds upon and was inspired by the following contributions:
- RLVR: Verifiable Rewards for Reasoning Models — for introducing the verifiable reward framework used in this experiment.
- Allen Institute for AI — Open Instruct — for providing open-source infrastructure for RLHF/RLVR training.
- Apertus Project — for releasing the Apertus-8B base and instruct models used in this work.
Model tree for ABaroian/Apertus-8B-RLVR-GSM
- Base model: swiss-ai/Apertus-8B-2509
- Fine-tuned: swiss-ai/Apertus-8B-Instruct-2509
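
A minimal inference sketch with the Hugging Face transformers library, assuming the ABaroian/Apertus-8B-RLVR-GSM repository is accessible and the installed transformers version supports the Apertus architecture:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ABaroian/Apertus-8B-RLVR-GSM"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" assumes `accelerate` is installed; adjust for your hardware.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

question = ("Natalia sold clips to 48 of her friends in April, and then she sold "
            "half as many clips in May. How many clips did Natalia sell altogether "
            "in April and May?")
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```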