---
license: apache-2.0
datasets:
- ai2-adapt-dev/rlvr_gsm8k_zs
metrics:
- accuracy
base_model:
- swiss-ai/Apertus-8B-Instruct-2509
pipeline_tag: reinforcement-learning
tags:
- rlvr
- grpo
- gsm8k
- apertus
results:
- task:
    type: text-generation
  dataset:
    name: gsm8k
    type: mathematical
    split: test
  metrics:
  - name: GSM8K (Validation, 0-shot, T=0) Accuracy
    type: GSM8K (0-shot, T=0)
    value: 66.23
---

# RLVR Training of Apertus 8B with GRPO on the GSM8K Dataset

## Results

GSM8K validation accuracy (0-shot, T=0) improved from 46.41% before training to 66.23% after GRPO training.

Figure 1. Full validation set accuracy.

Figure 2. Validation set, average tokens used per sequence (capped at 512 tokens).

Figure 3. Training reward.
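
For reference, here is a minimal sketch of the evaluation protocol implied by the reported metric (0-shot, greedy decoding at T=0, 512-token generation cap). The prompt construction and answer-extraction rule are assumptions, not the exact harness used:

```python
# Illustrative GSM8K greedy-decoding evaluation sketch (not the exact harness used).
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "swiss-ai/Apertus-8B-Instruct-2509"  # or the GRPO-trained checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def extract_answer(text: str):
    # Assumed extraction rule: take the last number in the generation.
    nums = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return nums[-1].replace(",", "") if nums else None

def solve(question: str, max_new_tokens: int = 512) -> str:
    # 0-shot: the bare question goes in as a single user turn.
    messages = [{"role": "user", "content": question}]
    inputs = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=max_new_tokens, do_sample=False)  # T=0 => greedy
    return tok.decode(out[0, inputs.shape[1]:], skip_special_tokens=True)
```

Accuracy is then the fraction of validation questions whose extracted answer exactly matches the reference.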
---

## Compute

Training was performed on a single GPU node with 4× NVIDIA H100 (95 GB) GPUs, running for approximately 5 hours.

---

## Hyperparameters

### Rollouts

| Parameter | Value |
|---|---|
| `num_unique_prompts_rollout` | 32 |
| `num_samples_per_prompt_rollout` | 8 |
| `temperature` | 0.8 |

### Optimization

| Parameter | Value |
|---|---|
| `learning_rate` | 3.0e-7 |
| `beta` | 0.01 |
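
These settings correspond to 32 × 8 = 256 sampled completions per training step. As a minimal sketch (not the exact training script used for this model), here is how they might map onto TRL's `GRPOTrainer`; the dataset column names, batch-size mapping, and reward function are assumptions:

```python
# Hedged sketch of the GRPO setup using TRL; not the exact script used for this model.
import re
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# RLVR-style GSM8K data; assumes "messages" / "ground_truth" columns.
dataset = load_dataset("ai2-adapt-dev/rlvr_gsm8k_zs", split="train")
dataset = dataset.rename_column("messages", "prompt")  # TRL expects a "prompt" column

def correctness_reward(completions, ground_truth, **kwargs):
    """Binary verifiable reward: 1.0 if the last number in the completion
    matches the reference answer, else 0.0 (illustrative extraction rule)."""
    texts = [c[0]["content"] if isinstance(c, list) else c for c in completions]
    rewards = []
    for text, answer in zip(texts, ground_truth):
        nums = re.findall(r"-?\d[\d,]*\.?\d*", text)
        pred = nums[-1].replace(",", "") if nums else None
        rewards.append(1.0 if pred == str(answer).strip() else 0.0)
    return rewards

config = GRPOConfig(
    output_dir="apertus-8b-grpo-gsm8k",
    learning_rate=3.0e-7,
    beta=0.01,                      # KL penalty toward the reference policy
    temperature=0.8,                # rollout sampling temperature
    num_generations=8,              # num_samples_per_prompt_rollout
    per_device_train_batch_size=8,  # counts completions in TRL
    gradient_accumulation_steps=8,  # assumed: 8 x 4 GPUs x 8 = 256 completions = 32 unique prompts/step
    max_completion_length=512,      # matches the 512-token generation cap
    bf16=True,
)

trainer = GRPOTrainer(
    model="swiss-ai/Apertus-8B-Instruct-2509",
    reward_funcs=correctness_reward,
    args=config,
    train_dataset=dataset,
)
trainer.train()
```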
---

## Notes

- Format reward was not applied, because neither the instruct nor the base model was able to produce a correct answer when the format was enforced. As a result, the model does not use ` `. (A sketch of such a reward is shown after this list.)
- Funny observation: the model memorized the dataset. In one attempt it answered the question, but because the prompt format was unfamiliar it then started reciting another question from the same dataset; another time it output HTML code, presumably from the page where it originally saw the question.
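
A format reward of the kind mentioned above typically scores whether the completion wraps its final answer in a designated tag. A minimal sketch, assuming a hypothetical `<answer>...</answer>` tag (the actual tag attempted in these experiments is not recorded here):

```python
import re

# Hypothetical format reward: 1.0 if the completion wraps an answer in
# <answer>...</answer> (tag name is illustrative, not the one actually tried).
ANSWER_TAG = re.compile(r"<answer>\s*(.+?)\s*</answer>", re.DOTALL)

def format_reward(completions, **kwargs):
    texts = [c[0]["content"] if isinstance(c, list) else c for c in completions]
    return [1.0 if ANSWER_TAG.search(t) else 0.0 for t in texts]
```

---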

## Acknowledgements

This work builds upon and was inspired by the following contributions: