---
license: apache-2.0
datasets:
- ai2-adapt-dev/rlvr_gsm8k_zs
metrics:
- accuracy
base_model:
- swiss-ai/Apertus-8B-Instruct-2509
pipeline_tag: reinforcement-learning
tags:
- rlvr
- grpo
- gsm8k
- apertus
results:
- task:
    type: text-generation
  dataset:
    name: gsm8k
    type: mathematical
    split: test
  metrics:
  - name: GSM8K (Validation, 0-shot, T=0) Accuracy
    type: GSM8K (0-shot, T=0)
    value: 66.23
---

# RLVR Training of Apertus 8B with GRPO on the GSM8K Dataset

## Results

GSM8K validation accuracy (0-shot, T=0) improved from 46.41% before training to 66.23% after GRPO training.

Figure 1. Full validation set accuracy.

Figure 2. Validation set, average tokens used per sequence (capped at 512 tokens).

Figure 3. Training reward.
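
For reference, here is a minimal sketch of the evaluation protocol implied by the reported metric (0-shot, greedy decoding at T=0, 512-token generation cap). The prompt construction and answer-extraction rule are assumptions, not the exact harness used:

```python
# Illustrative GSM8K greedy-decoding evaluation sketch (not the exact harness used).
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "swiss-ai/Apertus-8B-Instruct-2509"  # or the GRPO-trained checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def extract_answer(text: str):
    # Assumed extraction rule: take the last number in the generation.
    nums = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return nums[-1].replace(",", "") if nums else None

def solve(question: str, max_new_tokens: int = 512) -> str:
    # 0-shot: the bare question goes in as a single user turn.
    messages = [{"role": "user", "content": question}]
    inputs = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=max_new_tokens, do_sample=False)  # T=0 => greedy
    return tok.decode(out[0, inputs.shape[1]:], skip_special_tokens=True)
```

Accuracy is then the fraction of validation questions whose extracted answer exactly matches the reference.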
---

## Compute

Training was performed on a single GPU node with 4× NVIDIA H100 (95 GB) GPUs, running for approximately 5 hours.

---

## Hyperparameters

### Rollouts

| Parameter | Value |
|---|---|
| `num_unique_prompts_rollout` | 32 |
| `num_samples_per_prompt_rollout` | 8 |
| `temperature` | 0.8 |

### Optimization

| Parameter | Value |
|---|---|
| `learning_rate` | 3.0e-7 |
| `beta` | 0.01 |
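
These settings correspond to 32 × 8 = 256 sampled completions per training step. As a minimal sketch (not the exact training script used for this model), here is how they might map onto TRL's `GRPOTrainer`; the dataset column names, batch-size mapping, and reward function are assumptions:

```python
# Hedged sketch of the GRPO setup using TRL; not the exact script used for this model.
import re
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# RLVR-style GSM8K data; assumes "messages" / "ground_truth" columns.
dataset = load_dataset("ai2-adapt-dev/rlvr_gsm8k_zs", split="train")
dataset = dataset.rename_column("messages", "prompt")  # TRL expects a "prompt" column

def correctness_reward(completions, ground_truth, **kwargs):
    """Binary verifiable reward: 1.0 if the last number in the completion
    matches the reference answer, else 0.0 (illustrative extraction rule)."""
    texts = [c[0]["content"] if isinstance(c, list) else c for c in completions]
    rewards = []
    for text, answer in zip(texts, ground_truth):
        nums = re.findall(r"-?\d[\d,]*\.?\d*", text)
        pred = nums[-1].replace(",", "") if nums else None
        rewards.append(1.0 if pred == str(answer).strip() else 0.0)
    return rewards

config = GRPOConfig(
    output_dir="apertus-8b-grpo-gsm8k",
    learning_rate=3.0e-7,
    beta=0.01,                      # KL penalty toward the reference policy
    temperature=0.8,                # rollout sampling temperature
    num_generations=8,              # num_samples_per_prompt_rollout
    per_device_train_batch_size=8,  # counts completions in TRL
    gradient_accumulation_steps=8,  # assumed: 8 x 4 GPUs x 8 = 256 completions = 32 unique prompts/step
    max_completion_length=512,      # matches the 512-token generation cap
    bf16=True,
)

trainer = GRPOTrainer(
    model="swiss-ai/Apertus-8B-Instruct-2509",
    reward_funcs=correctness_reward,
    args=config,
    train_dataset=dataset,
)
trainer.train()
```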
---

## Notes

- Format reward was not applied, because neither the instruct nor the base model was able to produce a correct answer when the format was enforced. As a result, the model does not use ` `. (A sketch of such a reward is shown after this list.)
- Funny observation: the model memorized the dataset. In one attempt it answered the question, but because the prompt format was unfamiliar it then started reciting another question from the same dataset; another time it output HTML code, presumably from the page where it originally saw the question.
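
A format reward of the kind mentioned above typically scores whether the completion wraps its final answer in a designated tag. A minimal sketch, assuming a hypothetical `<answer>...</answer>` tag (the actual tag attempted in these experiments is not recorded here):

```python
import re

# Hypothetical format reward: 1.0 if the completion wraps an answer in
# <answer>...</answer> (tag name is illustrative, not the one actually tried).
ANSWER_TAG = re.compile(r"<answer>\s*(.+?)\s*</answer>", re.DOTALL)

def format_reward(completions, **kwargs):
    texts = [c[0]["content"] if isinstance(c, list) else c for c in completions]
    return [1.0 if ANSWER_TAG.search(t) else 0.0 for t in texts]
```

---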

## Acknowledgements

This work builds upon and was inspired by the following contributions: