Last update: 20 Oct. 2025

Introduction

We announce Motif-2-12.7B-Base, a 12.7-billion-parameter language model. Detailed information, including a technical report, will be released later.

Evaluation

All models listed in the table below are base models. The results of Qwen3 and Gemma 3 are sourced directly from their technical reports.

Benchmark	Evaluation setting	Motif-2-12.7B	Qwen3-14B	Qwen3-32B	Qwen3-30B-A3B	Gemma-3-12B	Gemma-3-27B
MMLU	5-shot	78.1	81.05	83.61	81.38	74.5	78.6
MMLU-Redux	5-shot	78.68	79.88	83.41	81.17	-	-
MMLU-Pro	5-shot, CoT	66.38	61.03	65.54	61.49	45.3	52.2
SuperGPQA	5-shot, CoT	32.68	34.27	39.78	35.72	-	-
BBH	3-shot, CoT	81.34	81.07	87.38	81.54	-	-
GPQA	5-shot, CoT	42.18	39.9	49.49	43.94	-	-
GPQA-Diamond	5-shot, CoT	42.92	-	-	-	25.4	24.3
GSM8K	4-shot, CoT	93.85	92.49	93.4	91.81	-	-
GSM8K	8-shot, CoT	94.92	-	-	-	71	82.6
MATH	4-shot, CoT	73.62	62.02	61.62	59.04	43.3	50
EvalPlus	0-shot	72.22	72.23	72.05	71.45	-	-
MBPP	3-shot	81.5	73.4	78.2	74.4	60.4	65.6
CRUX-O	1-shot	63.1	68.6	72.5	67.2	-	-
HumanEval	0-shot	65.9	-	-	-	45.7	48.8
DROP	1-shot	69.9	-	-	-	72.2	77.2
HellaSwag	10-shot	84	-	-	-	84.2	85.6
BoolQ	0-shot	78.5	-	-	-	78.8	82.4
PIQA	0-shot	81.6	-	-	-	81.8	83.3
SIQA	0-shot	53.8	-	-	-	53.4	54.9
TriviaQA	5-shot	72.2	-	-	-	78.2	85.5
Natural Questions	5-shot	29.6	-	-	-	31.4	36.1
ARC-C	25-shot	69.6	-	-	-	68.9	70.6
ARC-E	0-shot	84.1	-	-	-	88.3	89
WinoGrande	5-shot	79.6	-	-	-	74.3	78.8
BBH	few-shot	81.3	-	-	-	72.6	77.7

	Motif-2-12.7B	Gemma-3-12B	Gemma-3-27B
Average	71.53	63.87	67.96
Relative improvement of Motif-2-12.7B		+11.99%	+5.26%

	Motif-2-12.7B	Qwen3-14B	Qwen3-32B	Qwen3-30B-A3B
Average	69.42	67.81	71.54	68.10
Relative improvement of Motif-2-12.7B		+2.37%	-2.96%	+1.94%
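
The summary rows can be reproduced from the benchmark table under one reading of how they were computed. This is an assumption, not a documented method: the average is taken as an unweighted mean over the rows where both models report a score, and the improvement as the relative difference between the two averages. The sketch below checks this for the Qwen3-14B column; the names `scores`, `mean`, and `relative_improvement` are illustrative only.

```python
# Scores copied from the evaluation table as (Motif-2-12.7B, Qwen3-14B).
# Assumption: only rows where both models report a score enter the average.
scores = {
    "MMLU": (78.1, 81.05),
    "MMLU-Redux": (78.68, 79.88),
    "MMLU-Pro": (66.38, 61.03),
    "SuperGPQA": (32.68, 34.27),
    "BBH (3-shot, CoT)": (81.34, 81.07),
    "GPQA": (42.18, 39.9),
    "GSM8K (4-shot, CoT)": (93.85, 92.49),
    "MATH": (73.62, 62.02),
    "EvalPlus": (72.22, 72.23),
    "MBPP": (81.5, 73.4),
    "CRUX-O": (63.1, 68.6),
}


def mean(values):
    """Unweighted arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def relative_improvement(ours, baseline):
    """Relative difference of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100


motif_avg = mean(m for m, _ in scores.values())
qwen_avg = mean(q for _, q in scores.values())
improvement = relative_improvement(motif_avg, qwen_avg)

print(f"Motif-2-12.7B average: {motif_avg:.2f}")  # 69.42
print(f"Qwen3-14B average:     {qwen_avg:.2f}")   # 67.81
print(f"Relative improvement:  {improvement:+.2f}%")  # +2.37%
```

Under these assumptions the output matches the reported 69.42, 67.81, and +2.37%. The improvement is computed from the unrounded averages, which may explain small last-digit differences when recomputing it from the rounded values shown in the tables.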

Safetensors

Model size: 13B params
Tensor type: F32