Last update: 20 Oct. 2025

Introduction

We announce Motif-2-12.7B-Base, a 12.7-billion-parameter base language model. Detailed information, including a technical report, will be released later.
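A usage snippet is not part of this card yet. As a minimal loading sketch, the code below assumes the weights are hosted on the Hugging Face Hub and load through the standard transformers auto classes; the repository ID, dtype choice, and `trust_remote_code` flag are assumptions, not details confirmed by this card:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Motif-Technologies/Motif-2-12.7B-Base"  # assumed repository ID

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 to halve memory vs. an F32 checkpoint
    device_map="auto",           # shard layers across available devices (requires accelerate)
    trust_remote_code=True,      # in case the repo ships a custom architecture
)

# Base models continue text; they are not instruction-tuned chat assistants.
inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```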

Evaluation

All models listed in the table below are base models. The results for Qwen3 and Gemma 3 are sourced directly from their technical reports. The evaluation settings denote the number of in-context examples (n-shot) and whether chain-of-thought (CoT) prompting is used; a sketch of this prompt pattern follows the table.

| Benchmark | Evaluation setting | Motif-2-12.7B | Qwen3-14B | Qwen3-32B | Qwen3-30B-A3B | Gemma-3-12B | Gemma-3-27B |
|---|---|---|---|---|---|---|---|
| MMLU | 5-shot | 78.1 | 81.05 | 83.61 | 81.38 | 74.5 | 78.6 |
| MMLU-Redux | 5-shot | 78.68 | 79.88 | 83.41 | 81.17 | - | - |
| MMLU-Pro | 5-shot, CoT | 66.38 | 61.03 | 65.54 | 61.49 | 45.3 | 52.2 |
| SuperGPQA | 5-shot, CoT | 32.68 | 34.27 | 39.78 | 35.72 | - | - |
| BBH | 3-shot, CoT | 81.34 | 81.07 | 87.38 | 81.54 | - | - |
| GPQA | 5-shot, CoT | 42.18 | 39.9 | 49.49 | 43.94 | - | - |
| GPQA-Diamond | 5-shot, CoT | 42.92 | - | - | - | 25.4 | 24.3 |
| GSM8K | 4-shot, CoT | 93.85 | 92.49 | 93.4 | 91.81 | - | - |
| GSM8K | 8-shot, CoT | 94.92 | - | - | - | 71.0 | 82.6 |
| MATH | 4-shot, CoT | 73.62 | 62.02 | 61.62 | 59.04 | 43.3 | 50.0 |
| EvalPlus | 0-shot | 72.22 | 72.23 | 72.05 | 71.45 | - | - |
| MBPP | 3-shot | 81.5 | 73.4 | 78.2 | 74.4 | 60.4 | 65.6 |
| CRUX-O | 1-shot | 63.1 | 68.6 | 72.5 | 67.2 | - | - |
| HumanEval | 0-shot | 65.9 | - | - | - | 45.7 | 48.8 |
| DROP | 1-shot | 69.9 | - | - | - | 72.2 | 77.2 |
| HellaSwag | 10-shot | 84.0 | - | - | - | 84.2 | 85.6 |
| BoolQ | 0-shot | 78.5 | - | - | - | 78.8 | 82.4 |
| PIQA | 0-shot | 81.6 | - | - | - | 81.8 | 83.3 |
| SIQA | 0-shot | 53.8 | - | - | - | 53.4 | 54.9 |
| TriviaQA | 5-shot | 72.2 | - | - | - | 78.2 | 85.5 |
| Natural Questions | 5-shot | 29.6 | - | - | - | 31.4 | 36.1 |
| ARC-C | 25-shot | 69.6 | - | - | - | 68.9 | 70.6 |
| ARC-E | 0-shot | 84.1 | - | - | - | 88.3 | 89.0 |
| WinoGrande | 5-shot | 79.6 | - | - | - | 74.3 | 78.8 |
| BBH | few-shot | 81.3 | - | - | - | 72.6 | 77.7 |
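The exact prompt templates behind these numbers are not published here. As a rough illustration of what an n-shot, chain-of-thought setting means in practice, the Python sketch below assembles a few-shot prompt from placeholder examples (the helper name and the examples are ours, not taken from the Motif evaluation code):

```python
# Illustrative sketch of how an n-shot evaluation prompt is typically assembled.
# The placeholder questions and answers below are not real benchmark items.

def build_few_shot_prompt(examples, question, use_cot=False):
    """Concatenate n solved examples, then the unsolved question."""
    parts = []
    for q, answer, rationale in examples:
        if use_cot:
            # Chain-of-thought: show the worked reasoning before the answer.
            parts.append(f"Question: {q}\nAnswer: {rationale} The answer is {answer}.")
        else:
            parts.append(f"Question: {q}\nAnswer: {answer}")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)

shots = [
    ("What is 2 + 2?", "4", "2 + 2 equals 4."),
    ("What is 10 / 5?", "2", "10 divided by 5 is 2."),
]
print(build_few_shot_prompt(shots, "What is 3 * 7?", use_cot=True))
```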

Averages and relative improvements over the shared benchmarks. Each average is taken only over the benchmarks for which both Motif-2-12.7B and the comparison models report scores, which is why Motif's average differs between the two comparisons; a sketch of the improvement calculation follows the tables.

vs. Gemma-3-Base

|  | Motif-2-12.7B | Gemma-3-12B | Gemma-3-27B |
|---|---|---|---|
| Average | 71.53 | 63.87 | 67.96 |
| Improvement | - | +11.99% | +5.26% |

vs. Qwen3-Base

|  | Motif-2-12.7B | Qwen3-14B | Qwen3-32B | Qwen3-30B-A3B |
|---|---|---|---|---|
| Average | 69.42 | 67.81 | 71.54 | 68.10 |
| Improvement | - | +2.37% | -2.96% | +1.94% |
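For reference, each improvement figure is Motif's average expressed relative to the baseline's average. The short recomputation below reproduces the rows above; because the displayed averages are rounded, the last digit can differ slightly (e.g. +5.25% recomputed vs. +5.26% shown for Gemma-3-27B):

```python
# Improvement = (motif_avg - baseline_avg) / baseline_avg, as a percentage.
comparisons = {
    "vs. Gemma-3": (71.53, {"Gemma-3-12B": 63.87, "Gemma-3-27B": 67.96}),
    "vs. Qwen3": (69.42, {"Qwen3-14B": 67.81, "Qwen3-32B": 71.54, "Qwen3-30B-A3B": 68.10}),
}

for label, (motif_avg, baselines) in comparisons.items():
    for name, avg in baselines.items():
        improvement = (motif_avg - avg) / avg * 100
        print(f"{label} {name}: {improvement:+.2f}%")
```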