Update README.md
README.md CHANGED
@@ -62,6 +62,16 @@ Below we report performance across general, reasoning, mathematical, and coding
 | **Coding Tasks** | | | | | | | | | |
 | **MBPP** | 57.6 | 57.8 | 58.2 | 59.8 | 61.8 | **75.4** | <u>69.2</u> | 64.4 | 60.2 |
 | **HUMANEVAL** | 50.0 | 51.2 | <u>53.7</u> | **54.3** | **54.3** | **54.3** | 42.1 | 50.6 | 36.0 |
+| **Logic Puzzles** | | | | | | | | | |
+| **COUNTDOWN** | 1.3 | <u>53.3</u> | 53.1 | 35.9 | **75.6** | 6.0 | 1.0 | 0.5 | 23.2 |
+| **KK-4 PEOPLE** | 4.8 | 44.9 | <u>68.0</u> | 64.5 | **92.9** | 26.1 | 4.2 | 7.6 | 42.4 |
+| **KK-8 PEOPLE** | 0.5 | 23.2 | 41.3 | <u>51.6</u> | **82.8** | 5.7 | 1.1 | 1.3 | 13.0 |
+| **ORDER-15 ITEMS** | 4.7 | 30.7 | 47.2 | <u>55.8</u> | **87.6** | 37.0 | 3.5 | 4.5 | 25.0 |
+| **ORDER-30 ITEMS** | 0.0 | 0.3 | 3.0 | <u>34.1</u> | **40.3** | 0.7 | 0.2 | 0.1 | 0.6 |
+| **Instruction Following** | | | | | | | | | |
+| **IFEVAL** | 17.4 | 26.2 | 28.5 | <u>34.5</u> | 26.7 | **40.3** | 15.1 | 17.4 | 13.2 |
+| **Arabic** | | | | | | | | | |
+| **MMLU-Arabic** | 65.4 | 66.1 | 64.5 | 66.6 | 65.5 | **74.1** | 65.0 | <u>66.8</u> | 47.8 |


 Below we report the evaluation results for K2-V2 after supervised fine-tuning (SFT). These variants correspond to three levels of reasoning effort (Low < Medium < High).