Text Generation
Transformers
Safetensors
llama
text-generation-inference
Files changed (1)
  1. README.md +45 -159
README.md CHANGED
@@ -41,7 +41,6 @@ datasets:
  - uonlp/CulturaX
  - bigcode/the-stack
  - common-pile/arxiv_papers
- library_name: transformers
  ---
  **Developed by:** [Tilde.ai](https://tilde.ai/tildeopen-llm/)
  **Funded by:** European Commission via [EuroHPC JU Large AI Grand Challenge](https://www.eurohpc-ju.europa.eu/winners-announced-large-ai-grand-challenge-2024-06-26_en)
@@ -105,161 +104,48 @@ outputs = model.generate(
  )
  ```
  # Evaluation
- ## Belebele Benchmark: Reading Comprehension
- **What is the Belebele Benchmark?** [Belebele](https://aclanthology.org/anthology-files/anthology-files/pdf/acl/2024.acl-long.44.pdf) is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. This dataset enables the evaluation of mono- and multilingual models in high-, medium-, and low-resource languages. Each question has four multiple-choice answers and is linked to a short passage from the FLORES-200 dataset. The human annotation procedure was carefully curated to create questions that discriminate between different levels of generalizable language comprehension and is reinforced by extensive quality checks.
-
- **Why does this Matter?** Belebele tests an LLM's ability to provide answers based on a given text -- a standard use case in retrieval-augmented generation workflows.
-
- **What did we do?** We used the standard implementation of the [belebele](https://github.com/eleutherai/lm-evaluation-harness/tree/main/lm_eval/tasks/belebele) task from the LM Evaluation Harness. We set tokenisers to `use_fast=False`. We report **5-shot** accuracy.
-
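For reference, a minimal sketch of how such a run could be reproduced with the LM Evaluation Harness Python API (the model id, the task/group name, and the exact argument names below are assumptions based on the Harness documentation, not details taken from this card):

```python
# Hypothetical reproduction sketch, not the authors' exact evaluation script.
# Assumes: pip install lm-eval  (EleutherAI lm-evaluation-harness)
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    # use_fast_tokenizer=False mirrors the card's `use_fast=False` setting;
    # "TildeAI/TildeOpen-30b" is a placeholder model id.
    model_args="pretrained=TildeAI/TildeOpen-30b,use_fast_tokenizer=False",
    tasks=["belebele"],
    num_fewshot=5,  # the card reports 5-shot accuracy
)
print(results["results"])
```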
- | 5-shot | **Gemma 2 27b** | **ALIA 40b** | **EuroLLM Prev. 22b** | **TildeOpen 1.1 30b** |
- |----------|:-------------:|:----------:|:------------:|:-------------------:|
- | Bulgarian | 79.8% | 78.8% | **85.3%** | 84.7% |
- | Czech | 81.4% | 78.3% | 85.3% | **85.8%** |
- | German | 81.2% | 80.6% | **85.0%** | 84.3% |
- | English | **88.9%** | 83.0% | 87.6% | 88.3% |
- | Estonian | 72.1% | 73.7% | 82.0% | **82.6%** |
- | Finnish | 79.0% | 78.1% | 84.3% | **85.0%** |
- | French | 82.6% | 80.1% | **85.7%** | 85.0% |
- | Hungarian | 77.9% | 76.2% | 83.3% | **86.2%** |
- | Icelandic | 70.8% | 58.2% | 54.3% | **85.7%** |
- | Italian | 82.1% | 77.8% | 81.0% | **82.4%** |
- | Lithuanian | 76.1% | 76.1% | **85.2%** | 83.3% |
- | Latvian | 78.4% | 77.7% | **84.6%** | **84.6%** |
- | Dutch | 80.2% | 78.9% | 83.2% | **85.0%** |
- | Polish | 78.3% | 77.9% | 82.2% | **83.0%** |
- | Portuguese | 83.8% | 80.1% | 86.1% | **87.1%** |
- | Romanian | 80.3% | 78.8% | 85.3% | **85.9%** |
- | Russian | 79.4% | 79.4% | 84.2% | **84.6%** |
- | Slovak | 78.9% | 78.0% | 84.1% | **85.0%** |
- | Slovenian | 78.0% | 80.0% | 83.7% | **85.1%** |
- | Spanish | 82.1% | 78.4% | **84.1%** | 83.8% |
- | Serbian | 79.8% | 78.4% | 74.1% | **84.2%** |
- | Swedish | 80.6% | 76.3% | **85.3%** | 84.4% |
- | Turkish | 77.4% | 62.3% | 79.9% | **82.7%** |
- | Ukrainian | 78.0% | 77.0% | 83.9% | **85.1%** |
- | **Average** | 79.5% | 76.8% | 82.5% | **84.7%** |
-
- ## MultiBLiMP Benchmark: Grammar Test
- **What is MultiBLiMP?** [MultiBLiMP](https://arxiv.org/pdf/2504.02768) is a massively multilingual test of core grammar. It gives models pairs of almost-identical sentences (one grammatical, one ungrammatical) and asks whether the model assigns a higher probability to the correct one. Version 1.0 covers 101 languages.
-
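To make the minimal-pair setup concrete, here is an illustrative sketch of the underlying idea: score both sentences with a causal LM and check which one receives the higher total log-probability. This is not the Harness implementation, the model id is a placeholder, and the example sentences are invented:

```python
# Illustrative only: the minimal-pair comparison behind MultiBLiMP-style evaluation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TildeAI/TildeOpen-30b"  # placeholder model id
tok = AutoTokenizer.from_pretrained(model_id, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

def sentence_logprob(sentence: str) -> float:
    """Total log-probability the model assigns to a sentence."""
    ids = tok(sentence, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        # With labels=input_ids, the returned loss is the mean token cross-entropy.
        loss = model(input_ids=ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)  # mean NLL -> total log-probability

good, bad = "The keys are on the table.", "The keys is on the table."
model_prefers_grammatical = sentence_logprob(good) > sentence_logprob(bad)
```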
- **Why does this Matter?** MultiBLiMP tests a model's ability to distinguish correct from erroneous language. As with humans, producing mostly correct language is not much of an achievement in itself; what stands out is making any mistakes at all.
-
- **What did we do?**
- We used the standard implementation of the [MultiBLiMP](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/multiblimp) task from the LM Evaluation Harness. We set tokenisers to `use_fast=False`. We report **0-shot** accuracy.
-
- | Language | **Gemma 2 27b** | **ALIA 40b** | **EuroLLM Prev. 22b** | **TildeOpen 1.1 30b** |
- |----------|-------------|----------|---------------------|-------------|
- | Bulgarian | 95.4% | 98.8% | 97.7% | **99.6%** |
- | Czech | 98.6% | **98.9%** | 98.5% | 98.5% |
- | German | 98.8% | 98.7% | 98.0% | **99.4%** |
- | English | 98.4% | 98.7% | 98.7% | **99.4%** |
- | Estonian | 92.0% | 95.6% | 95.8% | **98.3%** |
- | Finnish | 93.0% | 96.3% | 95.2% | **98.5%** |
- | French | 98.2% | 98.8% | 98.7% | **99.3%** |
- | Serbo-Croatian | 94.6% | 98.5% | 96.4% | **99.6%** |
- | Hungarian | 95.9% | 98.8% | 97.8% | **100.0%** |
- | Icelandic | 88.5% | 80.3% | 74.4% | **98.8%** |
- | Italian | 96.0% | 96.7% | 96.6% | **98.2%** |
- | Latvian | 91.6% | 95.2% | 96.9% | **99.1%** |
- | Lithuanian | 95.3% | 99.0% | 99.0% | **99.7%** |
- | Dutch | 94.0% | 96.6% | 96.5% | **99.2%** |
- | Polish | 97.0% | 97.5% | 97.6% | **99.3%** |
- | Portuguese | 96.1% | 97.6% | 97.1% | **98.2%** |
- | Romanian | 97.7% | 98.9% | 98.5% | **98.9%** |
- | Russian | 94.7% | 96.6% | 97.3% | **99.4%** |
- | Slovak | 97.7% | 98.8% | 97.7% | **99.3%** |
- | Slovenian | 99.0% | **100.0%** | **100.0%** | 98.8% |
- | Spanish | 95.6% | 98.0% | 97.3% | **98.7%** |
- | Swedish | 95.8% | 85.1% | 93.8% | **100.0%** |
- | Turkish | 97.6% | **98.7%** | 97.9% | 96.4% |
- | Ukrainian | 95.6% | 98.0% | 97.3% | **99.2%** |
- | **Average** | 95.7% | 96.7% | 96.4% | **99.0%** |
-
- ## Knowledge Tests
-
- ### ARC Benchmark Results
- **What is ARC?** [ARC](https://arxiv.org/pdf/1803.05457), the AI2 Reasoning Challenge, is a multiple-choice science question benchmark **in English**, derived from U.S. grade-school standardized exams. It has two subsets, ARC Easy and ARC Challenge, designed to test factual knowledge and common-sense reasoning.
-
- **Why does this Matter?** ARC probes a model's ability to answer non-trivial questions by applying world knowledge. Although the answer can sometimes be inferred from the question alone, in the classic lm-evaluation-harness ARC implementation the answer choices for each question are **not** provided during inference, placing the emphasis on world knowledge rather than on the model's reasoning capabilities.
-
- **What did we do?**
- We use multilingual translations of ARC provided by [Eurolingua](https://huggingface.co/datasets/Eurolingua/arcx); please refer to the [publication](https://arxiv.org/pdf/2410.08928). Other than the data source, we replicate the standard [LM Evaluation Harness configuration for ARC](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/arc). Our exact configuration is available at [TBA]. We set tokenisers to `use_fast=False`. We report **5-shot** accuracy.
-
- | 5-shot | **ALIA 40b** (Easy) | **EuroLLM Prev. 22b** (Easy) | **TildeOpen 1.1 30b** (Easy) | **ALIA 40b** (Challenge) | **EuroLLM Prev. 22b** (Challenge) | **TildeOpen 1.1 30b** (Challenge) |
- |----------|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
- | Danish | 79.9% | **80.1%** | 79.6% | 53.4% | 52.6% | **53.7%** |
- | German | 79.6% | **79.9%** | 78.0% | 53.4% | **53.6%** | 51.7% |
- | Spanish | **82.9%** | 81.7% | 79.4% | **57.3%** | 56.1% | 52.4% |
- | French | **81.7%** | 81.1% | 78.6% | **56.0%** | 54.5% | 52.8% |
- | Italian | 80.5% | **81.6%** | 78.5% | **56.4%** | 54.8% | 54.1% |
- | Dutch | **80.1%** | 80.0% | 78.8% | **54.0%** | 53.8% | 52.2% |
- | Portuguese | **81.7%** | 81.1% | 79.0% | **56.9%** | 55.5% | 54.1% |
- | Swedish | 80.3% | **80.5%** | 78.7% | 53.8% | 53.1% | **54.1%** |
- | **AVG WEST** | **80.8%** | **80.8%** | 78.8% | **55.2%** | 54.2% | 53.1% |
- | | | | | | | |
- | Bulgarian | **79.8%** | 79.2% | 79.5% | **53.8%** | 51.8% | 52.8% |
- | Czech | **79.5%** | **79.5%** | 78.8% | 51.5% | 52.3% | **53.9%** |
- | Estonian | 72.4% | 73.0% | **73.1%** | 49.6% | 49.8% | **52.0%** |
- | Finnish | 73.8% | **74.2%** | 73.3% | 48.7% | 51.1% | **52.1%** |
- | Hungarian | 74.0% | 73.9% | **74.9%** | 49.3% | 49.0% | **49.6%** |
- | Lithuanian | 76.4% | 76.1% | **77.9%** | 50.3% | 51.6% | **53.0%** |
- | Latvian | 76.2% | **76.4%** | 75.9% | 50.7% | 49.8% | **50.9%** |
- | Polish | **79.2%** | 78.2% | 78.0% | **54.5%** | 53.3% | 52.7% |
- | Romanian | **79.6%** | 78.8% | 78.8% | **55.5%** | 53.7% | 54.5% |
- | Slovak | 78.8% | 79.2% | **79.6%** | 52.5% | 53.0% | **54.7%** |
- | Slovenian | **78.3%** | 78.5% | **78.3%** | **53.4%** | 52.2% | 52.7% |
- | **AVG EAST** | **77.1%** | 77.0% | **77.1%** | 51.8% | 51.6% | **52.6%** |
-
- ### MMLU Benchmark Results
- **What is MMLU?** [MMLU](https://arxiv.org/pdf/2009.03300) is a massive multitask test consisting of multiple-choice questions from various branches of knowledge, **in English**. The test spans subjects in the humanities, social sciences, hard sciences, and other areas, covering 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy, models must possess extensive world knowledge and problem-solving ability. Questions are four-option multiple choice and assess factual knowledge, reading comprehension, and reasoning across disciplines. The tasks can be grouped under four topics (STEM, humanities, social sciences, and other), allowing each group to be evaluated separately.
-
- **Why does this Matter?** Similarly to ARC, MMLU measures broad, general-purpose factual knowledge and some reasoning capabilities. The possible answer choices are included in the prompt, which allows the model to use reasoning to discard false answers rather than relying only on knowing the correct one. Note that some question groups are specific to the Anglophone world, e.g. US history or law.
-
- **What did we do?** We use multilingual translations of MMLU provided by [Eurolingua](https://huggingface.co/datasets/Eurolingua/mmlux); please refer to the [publication](https://arxiv.org/pdf/2410.08928). Other than the data source, we replicate the standard [LM Evaluation Harness configuration for MMLU](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/mmlu/default). Our configuration is available at [TODO]. We set tokenisers to `use_fast=False`. We report **0-shot** accuracy.
-
- | 0-shot | **ALIA 40b** | **EuroLLM Prev. 22b** | **TildeOpen 1.1 30b** |
- |----------|:-----------------:|:---------------------:|:-------------------:|
- | Bulgarian | 48.3% | 52.0% | **56.3%** |
- | Czech | 49.1% | 51.7% | **56.4%** |
- | Danish | 50.2% | 51.1% | **56.6%** |
- | German | 51.0% | 51.8% | **56.2%** |
- | Greek | 50.7% | 50.6% | **50.9%** |
- | Spanish | 53.3% | 53.4% | **56.3%** |
- | Estonian | 48.7% | 49.2% | **55.3%** |
- | Finnish | 47.4% | 48.9% | **55.4%** |
- | French | 53.1% | 53.8% | **56.4%** |
- | Hungarian | 49.9% | 44.4% | **55.2%** |
- | Italian | 52.3% | 53.7% | **57.2%** |
- | Lithuanian | 47.3% | 49.4% | **54.7%** |
- | Latvian | 46.9% | 48.0% | **54.0%** |
- | Dutch | 50.8% | 53.0% | **56.5%** |
- | Polish | 50.6% | 49.6% | **55.6%** |
- | Portuguese | 52.4% | 53.7% | **56.4%** |
- | Romanian | 51.0% | 52.1% | **56.2%** |
- | Slovak | 49.0% | 52.2% | **56.3%** |
- | Slovenian | 48.2% | 50.7% | **55.3%** |
- | Swedish | 49.6% | 51.2% | **56.1%** |
- | **Average** | 50.0% | 51.0% | **55.7%** |
-
- ### National Exams Results
- **What are National Exams?** A curated suite of **multilingual**, publicly available past questions from national-level standardized exams across multiple countries (e.g., high-school exit and university-entrance exams); please refer to the [publication](https://aclanthology.org/2020.emnlp-main.438.pdf). The dataset is available on HuggingFace [here](https://huggingface.co/datasets/mhardalov/exams). Items are presented in multiple-choice format.
-
- **Why does this Matter?** Similarly to MMLU, the model is tested on factual knowledge and reasoning capabilities. However, it should be stressed that for each language the benchmark is **unique** (the exams are different) and available in the **source language** (i.e. not translated). This places emphasis on the model's regional knowledge and eliminates the translation noise present in many other multilingual benchmarks. Possible answer choices are again included during inference, allowing the model to employ reasoning when factual knowledge is lacking.
-
- **What did we do?** [TODO]
-
- | 5-shot | **ALIA 40b** | **EuroLLM Prev. 22b** | **TildeOpen 1.1 30b** |
- |----------|----------|-------------------|-------------------|
- | Bulgarian | 62.4% | 66.8% | **67.8%** |
- | Croatian | 70.8% | **72.5%** | 71.9% |
- | Hungarian | 48.9% | **51.9%** | 48.9% |
- | Italian | **65.5%** | 64.6% | 65.0% |
- | Macedonian | 74.2% | 72.0% | **80.2%** |
- | Polish | 61.2% | 61.4% | **63.5%** |
- | Portuguese | **61.4%** | 60.9% | 59.2% |
- | Albanian | 55.6% | 55.0% | **75.6%** |
- | Serbian | 64.7% | 57.3% | **66.9%** |
- | **Average** | 62.7% | 62.5% | **66.6%** |
 
+ ## Per-Character Perplexity
+ **What is Perplexity?** Perplexity measures how well a language model predicts text. A model with low perplexity makes accurate predictions consistently, while high perplexity means the model is frequently "surprised" by unexpected words or patterns. Lower perplexity indicates that the model has learned the language's patterns more effectively: it is less "surprised" by what it encounters because it better understands how the language works.
+ Perplexity provides a fair evaluation of how well each model handles:
+ - Spelling accuracy across a diverse vocabulary
+ - Grammar rules that span multiple words
+ - Sentence structure and flow
+ - Language-specific patterns (for example, how different languages form plurals or compound words)
+
+ **Why Character-Level?** Different language models use different internal vocabularies - some break text into whole words, others into word fragments, and some into individual characters. This makes direct comparison difficult.
+ Character-level perplexity creates a standardised comparison by calculating how well each model would perform if its predictions were measured character by character. We are not changing how the models work; instead, we use a mathematical conversion to approximate their character-level performance from their token-level predictions.
+
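Concretely, one standard way to do such a conversion (a sketch of the general idea; the exact procedure behind the numbers below is not specified in this card) is to sum the model's token-level negative log-likelihood over a text and renormalise it by the number of characters rather than the number of tokens:

```python
import math

def per_character_perplexity(token_logprobs: list[float], text: str) -> float:
    """Convert token-level log-probabilities (natural log) of `text` into a
    per-character perplexity, so models with different tokenisers become comparable."""
    total_nll = -sum(token_logprobs)        # total negative log-likelihood of the text
    nll_per_char = total_nll / len(text)    # renormalise by characters, not tokens
    return math.exp(nll_per_char)

# Toy example: three tokens covering a 12-character string.
print(per_character_perplexity([-1.2, -0.7, -2.1], "Labdien, AI!"))
```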
+ **Why does this Matter?** Models with lower perplexity generally perform better on real-world tasks like text generation, translation, and understanding context. It's a reliable indicator of overall language competency across different applications.
+
+ **What data did we use?**
+ We use WMT24++ as it is a multilingual, language-parallel evaluation set that none of the models have seen during training. WMT24++ is a composite of texts from news, literature, speech, and social media; thus, it is suitable for foundational model benchmarking.
+
+ | Language | **TildeOpen 30b** | **Gemma 2 27b** | **EuroLLM Prev. 22b** | **ALIA 40b** |
+ |-----------------|---------------|-------------|-------------------|----------|
+ | Bulgarian | **2.0539** | 2.2184 | 2.1985 | 2.1336 |
+ | Czech | **2.1579** | 2.3522 | 2.3221 | 2.2719 |
+ | Danish | **2.0030** | 2.1517 | 2.1353 | 2.0805 |
+ | German | **1.8769** | 1.9285 | 1.9452 | 1.9040 |
+ | English | 2.0378 | **1.9525** | 2.0568 | 2.0261 |
+ | Spanish | 1.9503 | 1.9752 | 2.0145 | **1.9369** |
+ | Estonian | **2.1711** | 2.5747 | 2.3852 | 2.3250 |
+ | Finnish | **2.0497** | 2.2880 | 2.2388 | 2.1831 |
+ | French | **1.8978** | 1.9355 | 1.9282 | 1.9084 |
+ | Croatian | **2.1147** | 2.5440 | 2.4905 | 2.2433 |
+ | Hungarian | **2.0539** | 2.2228 | 2.2256 | 2.1635 |
+ | Icelandic | **2.0873** | 3.0329 | 4.7908 | 3.9570 |
+ | Italian | **1.9565** | 2.0137 | 2.0098 | 1.9887 |
+ | Lithuanian | **2.1247** | 2.4175 | 2.3137 | 2.3075 |
+ | Latvian | **2.1439** | 2.5355 | 2.3141 | 2.3276 |
+ | Dutch | **1.9333** | 2.0312 | 2.0079 | 1.9904 |
+ | Norwegian | **2.1284** | 2.2862 | 2.3506 | 2.2253 |
+ | Polish | **2.0241** | 2.1294 | 2.0803 | 2.0803 |
+ | Portuguese | **1.9899** | 2.0597 | 2.0272 | 2.0187 |
+ | Romanian | **2.0196** | 2.1606 | 2.1641 | 2.1114 |
+ | Russian | **2.0424** | 2.0900 | 2.1095 | 2.0871 |
+ | Slovak | **2.1192** | 2.3380 | 2.3029 | 2.2609 |
+ | Slovenian | **2.1556** | 2.4443 | 2.3398 | 2.2589 |
+ | Serbian | **2.2469** | 2.6351 | 4.2471 | 2.3743 |
+ | Swedish | **2.0410** | 2.1809 | 2.1464 | 2.1211 |
+ | Turkish | **2.0997** | 2.2470 | 2.2202 | 2.2320 |
+ | Ukrainian | **2.1376** | 2.2665 | 2.2691 | 2.2086 |