---
datasets:
- togethercomputer/RedPajama-Data-V2
- LSX-UniWue/LLaMmlein-Dataset
language:
- de
library_name: transformers
license: other
pipeline_tag: feature-extraction
tags:
- masked-lm
- long-context
- modernbert
---

# ModernGBERT 1B

ModernGBERT 1B is a German ModernBERT language model with 1 billion parameters and a native context length of up to 8,192 tokens. It follows the same architecture and training procedure as the ModernBERT [codebase](https://github.com/AnswerDotAI/ModernBERT).
ModernGBERT 1B has been pre-trained on the same 1.27 trillion tokens from the German portion of [RedPajama V2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2) as our [LLäMmlein](https://huggingface.co/collections/LSX-UniWue/llammlein-6732ff41f3705c686e605762) decoder family.

We provide two model sizes:

* [ModernGBERT 1B](https://huggingface.co/LSX-UniWue/ModernGBERT_1B) ← You are here  
  28 layers, hidden size 2,048, 1 billion parameters

* [ModernGBERT 134M](https://huggingface.co/LSX-UniWue/ModernGBERT_134M)   
  22 layers, hidden size 768, 134 million parameters

Find more details in our [preprint](https://arxiv.org/abs/2505.13136)!


### Usage

You can use ModernGBERT with the `transformers` library from version v4.48.0 onwards.
(Optional: install `flash-attn` for the best efficiency.)

Since ModernGBERT 1B is a Masked Language Model (MLM), you can load it via `AutoModelForMaskedLM`. For downstream tasks such as classification, retrieval, or QA, fine-tune the model by following standard BERT fine-tuning recipes.

Example using `AutoModelForMaskedLM`: 

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "LSX-UniWue/ModernGBERT_1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = "Die Hauptstadt von Frankreich ist [MASK]."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# To get predictions for the mask:
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
# Predicted token:  Paris
```
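
For downstream fine-tuning, you can load the encoder with a task-specific head. The snippet below is a minimal sketch of sequence classification with `AutoModelForSequenceClassification` and the standard `transformers` `Trainer`; the dataset path, `num_labels=2`, and all hyperparameters are illustrative placeholders, not settings from our experiments.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

model_id = "LSX-UniWue/ModernGBERT_1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# num_labels is task-dependent; 2 is only a placeholder for a binary task
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Placeholder dataset with "text" and "label" columns; substitute your own task
dataset = load_dataset("path/to/your-german-classification-dataset")

def tokenize(batch):
    # The model supports up to 8,192 tokens; 512 keeps this sketch lightweight
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="moderngbert-1b-cls",
        per_device_train_batch_size=8,
        learning_rate=2e-5,
        num_train_epochs=3,
    ),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    data_collator=DataCollatorWithPadding(tokenizer),  # dynamic padding per batch
)
trainer.train()
```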

**NOTE:** If you want to use Hugging Face's PEFT library for LoRA training, you need to specify the target modules, e.g.:

```python
from peft import LoraConfig, get_peft_model
peft_config = LoraConfig(
    task_type="TOKEN_CLS", r=8, lora_alpha=32,
    target_modules=["Wqkv", "Wi", "Wo"],
)
model = get_peft_model(model, peft_config)
```

### Intermediate Checkpoints
In addition to the final model checkpoint, we publish intermediate checkpoints throughout the full training process as unique branches in this repository. 
A specific checkpoint can be loaded like this: 

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "LSX-UniWue/ModernGBERT_1B"
revision = "base-head-12000-ckpt"
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
model = AutoModelForMaskedLM.from_pretrained(model_id, revision=revision)
```
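
To see which checkpoint branches exist, you can enumerate the repository's revisions with `huggingface_hub`. This is a small sketch using the library's `list_repo_refs` helper:

```python
from huggingface_hub import list_repo_refs

refs = list_repo_refs("LSX-UniWue/ModernGBERT_1B")
# Each branch corresponds to one intermediate training checkpoint
for branch in refs.branches:
    print(branch.name)
```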

### Performance 
We evaluate our models across a broad range of tasks. For natural language understanding, we use the [SuperGLEBer](https://lsx-uniwue.github.io/SuperGLEBer-site/) benchmark; for embedding capabilities, we use the [German MTEB](http://mteb-leaderboard.hf.space/?benchmark_name=MTEB%28deu%2C+v1%29) benchmark, after unsupervised fine-tuning of every model on the German portion of mMARCO. The following table compares this encoder with other German and multilingual encoders; see our [preprint](https://arxiv.org/abs/2505.13136) for details on the evaluation.

| Model                            | SuperGLEBer Avg | MTEB Avg  |
|----------------------------------|-----------------|-----------|
| ModernGBERT 1B<br>(you are here) | **0.808**       | **0.551** |
| ModernGBERT 134M                 | 0.749           | 0.501     |
| GBERT-base                       | 0.718           | 0.500     |
| GBERT-large                      | 0.768           | 0.521     |
| GeBERTa-base                     | 0.716           | 0.493     |
| GeBERTa-large                    | 0.749           | 0.494     |
| GeBERTa-xlarge                   | 0.767           | 0.521     |
| Gerturax-3                       | 0.740           | 0.472     |
| XLM-RoBERTa-large                | 0.730           | 0.460     |
| XLM-RoBERTa-xlarge               | 0.758           | 0.479     |



### License

We release the ModernGBERT models under a research-only RAIL-M license. See [license.md](./license.md) for details.
For concerns about the training data, see our [Data Take Down](https://www.informatik.uni-wuerzburg.de/datascience/projects/nlp/llammlein/) page.