---
language:
- en
- eng
inference: false
tags:
- T5
- t5
- HPLT
- encoder-decoder
- text2text-generation
license: apache-2.0
datasets:
- HPLT/HPLT3.0
---

# HPLT v3.0 T5 for English

<img src="https://hplt-project.org/_next/static/media/logo-hplt.d5e16ca5.svg" width=12.5%>

This is one of the monolingual encoder-decoder language models trained as part of the third release by the [HPLT project](https://hplt-project.org/). It is a text-to-text transformer trained with a denoising objective. Our models follow the setup of [NorT5](https://aclanthology.org/2023.nodalida-1.61/).

We present monolingual NorT5-style models for 57 of the 198 languages in the [HPLT v3.0 dataset](https://hplt-project.org/datasets/v3.0).

All the HPLT encoder-decoder models use the same hyper-parameters, roughly following the T5-base setup:
- hidden size: 768
- attention heads: 12
- layers: 12
- vocabulary size: 32768

Every model uses its own tokenizer trained on language-specific HPLT data.

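If you want to verify these sizes for a given model, you can load its configuration and tokenizer and inspect them. A minimal sketch, using the English repository referenced elsewhere on this card; printing the whole config avoids assuming specific attribute names of the custom configuration class:

```python
from transformers import AutoConfig, AutoTokenizer

repo = "HPLT/hplt_t5_base_3_0_eng_Latn"

# The configuration class ships with the repository, hence trust_remote_code=True.
config = AutoConfig.from_pretrained(repo, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo)

print(config)          # hidden size, attention heads, layers, ...
print(len(tokenizer))  # should match the vocabulary size above (32768)
```
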
The training code is available in the [HPLT-WP4 repository](https://github.com/hplt-project/HPLT-WP4).

## Example usage

This model currently needs a custom wrapper from `modeling_nort5.py`, so you should load it with `trust_remote_code=True`.

```
pip install transformers==4.46.1
```

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_path = 'HPLT/hplt_t5_base_3_0_eng_Latn'
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_path, trust_remote_code=True, use_safetensors=False,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Masked language modeling: the model generates the text hidden behind [MASK_1]
sentence = "The facial expression [MASK_1] has developed into a widespread cultural phenomenon."
encoding = tokenizer(sentence, return_tensors="pt")
mask_1 = tokenizer.convert_tokens_to_ids("[MASK_1]")
mask_2 = tokenizer.convert_tokens_to_ids("[MASK_2]")
output_tensor = model.generate(
    encoding.input_ids,
    decoder_start_token_id=mask_1,
    eos_token_id=mask_2,
)
print(tokenizer.decode(output_tensor.squeeze(), skip_special_tokens=False))
# prints the prediction for the masked span, in the form '[MASK_1]<predicted text>[MASK_2]'
```
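
The generated sequence starts with the `[MASK_1]` sentinel (used as the decoder start token) and, assuming generation stopped at the end-of-sequence token configured above, ends with `[MASK_2]`. A minimal post-processing sketch, reusing the variables from the example above, to recover only the predicted text:

```python
# Drop the leading [MASK_1] sentinel and the trailing [MASK_2] token,
# keeping only the tokens generated for the masked span.
filled_ids = output_tensor.squeeze()[1:-1]
print(tokenizer.decode(filled_ids, skip_special_tokens=True))
```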

## Intermediate checkpoints

We release 10 intermediate checkpoints for each model, one every 3125 training steps, in separate branches. The naming convention is `stepXXX`: for example, `step18750`.

You can load a specific model revision with `transformers` using the `revision` argument:
```python
model = AutoModelForSeq2SeqLM.from_pretrained("HPLT/hplt_t5_base_3_0_eng_Latn", revision="step21875", trust_remote_code=True)
```

You can access all the revisions for the models with the following code:
```python
from huggingface_hub import list_repo_refs
out = list_repo_refs("HPLT/hplt_t5_base_3_0_eng_Latn")
print([b.name for b in out.branches])
```
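
To inspect training dynamics, the two snippets above can be combined: list the `step*` branches and load each checkpoint in turn. A minimal sketch; the parameter count printed per checkpoint is only a placeholder for whatever analysis you actually want to run, and note that every revision is downloaded in full:

```python
from huggingface_hub import list_repo_refs
from transformers import AutoModelForSeq2SeqLM

repo = "HPLT/hplt_t5_base_3_0_eng_Latn"

# Collect the intermediate-checkpoint branches (named step<N>) in training order.
refs = list_repo_refs(repo)
steps = sorted(
    (b.name for b in refs.branches if b.name.startswith("step")),
    key=lambda name: int(name.removeprefix("step")),
)

for revision in steps:
    # Each branch holds a full snapshot; the custom wrapper still needs trust_remote_code=True.
    model = AutoModelForSeq2SeqLM.from_pretrained(
        repo, revision=revision, trust_remote_code=True, use_safetensors=False,
    )
    print(revision, sum(p.numel() for p in model.parameters()))
```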

## Cite us

```bibtex
@inproceedings{samuel-etal-2023-norbench,
    title = "{N}or{B}ench {--} A Benchmark for {N}orwegian Language Models",
    author = "Samuel, David and
      Kutuzov, Andrey and
      Touileb, Samia and
      Velldal, Erik and
      {\O}vrelid, Lilja and
      R{\o}nningstad, Egil and
      Sigdel, Elina and
      Palatkina, Anna",
    editor = {Alum{\"a}e, Tanel and
      Fishel, Mark},
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = may,
    year = "2023",
    address = "T{\'o}rshavn, Faroe Islands",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2023.nodalida-1.61/",
    pages = "618--633",
    abstract = "We present NorBench: a streamlined suite of NLP tasks and probes for evaluating Norwegian language models (LMs) on standardized data splits and evaluation metrics. We also introduce a range of new Norwegian language models (both encoder and encoder-decoder based). Finally, we compare and analyze their performance, along with other existing LMs, across the different benchmark tests of NorBench."
}
```

```bibtex
@inproceedings{burchell-etal-2025-expanded,
    title = "An Expanded Massive Multilingual Dataset for High-Performance Language Technologies ({HPLT})",
    author = {Burchell, Laurie and
      de Gibert, Ona and
      Arefyev, Nikolay and
      Aulamo, Mikko and
      Ba{\~n}{\'o}n, Marta and
      Chen, Pinzhen and
      Fedorova, Mariia and
      Guillou, Liane and
      Haddow, Barry and
      Haji{\v{c}}, Jan and
      Helcl, Jind{\v{r}}ich and
      Henriksson, Erik and
      Klimaszewski, Mateusz and
      Komulainen, Ville and
      Kutuzov, Andrey and
      Kyt{\"o}niemi, Joona and
      Laippala, Veronika and
      M{\ae}hlum, Petter and
      Malik, Bhavitvya and
      Mehryary, Farrokh and
      Mikhailov, Vladislav and
      Moghe, Nikita and
      Myntti, Amanda and
      O{'}Brien, Dayy{\'a}n and
      Oepen, Stephan and
      Pal, Proyag and
      Piha, Jousia and
      Pyysalo, Sampo and
      Ram{\'i}rez-S{\'a}nchez, Gema and
      Samuel, David and
      Stepachev, Pavel and
      Tiedemann, J{\"o}rg and
      Vari{\v{s}}, Du{\v{s}}an and
      Vojt{\v{e}}chov{\'a}, Tereza and
      Zaragoza-Bernabeu, Jaume},
    editor = "Che, Wanxiang and
      Nabende, Joyce and
      Shutova, Ekaterina and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.854/",
    doi = "10.18653/v1/2025.acl-long.854",
    pages = "17452--17485",
    ISBN = "979-8-89176-251-0",
    abstract = "Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora, extending prior work of the HPLT project. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value."
}
```