Model Card for AIaLT-IICT/modern_bert_bg_base_uncased
Uncased ModernBERT model trained on Bulgarian literature, web, and other datasets.
Model Details
395M-parameter ModernBERT model trained on 29B tokens (35B depending on tokenization) for 3 epochs with the Masked Language Modelling objective.
- Tokenizer vocabulary size: 50176
- Hidden dimension: 1024
- Feed-Forward dimension: 2624
- Hidden layers: 28
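These values can be checked against the published configuration. A minimal sketch, assuming the standard field names of the ModernBERT config in transformers (e.g. that the feed-forward size is stored as intermediate_size):
>>> from transformers import AutoConfig
>>> config = AutoConfig.from_pretrained('AIaLT-IICT/modern_bert_bg_base_uncased')
>>> # Expect 1024 / 2624 / 28 / 50176, matching the list above
>>> config.hidden_size, config.intermediate_size, config.num_hidden_layers, config.vocab_size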
Developed by: Artificial Intelligence and Language Technologies Department at the Institute of Information and Communication Technologies - Bulgarian Academy of Sciences.
Funded by: The model was pretrained within CLaDA-BG: National Interdisciplinary Research E-Infrastructure for Bulgarian Language and Cultural Heritage - member of the pan-European research consortia CLARIN-ERIC & DARIAH-ERIC, funded by the Ministry of Education and Science of Bulgaria (support for the Bulgarian National Roadmap for Research Infrastructure). The training was performed on the HEMUS supercomputer at IICT-BAS, part of the RIs of the CoE on Informatics and ICT, financed by the OP SESG (2014–2020) and co-financed by the European Union through the ESIF.
Model type: ModernBERT
Language(s) (NLP): Bulgarian.
License: MIT
Uses
The model is intended to be used as a base model for fine-tuning tasks in NLP.
Direct Use
>>> from transformers import (
...     PreTrainedTokenizerFast,
...     ModernBertForMaskedLM,
...     pipeline
... )
>>> model = ModernBertForMaskedLM.from_pretrained('AIaLT-IICT/modern_bert_bg_base_uncased')
>>> tokenizer = PreTrainedTokenizerFast.from_pretrained('AIaLT-IICT/modern_bert_bg_base_uncased')
>>> fill_mask = pipeline(
...     "fill-mask",
...     model=model,
...     tokenizer=tokenizer
... )
>>> fill_mask("Заради 3 завода няма да [MASK] нито есенниците неподхранени, нито зърното да поскъпне заради тях.")
[{'score': 0.24791079759597778,
'token': 26913,
'token_str': 'оставим',
'sequence': 'заради 3 завода няма да оставим нито есенниците неподхранени, нито зърното да поскъпне заради тях.'},
{'score': 0.1209656149148941,
'token': 35612,
'token_str': 'допуснем',
'sequence': 'заради 3 завода няма да допуснем нито есенниците неподхранени, нито зърното да поскъпне заради тях.'},
{'score': 0.10752104222774506,
'token': 17875,
'token_str': 'останат',
'sequence': 'заради 3 завода няма да останат нито есенниците неподхранени, нито зърното да поскъпне заради тях.'},
{'score': 0.09038839489221573,
'token': 12941,
'token_str': 'има',
'sequence': 'заради 3 завода няма да има нито есенниците неподхранени, нито зърното да поскъпне заради тях.'},
{'score': 0.0655432641506195,
'token': 15017,
'token_str': 'остави',
'sequence': 'заради 3 завода няма да остави нито есенниците неподхранени, нито зърното да поскъпне заради тях.'}]
Out-of-Scope Use
The model is not trained with a Next Sentence Prediction objective, so the [CLS] token embedding will not be useful out of the box. If you want to use the model for sequence classification, it is recommended to fine-tune it first, for example along the lines of the sketch below.
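A minimal sequence-classification fine-tuning sketch. The toy dataset, label count, and hyper-parameters are illustrative only, and it assumes the Auto classes resolve to the ModernBERT implementations; this is not the authors' recipe:
>>> from datasets import Dataset
>>> from transformers import (
...     AutoTokenizer,
...     AutoModelForSequenceClassification,
...     Trainer,
...     TrainingArguments
... )
>>> # Toy two-example dataset, for illustration only
>>> data = Dataset.from_dict({
...     "text": ["примерен текст", "друг пример"],
...     "label": [0, 1]
... })
>>> tokenizer = AutoTokenizer.from_pretrained('AIaLT-IICT/modern_bert_bg_base_uncased')
>>> model = AutoModelForSequenceClassification.from_pretrained(
...     'AIaLT-IICT/modern_bert_bg_base_uncased', num_labels=2
... )
>>> tokenized = data.map(lambda x: tokenizer(x["text"], truncation=True), batched=True)
>>> trainer = Trainer(
...     model=model,
...     args=TrainingArguments(output_dir="clf_out", num_train_epochs=1),
...     train_dataset=tokenized,
...     tokenizer=tokenizer
... )
>>> trainer.train()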
Recommendations
It is recommended to use the model as a base for token classification and sequence classification fine-tuning tasks. The model can also be used within the SentenceTransformers framework for producing sentence embeddings, as sketched below.
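A minimal SentenceTransformers sketch; mean pooling is an assumption here, not a setting prescribed by the card:
>>> from sentence_transformers import SentenceTransformer, models
>>> # Wrap the base checkpoint and add a mean-pooling layer over token embeddings
>>> word_embedding = models.Transformer('AIaLT-IICT/modern_bert_bg_base_uncased')
>>> pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="mean")
>>> st_model = SentenceTransformer(modules=[word_embedding, pooling])
>>> st_model.encode(["примерно изречение", "друго изречение"]).shape  # (2, 1024)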
Training Details
Training Data
Trained on 29B tokens consisting of the deduplicated union of:
- uonlp/CulturaX
- MaCoCu-bg 2.0
- HPLT 2.0 Bulgarian (Cyrillic) cleaned
- Literature
- Wikipedia
- others
Training Procedure
Trained with Masked Language Modelling using a 20% masking rate, for 3 epochs, with bf16 mixed precision, a 512-token context and a batch size of 256*512 tokens. After this training session, the context length was extended to 8,192 tokens through additional training with a batch size of 128 tokens, using approximately 900K longer-context documents filtered from the training data, totaling about 7B words.
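The 20% masking rate corresponds to mlm_probability=0.2 in the standard Hugging Face MLM data collator. A minimal sketch of the objective, not the authors' exact training script:
>>> from transformers import PreTrainedTokenizerFast, DataCollatorForLanguageModeling
>>> tokenizer = PreTrainedTokenizerFast.from_pretrained('AIaLT-IICT/modern_bert_bg_base_uncased')
>>> collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.2)
>>> batch = collator([tokenizer("примерно изречение")])
>>> # batch["labels"] holds the original ids at the ~20% of selected positions and -100 elsewhere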
Evaluation
The model is evaluated with the Masked Language Modelling objective on a test split with 20% randomly masked tokens. It achieves a test loss of 0.85 and a test accuracy of 80.16%.
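For reference, accuracy of this kind is typically computed over the masked positions only. A hypothetical helper, assuming labels use -100 for unmasked positions:
>>> import torch
>>> def masked_accuracy(logits, labels):
...     # Accuracy over masked positions only (label -100 = ignored)
...     mask = labels != -100
...     preds = logits.argmax(dim=-1)
...     return (preds[mask] == labels[mask]).float().mean().item()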
Model Card Authors
Nikolay Paev, Kiril Simov
Model Card Contact