Model Card for modern_bert_bg_large_uncased

An uncased ModernBERT model trained on Bulgarian literature, Web, and other datasets.

Model Details

A 395M-parameter ModernBERT model trained on 29B tokens (35B, depending on tokenization) for 3 epochs with the Masked Language Modelling objective.

Uses

The model is intended to be used as a base model for fine-tuning on NLP tasks.

Direct Use

>>> from transformers import (
...     PreTrainedTokenizerFast,
...     ModernBertForMaskedLM,
...     pipeline,
... )

>>> model = ModernBertForMaskedLM.from_pretrained('AIaLT-IICT/modern_bert_bg_large_uncased')
>>> tokenizer = PreTrainedTokenizerFast.from_pretrained('AIaLT-IICT/modern_bert_bg_large_uncased')

>>> fill_mask = pipeline(
...     "fill-mask",
...     model=model,
...     tokenizer=tokenizer,
... )


>>> fill_mask("Заради 3 завода няма да [MASK] нито есенниците неподхранени, нито зърното да поскъпне заради тях.")

[{'score': 0.24791079759597778,
  'token': 26913,
  'token_str': 'оставим',
  'sequence': 'заради 3 завода няма да оставим нито есенниците неподхранени, нито зърното да поскъпне заради тях.'},
 {'score': 0.1209656149148941,
  'token': 35612,
  'token_str': 'допуснем',
  'sequence': 'заради 3 завода няма да допуснем нито есенниците неподхранени, нито зърното да поскъпне заради тях.'},
 {'score': 0.10752104222774506,
  'token': 17875,
  'token_str': 'останат',
  'sequence': 'заради 3 завода няма да останат нито есенниците неподхранени, нито зърното да поскъпне заради тях.'},
 {'score': 0.09038839489221573,
  'token': 12941,
  'token_str': 'има',
  'sequence': 'заради 3 завода няма да има нито есенниците неподхранени, нито зърното да поскъпне заради тях.'},
 {'score': 0.0655432641506195,
  'token': 15017,
  'token_str': 'остави',
  'sequence': 'заради 3 завода няма да остави нито есенниците неподхранени, нито зърното да поскъпне заради тях.'}]

Out-of-Scope Use

The model is not trained with a Next Sentence Prediction objective, so the [CLS] token embedding will not be useful out of the box. If you want to use the model for sequence classification, fine-tune it first, for example along the lines of the sketch below.
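
A minimal fine-tuning sketch using the Hugging Face Trainer; the CSV files, label count, and hyperparameters are illustrative placeholders, not the authors' recipe:

from datasets import load_dataset
from transformers import (
    PreTrainedTokenizerFast,
    ModernBertForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model_id = "AIaLT-IICT/modern_bert_bg_large_uncased"
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_id)
# num_labels=2 is a placeholder; set it to your task's label count.
model = ModernBertForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Hypothetical CSV files with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="clf_out",
        num_train_epochs=3,
        per_device_train_batch_size=16,
    ),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    processing_class=tokenizer,  # enables the default padding collator
)
trainer.train()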

Recommendations

It is recommended to use the model for token classification and sequence classification fine-tuning tasks. The model can also be used within the SentenceTransformers framework to produce embeddings, as sketched below.
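
A minimal sketch of wrapping the model in SentenceTransformers with mean pooling; the pooling choice and max_seq_length are assumptions, and the resulting embeddings are untuned (fine-tuning on sentence pairs is advisable):

from sentence_transformers import SentenceTransformer, models

word_embedding = models.Transformer(
    "AIaLT-IICT/modern_bert_bg_large_uncased", max_seq_length=512
)
pooling = models.Pooling(
    word_embedding.get_word_embedding_dimension(), pooling_mode="mean"
)
st_model = SentenceTransformer(modules=[word_embedding, pooling])

embeddings = st_model.encode(["Примерно изречение.", "Друго изречение."])
print(embeddings.shape)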

Training Details

Training Data

Trained on 29B tokens consisting of a deduplicated union of Bulgarian literature, Web, and other datasets.

Training Procedure

Trained with Masked Language Modelling at a 20% masking rate for 3 epochs, in bf16 mixed precision, with a 512-token context and a batch size of 256*512 tokens. After this training session, the context length was extended to 8,192 tokens through additional training with a batch size of 128, using approximately 900K longer-context documents filtered from the training data, totaling about 7B words.
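
For reference, this is how the stated 20% masking rate maps onto the Hugging Face MLM data collator; this is a sketch, not the authors' training code, and the sample text is illustrative:

from transformers import PreTrainedTokenizerFast, DataCollatorForLanguageModeling

tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "AIaLT-IICT/modern_bert_bg_large_uncased"
)
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.2,  # 20% of tokens are selected for masking
)
batch = collator([tokenizer("Примерен текст за обучение.")])
print(batch["labels"])  # -100 everywhere except the masked positions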

Evaluation

The model is evaluated on the Masked Language Modelling objective on the test split, with 20% of tokens randomly masked. It achieves a test loss of 0.85 and a test accuracy of 80.16%.
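
A sketch of the masked-token accuracy computation described above: mask ~20% of tokens, predict them, and count exact matches. The input text is illustrative, not the authors' test split, so the printed number will not match the reported 80.16%:

import torch
from transformers import (
    PreTrainedTokenizerFast,
    ModernBertForMaskedLM,
    DataCollatorForLanguageModeling,
)

model_id = "AIaLT-IICT/modern_bert_bg_large_uncased"
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_id)
model = ModernBertForMaskedLM.from_pretrained(model_id).eval()

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.2)
# Use a reasonably long text so that some tokens actually get masked.
batch = collator([tokenizer("Пример за изречение на български език за оценка на модела.")])

with torch.no_grad():
    logits = model(
        input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]
    ).logits

mask = batch["labels"] != -100  # positions that were masked
preds = logits.argmax(dim=-1)
accuracy = (preds[mask] == batch["labels"][mask]).float().mean()
print(f"masked-token accuracy: {accuracy:.2%}")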

Model Card Authors

Nikolay Paev, Kiril Simov

Model Card Contact

[email protected]
