--- license: apache-2.0 datasets: - oscar-corpus/OSCAR-2301 language: - hu base_model: - answerdotai/ModernBERT-base pipeline_tag: fill-mask --- # Hungarian ModernBERT-base Model & tokenizer pretrained from scratch (MLM task) only on cleaned(removal of toxic and sexual content, automatic language correctness filtering) and semantically segmented Hungarian text. Vocab: 52k Context length of up to 8,192 tokens Train dataset: ~1billion tokens