Model Description
- Language: Norwegian Bokmål and Nynorsk
- Developed by: HPLT
- Paper: arxiv.org/abs/2511.01066
- Evaluation results: hf.co/datasets/HPLT/2508-datasets-evals using HPLT-E
- License: Apache 2.0
The HPLT Llama-2b collection comprises monolingual decoder-only language models pretrained by the HPLT team as part of the third release.
The models are released as artifacts of our ablation studies, which evaluate different corpora and sampling strategies across multiple languages:
- ⚖️ HPLT Pre-3.0 Comparison: Comparison of data deduplication strategies on a pre-release version of HPLT 3.0 across nine selected languages (HPLT 3.0 pre-release).
- 📚 Corpora Comparison: Evaluation of HPLT 2.0, HPLT 3.0, FineWeb 2.1.0, and MADLAD-400 1.0 on nine selected languages (HPLT 3.0 release).
- 🧰 Web Document Scorer (WDS) Comparison: Analysis of HPLT 3.0 corpora sampled using different WDS thresholds, focusing on Spanish and French (HPLT 3.0 release).
Please find more details in our GitHub repository and pre-print.
Model Architecture
All models follow the Llama architecture with 24 layers, 32 attention heads, and a sequence length of 2048. The tokenizer is Gemma-3, with a vocabulary size of 262K tokens.
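The stated hyperparameters can be checked against the published configuration. The sketch below is a minimal, non-authoritative example: it assumes the checkpoint exposes the standard Llama config keys used by Transformers, and that the sequence length of 2048 corresponds to `max_position_embeddings`. The helper names `config_mismatches` and `load_config_dict` are hypothetical.

```python
MODEL_ID = "HPLT/fineweb-2.1.0-nor_Latn-llama-2b-100bt"

# Values taken from the model card. The mapping of "sequence length" to
# max_position_embeddings is an assumption, not confirmed by the card.
EXPECTED = {
    "num_hidden_layers": 24,
    "num_attention_heads": 32,
    "max_position_embeddings": 2048,
}


def config_mismatches(config, expected=EXPECTED):
    """Return {key: (found, expected)} for every key that differs."""
    return {
        key: (config.get(key), value)
        for key, value in expected.items()
        if config.get(key) != value
    }


def load_config_dict(model_id=MODEL_ID):
    """Download the config from the Hub and return it as a plain dict."""
    # Imported lazily so the comparison helper works without transformers.
    from transformers import AutoConfig

    return AutoConfig.from_pretrained(model_id).to_dict()
```

Calling `config_mismatches(load_config_dict())` requires network access; an empty dict would mean the published config agrees with the values listed here.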
Pretraining Corpus
This model is pretrained from scratch on 100B tokens from FineWeb 2.1.0. For lower-resource languages with fewer than 100B tokens of available data, the datasets are uniformly upsampled (repeated) following Muennighoff et al. (2023). Pretraining is run with the Megatron-LM framework on the LUMI supercomputer, using 16 AMD MI250x nodes.
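To make the upsampling arithmetic concrete, the sketch below computes how many passes over a corpus are needed to fill the 100B-token budget. It is an illustration only: the corpus sizes are hypothetical and do not describe any HPLT dataset, and `passes_over_corpus` is a name introduced here.

```python
TOKEN_BUDGET = 100_000_000_000  # 100B-token pretraining budget from the card


def passes_over_corpus(available_tokens, token_budget=TOKEN_BUDGET):
    """Number of passes over the corpus needed to reach the token budget.

    A value above 1 means the data is repeated (upsampled); a value
    below 1 means only part of the corpus is consumed.
    """
    if available_tokens <= 0:
        raise ValueError("available_tokens must be positive")
    return token_budget / available_tokens


# Hypothetical corpus sizes, for illustration only.
example_sizes = {"small_corpus": 25_000_000_000, "large_corpus": 250_000_000_000}
for name, size in example_sizes.items():
    print(f"{name}: {passes_over_corpus(size):.2f} passes")
```

Under these assumed sizes, the smaller corpus would be repeated four times, while the larger one would be only partly consumed.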
Intended Use
Intended Use Cases: The model is intended for research use in Norwegian Bokmål and Nynorsk and for reproducibility purposes. Since this model is only pretrained, its performance on natural language understanding and generation tasks can potentially be improved through post-training.
Out of Scope: Use of the model in languages other than those explicitly listed as supported in this model card.
How to use
Due to limited storage quota, this repository contains only the following intermediate checkpoints:
2B, 10B, 21B, 31B, 40B, 50B, 61B, 71B, 80B, 90B, main
The other checkpoints can be provided upon request.
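The checkpoints that are currently available can also be listed programmatically. The sketch below assumes the intermediate checkpoints are stored as branches or tags of the model repository and uses `list_repo_refs` from `huggingface_hub`; the helper names and the choice to sort `main` last are illustrative, not part of the release.

```python
import re

MODEL_ID = "HPLT/fineweb-2.1.0-nor_Latn-llama-2b-100bt"
CHECKPOINT_PATTERN = re.compile(r"^\d+B$")  # labels such as "2B" or "10B"


def checkpoint_sort_key(revision):
    """Order labels by token count; 'main' is placed last (assumption)."""
    if revision == "main":
        return float("inf")
    return float(revision.rstrip("B"))


def sort_checkpoints(names):
    """Keep only 'main' and '<number>B' labels and sort them."""
    kept = [n for n in names if n == "main" or CHECKPOINT_PATTERN.match(n)]
    return sorted(kept, key=checkpoint_sort_key)


def list_checkpoints(model_id=MODEL_ID):
    """Query the Hub for the repository's branches and tags (needs network)."""
    # Imported lazily so the sorting helpers work without huggingface_hub.
    from huggingface_hub import list_repo_refs

    refs = list_repo_refs(model_id)
    names = [ref.name for ref in refs.branches] + [ref.name for ref in refs.tags]
    return sort_checkpoints(names)
```

Any label returned by `list_checkpoints()` can be passed as the `revision` argument when loading the model.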
Use with Transformers
You can run inference using the Transformers pipeline abstraction or by using the Auto classes with the generate() function.
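import torch
from transformers import pipeline

# Load the final checkpoint in bfloat16 and place it on the available device(s).
pipe = pipeline(
    "text-generation",
    model="HPLT/fineweb-2.1.0-nor_Latn-llama-2b-100bt",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Illustrative prompt; the pipeline returns a list with one dict per output.
print(pipe("Oslo er hovedstaden i", max_new_tokens=40)[0]["generated_text"])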
A specific intermediate checkpoint can be accessed using the revision argument when loading the model.
import torch
from transformers import AutoModelForCausalLM

# Revision label of the intermediate checkpoint to load (see the list above).
revision = "10B"
model = AutoModelForCausalLM.from_pretrained(
    "HPLT/fineweb-2.1.0-nor_Latn-llama-2b-100bt",
    torch_dtype=torch.bfloat16,
    revision=revision,
    device_map="auto",
)
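For completeness, here is a minimal sketch of text generation with the Auto classes and generate(). It assumes the tokenizer files are available under the same revision as the model weights; the function name `generate_text`, the greedy decoding setting, and the default of 40 new tokens are illustrative choices rather than recommended settings.

```python
MODEL_ID = "HPLT/fineweb-2.1.0-nor_Latn-llama-2b-100bt"


def generate_text(prompt, revision="main", max_new_tokens=40):
    """Load the given revision and return the greedy continuation of prompt."""
    # Imported lazily so that defining the function does not require a GPU stack.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: the tokenizer is stored alongside each checkpoint revision.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, revision=revision)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,
        revision=revision,
        device_map="auto",
    )

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding for reproducible output
        )

    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

For example, `generate_text("Oslo er hovedstaden i", revision="10B")` would download the 10B checkpoint and return its continuation of the prompt. Because the model is only pretrained, it continues text rather than following instructions.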
Cite us
@article{oepen2025hplt,
title={HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models},
author={Oepen, Stephan and Arefev, Nikolay and Aulamo, Mikko and Ba{\~n}{\'o}n, Marta and Buljan, Maja and Burchell, Laurie and Charpentier, Lucas and Chen, Pinzhen and Fedorova, Mariya and de Gibert, Ona and others},
journal={arXiv preprint arXiv:2511.01066},
year={2025}
}