Model Description

Language: Norwegian Bokmål and Nynorsk
Developed by: HPLT
Paper: arxiv.org/abs/2511.01066
Evaluation results: hf.co/datasets/HPLT/2508-datasets-evals using HPLT-E
License: Apache 2.0

HPLT's Llama-2b collection comprises monolingual decoder-only language models pretrained by the HPLT team as part of the third release.

The models are released as artifacts of our ablation studies, which evaluate different corpora and sampling strategies across multiple languages:

⚖️ HPLT Pre-3.0 Comparison: Comparison of data deduplication strategies on a pre-release version of HPLT 3.0 across nine selected languages (HPLT 3.0 pre-release).
📚 Corpora Comparison: Evaluation of HPLT 2.0, HPLT 3.0, FineWeb 2.1.0, and MADLAD-400 1.0 on nine selected languages (HPLT 3.0 release).
🧰 Web Document Scorer (WDS) Comparison: Analysis of HPLT 3.0 corpora sampled using different WDS thresholds, focusing on Spanish and French (HPLT 3.0 release).

Please find more details in our GitHub repository and pre-print.

Model Architecture

All models follow the Llama architecture with 24 layers, 32 attention heads, and a sequence length of 2048. The tokenizer is Gemma-3, with a vocabulary size of 262K tokens.

Pretraining Corpus

This model is pretrained from scratch on 100B tokens from FineWeb 2.1.0. For lower-resource languages with fewer than 100B tokens of available data, datasets are uniformly upsampled (repeated) following Muennighoff et al. (2023). Pretraining is run with the Megatron-LM framework on the LUMI supercomputer, using 16 AMD MI250x nodes.
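As a rough illustration of what uniform upsampling to a fixed token budget implies, the sketch below computes how many passes over a corpus are needed to reach 100B training tokens. The function name `repetition_epochs` and the 25B-token corpus size are hypothetical and chosen only for the example; they are not figures from the HPLT training setup.

```python
# 100B training tokens, as stated in this model card.
TOKEN_BUDGET = 100_000_000_000


def repetition_epochs(available_tokens: int, budget: int = TOKEN_BUDGET) -> float:
    """Return how many passes over a corpus are needed to fill the token budget.

    A value above 1 means the corpus is repeated (upsampled); a value below 1
    means only part of the corpus is seen during training.
    """
    if available_tokens <= 0:
        raise ValueError("available_tokens must be positive")
    return budget / available_tokens


# Hypothetical corpus of 25B tokens (illustrative number only).
epochs = repetition_epochs(25_000_000_000)
print(f"Passes over the corpus: {epochs:.1f}")
```

Under these assumptions, a 25B-token corpus would be repeated four times to fill the budget.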

Intended Use

Intended Use Cases: The model is intended for research use in Norwegian Bokmål and Nynorsk and for reproducibility purposes. Because this model is only pretrained, its performance on natural language understanding and generation tasks can potentially be improved through post-training.

Out of Scope: Use of the model in languages other than those explicitly listed as supported in this model card.

How to use

Due to limited quota, this repository contains only the following checkpoints:

2B
10B
21B
31B
40B
50B
61B
71B
80B
90B
main

The other checkpoints can be provided upon request.

Use with Transformers

You can run inference using the Transformers pipeline abstraction or by using the Auto classes with the generate() function.

import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="HPLT/fineweb-2.1.0-nor_Latn-llama-2b-100bt", 
    torch_dtype=torch.bfloat16, 
    device_map="auto"
)
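The Auto classes route mentioned above could look roughly like the following sketch. It assumes the repository ships a tokenizer alongside the weights; the helper name `generate_text`, the greedy decoding settings, and the Norwegian prompt are illustrative choices rather than recommendations from the authors.

```python
def generate_text(
    prompt,
    model_id="HPLT/fineweb-2.1.0-nor_Latn-llama-2b-100bt",
    revision="main",
    max_new_tokens=50,
):
    """Generate a continuation of `prompt` with the Auto classes and generate()."""
    # Imported inside the function so the heavy dependencies load only when needed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: the tokenizer is stored in the same repository as the model.
    tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        revision=revision,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )

    # Tokenize the prompt and move the tensors to the model's device.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Greedy decoding keeps the example deterministic.
    with torch.no_grad():
        output_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (downloads the model weights on first use):
# print(generate_text("Oslo er hovedstaden i"))
```

Since the model is only pretrained, it continues the prompt rather than following instructions, so prompts are best phrased as the beginning of a text.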

A specific intermediate checkpoint can be accessed using the revision argument when loading the model.

from transformers import AutoModelForCausalLM
import torch

revision = "10B"

model = AutoModelForCausalLM.from_pretrained(
    "HPLT/fineweb-2.1.0-nor_Latn-llama-2b-100bt",
    torch_dtype=torch.bfloat16,
    revision=revision,
    device_map="auto"
)
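When comparing models at a given training budget, it can help to pick the published checkpoint closest to a target token count. The sketch below is one way to do this; the helper `nearest_revision` is hypothetical, and mapping "main" to 100B tokens is an assumption based on the stated 100B-token pretraining budget.

```python
# Revision names listed in this model card, mapped to billions of training tokens.
# Assumption: "main" is the final checkpoint after the full 100B tokens.
CHECKPOINTS = {
    "2B": 2, "10B": 10, "21B": 21, "31B": 31, "40B": 40, "50B": 50,
    "61B": 61, "71B": 71, "80B": 80, "90B": 90, "main": 100,
}


def nearest_revision(target_billion_tokens: float) -> str:
    """Return the revision whose token count is closest to the target.

    On a tie, the earlier (smaller) checkpoint is returned.
    """
    return min(
        CHECKPOINTS,
        key=lambda name: abs(CHECKPOINTS[name] - target_billion_tokens),
    )


print(nearest_revision(33))
```

The returned name can be passed directly as the revision argument shown above.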

Cite us

@article{oepen2025hplt,
  title={HPLT~3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models},
  author={Oepen, Stephan and Arefev, Nikolay and Aulamo, Mikko and Ba{\~n}{\'o}n, Marta and Buljan, Maja and Burchell, Laurie and Charpentier, Lucas and Chen, Pinzhen and Fedorova, Mariya and de Gibert, Ona and others},
  journal={arXiv preprint arXiv:2511.01066},
  year={2025}
}

