KORMo pretraining datasets

updated Oct 13, 2025

The pretraining datasets for KORMo-10B were collected from diverse, publicly available sources.

KORMo-Team/dclm-baseline-filtered

Preview • Updated Sep 14, 2025 • 4.38k • 1
KORMo-Team/korean-web-collection

Preview • Updated Sep 14, 2025 • 1.12k • 1

Note: Stage 1 Pretraining Datasets
English
- KORMo-Team/dclm-baseline-filtered (~1000B Tokens)
Korean
- KORMo-Team/korean-web-collection (~42.5B Tokens)

KORMo-Team/UltraFineWeb-filtered

Preview • Updated Sep 28, 2025 • 6.61k • 2
HuggingFaceTB/stack-edu

Viewer • Updated Mar 20, 2025 • 167M • 2.43k • 60
HuggingFaceTB/smollm-corpus

Viewer • Updated Sep 6, 2024 • 237M • 16.6k • 408
HuggingFaceTB/finemath

Viewer • Updated Feb 6, 2025 • 48.3M • 11.3k • 345
HuggingFaceTB/cosmopedia

Viewer • Updated Aug 12, 2024 • 31.1M • 44.8k • 650
nvidia/OpenCodeReasoning

Viewer • Updated May 4, 2025 • 753k • 3.75k • 520
nvidia/OpenMathReasoning

Viewer • Updated May 27, 2025 • 5.68M • 14.1k • 390

Note: Stage 2 Pretraining Datasets
English
- KORMo-Team/UltraFineWeb-english-filtered (~793B Tokens)
- HuggingFaceTB/finemath (~37.3B Tokens, sampling)
- HuggingFaceTB/stack-edu & HuggingFaceTB/smollm-corpus (python-edu) (~152B Tokens, sampling)
- HuggingFaceTB/cosmopedia (~25B Tokens)
- nvidia/OpenCodeReasoning (~0.65B Tokens, sampling)
- nvidia/OpenMathReasoning (~3.19B Tokens, sampling)

KORMo-Team/korean-public-corpus

Preview • Updated Sep 27, 2025 • 608
KORMo-Team/FineWeb2-ko-synth

Preview • Updated Oct 13, 2025 • 617
KORMo-Team/Cosmopedia-ko-synth

Preview • Updated Oct 13, 2025 • 902
KORMo-Team/UltraFineWeb-ko-synth

Preview • Updated Oct 13, 2025 • 1.08k
KORMo-Team/NemoPost-ko-synth

Preview • Updated Oct 13, 2025 • 408 • 1

Note: Stage 2 Pretraining Datasets
Korean
- KORMo-Team/korean-public-corpus (~5.57B Tokens)
- KORMo-Team/FineWeb2-ko-synth (~10.97B Tokens)
- KORMo-Team/Cosmopedia-ko-synth (~4.07B Tokens)
- KORMo-Team/UltraFineWeb-ko-synth (~41.69B Tokens)
- KORMo-Team/NemoPost-ko-synth (~7.05B Tokens)
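
The approximate token counts in the notes can be summed to see how each stage splits between English and Korean. The following is a minimal sketch, not part of the KORMo tooling: the dictionary layout and the helper names `language_totals` and `language_shares` are assumptions made for illustration, and the figures are the rounded values from the notes (in billions of tokens), so the results are only as precise as those estimates.

```python
# Approximate token budgets in billions, copied from the notes on this page.
# The structure and helper names below are illustrative assumptions.
STAGE1 = {
    "English": {"KORMo-Team/dclm-baseline-filtered": 1000.0},
    "Korean": {"KORMo-Team/korean-web-collection": 42.5},
}

STAGE2 = {
    "English": {
        "KORMo-Team/UltraFineWeb-english-filtered": 793.0,
        "HuggingFaceTB/finemath": 37.3,
        # The note reports stack-edu and smollm-corpus (python-edu) as one figure.
        "HuggingFaceTB/stack-edu + smollm-corpus (python-edu)": 152.0,
        "HuggingFaceTB/cosmopedia": 25.0,
        "nvidia/OpenCodeReasoning": 0.65,
        "nvidia/OpenMathReasoning": 3.19,
    },
    "Korean": {
        "KORMo-Team/korean-public-corpus": 5.57,
        "KORMo-Team/FineWeb2-ko-synth": 10.97,
        "KORMo-Team/Cosmopedia-ko-synth": 4.07,
        "KORMo-Team/UltraFineWeb-ko-synth": 41.69,
        "KORMo-Team/NemoPost-ko-synth": 7.05,
    },
}


def language_totals(stage):
    """Sum the per-dataset token counts for each language in a stage."""
    return {lang: sum(sources.values()) for lang, sources in stage.items()}


def language_shares(stage):
    """Return each language's fraction of the stage's total tokens."""
    totals = language_totals(stage)
    grand_total = sum(totals.values())
    return {lang: total / grand_total for lang, total in totals.items()}


if __name__ == "__main__":
    for name, stage in (("Stage 1", STAGE1), ("Stage 2", STAGE2)):
        totals = language_totals(stage)
        shares = language_shares(stage)
        for lang, total in totals.items():
            print(f"{name} {lang}: ~{total:.2f}B tokens ({shares[lang]:.1%})")
```

Because several Stage 2 entries are marked as sampled, these sums describe the stated token budgets rather than the full size of the upstream datasets.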