KORMo pretraining datasets
The pretraining datasets for KORMo-10B were collected from diverse, publicly available sources.
- KORMo-Team/dclm-baseline-filtered
- KORMo-Team/korean-web-collection

Note: Stage 1 Pretraining Datasets
English
- KORMo-Team/dclm-baseline-filtered (~1000B Tokens)
Korean
- KORMo-Team/korean-web-collection (~42.5B Tokens)

- KORMo-Team/UltraFineWeb-filtered
- HuggingFaceTB/stack-edu
- HuggingFaceTB/smollm-corpus
- HuggingFaceTB/finemath
- HuggingFaceTB/cosmopedia
- nvidia/OpenCodeReasoning
- nvidia/OpenMathReasoning

Note: Stage 2 Pretraining Datasets
English
- KORMo-Team/UltraFineWeb-english-filtered (~793B Tokens)
- HuggingFaceTB/finemath (~37.3B Tokens, sampling)
- HuggingFaceTB/stack-edu & HuggingFaceTB/smollm-corpus (python-edu) (~152B Tokens, sampling)
- HuggingFaceTB/cosmopedia (~25B Tokens)
- nvidia/OpenCodeReasoning (~0.65B Tokens, sampling)
- nvidia/OpenMathReasoning (~3.19B Tokens, sampling)

- KORMo-Team/korean-public-corpus
- KORMo-Team/FineWeb2-ko-synth
- KORMo-Team/Cosmopedia-ko-synth
- KORMo-Team/UltraFineWeb-ko-synth
- KORMo-Team/NemoPost-ko-synth

Note: Stage 2 Pretraining Datasets
Korean
- KORMo-Team/korean-public-corpus (~5.57B Tokens)
- KORMo-Team/FineWeb2-ko-synth (~10.97B Tokens)
- KORMo-Team/Cosmopedia-ko-synth (~4.07B Tokens)
- KORMo-Team/UltraFineWeb-ko-synth (~41.69B Tokens)
- KORMo-Team/NemoPost-ko-synth (~7.05B Tokens)
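
The approximate token counts in the notes can be added up to see how the budget splits between English and Korean in each stage. The following is a minimal sketch of that arithmetic only: the figures are copied from the notes, while the table layout and the helper names `totals` and `korean_share` are illustrative assumptions and not part of any KORMo tooling.

```python
# Approximate token counts in billions, copied from the Stage 1 / Stage 2 notes.
# The dictionary layout and helper names are hypothetical, for illustration only.
TOKENS_B = {
    ("stage1", "en"): {
        "KORMo-Team/dclm-baseline-filtered": 1000.0,
    },
    ("stage1", "ko"): {
        "KORMo-Team/korean-web-collection": 42.5,
    },
    ("stage2", "en"): {
        "KORMo-Team/UltraFineWeb-english-filtered": 793.0,
        "HuggingFaceTB/finemath": 37.3,
        "HuggingFaceTB/stack-edu + HuggingFaceTB/smollm-corpus (python-edu)": 152.0,
        "HuggingFaceTB/cosmopedia": 25.0,
        "nvidia/OpenCodeReasoning": 0.65,
        "nvidia/OpenMathReasoning": 3.19,
    },
    ("stage2", "ko"): {
        "KORMo-Team/korean-public-corpus": 5.57,
        "KORMo-Team/FineWeb2-ko-synth": 10.97,
        "KORMo-Team/Cosmopedia-ko-synth": 4.07,
        "KORMo-Team/UltraFineWeb-ko-synth": 41.69,
        "KORMo-Team/NemoPost-ko-synth": 7.05,
    },
}


def totals(table):
    """Sum the per-dataset counts for every (stage, language) group."""
    return {key: sum(parts.values()) for key, parts in table.items()}


def korean_share(table, stage):
    """Fraction of a stage's listed tokens that are Korean."""
    t = totals(table)
    ko = t[(stage, "ko")]
    en = t[(stage, "en")]
    return ko / (ko + en)


if __name__ == "__main__":
    for (stage, lang), total in sorted(totals(TOKENS_B).items()):
        print(f"{stage} {lang}: ~{total:.2f}B tokens")
    for stage in ("stage1", "stage2"):
        print(f"{stage} Korean share: {korean_share(TOKENS_B, stage):.1%}")
```

Because the listed counts are approximate and several Stage 2 sources are marked as sampled, the resulting totals are rough estimates of the listed budget rather than exact training-token counts.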