Update README.md

README.md (CHANGED)

@@ -32,16 +32,16 @@ language:

 SmolLM3 is a 3B parameter language model designed to push the boundaries of small models. It supports 6 languages, advanced reasoning and long context. SmolLM3 is a fully open model that offers strong performance at the 3B–4B scale.

-The model is a decoder-only transformer using GQA and NoPE; it was trained on 11.2T tokens with a staged curriculum of web, code, math and reasoning data.
+The model is a decoder-only transformer using GQA and NoPE; it was trained on 11.2T tokens with a staged curriculum of web, code, math and reasoning data. Post-training included mid-training on 100B reasoning tokens, followed by supervised fine-tuning and alignment via Anchored Preference Optimization.

 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/zy0dqTCCt5IHmuzwoqtJ9.png)

 ### Key features
-- **Long context:** Trained on 64k context and supports up to **128k tokens** using YaRN extrapolation
-- **Multilingual**: 6 natively supported languages (English, French, Spanish, German, Italian, and Portuguese)
 - Instruct model optimized for **hybrid reasoning**
 - **Fully open model**: open weights + full training details including public data mixture and training configs
+- **Long context:** Trained on 64k context and supports up to **128k tokens** using YaRN extrapolation
+- **Multilingual**: 6 natively supported languages (English, French, Spanish, German, Italian, and Portuguese)

 For more details refer to our blog post: TODO

@@ -184,7 +184,8 @@ SmolLM3 can produce text on a variety of topics, but the generated content may n
 - **Training framework:** [nanotron](https://github.com/huggingface/nanotron/tree/main)
 - **Data processing framework:** [datatrove](https://github.com/huggingface/datatrove)
 - **Evaluation framework:** [lighteval](https://github.com/huggingface/lighteval)
+- **Post-training framework:** [TRL](https://github.com/huggingface/trl)

 ### Open resources
 Here is an infographic with all the training details [TODO].
 - The datasets used for pretraining can be found in this [collection](https://huggingface.co/collections/HuggingFaceTB/smollm3-pretraining-datasets-685a7353fdc01aecde51b1d9) and those used in mid-training and post-training can be found here [TODO]
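
The Key features above mention long-context support via YaRN extrapolation beyond the 64k training context. Below is a minimal sketch of how that could be enabled on the user side with `transformers`; the checkpoint name and the `rope_scaling` values (a 2x factor over a 64k base) are illustrative assumptions, not settings taken from this README.

```python
# Minimal sketch: extending the context window with YaRN via transformers.
# The rope_scaling values below are illustrative assumptions, not official settings.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"  # assumed checkpoint name

config = AutoConfig.from_pretrained(model_id)
# YaRN extrapolation: scale the 64k training context by 2x to reach ~128k tokens.
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 2.0,
    "original_max_position_embeddings": 65536,
}

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, config=config, device_map="auto")

inputs = tokenizer("A very long document ...", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```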
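The features also list an instruct model optimized for hybrid reasoning. Models of this kind typically expose a chat-template switch for the extended thinking mode; the sketch below assumes an `enable_thinking` flag (as used by other hybrid-reasoning models) and is not an official usage snippet.

```python
# Minimal sketch: toggling reasoning mode through the chat template.
# `enable_thinking` is an assumed template variable; extra keyword arguments to
# apply_chat_template are forwarded to the chat template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "What is 17 * 24?"}]

for thinking in (True, False):
    prompt = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=False,
        enable_thinking=thinking,  # hypothetical flag for hybrid reasoning
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256)
    print(f"thinking={thinking}:")
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```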
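The second hunk adds TRL as the post-training framework, matching the mention of alignment via Anchored Preference Optimization. The sketch below is purely illustrative and assumes TRL's DPO trainer exposes an APO-style loss variant; the dataset, hyperparameters, and the `loss_type="apo_zero"` option name are placeholders and assumptions rather than the actual SmolLM3 recipe.

```python
# Rough sketch of preference alignment with TRL; names and values are placeholders.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "HuggingFaceTB/SmolLM3-3B"  # assumed checkpoint name
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Placeholder preference dataset with "prompt"/"chosen"/"rejected" style columns.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

config = DPOConfig(
    output_dir="smollm3-apo",
    loss_type="apo_zero",  # assumed TRL loss variant for Anchored Preference Optimization
    beta=0.1,
    per_device_train_batch_size=2,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```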