SmolLM3-3B_SFT_subset-fineweb-edu
Model Description
This model is a fine-tuned version of the SmolLM3-3B base model, trained with Supervised Fine-Tuning (SFT) on a curated subset of FineWeb-Edu, a high-quality, English-language educational web corpus derived from CommonCrawl and filtered for educational content using an LLM-based quality classifier. The data was streamed rather than downloaded in full. The fine-tuning focuses on improving the model's ability to generate coherent, informative responses on educational topics (e.g., science, history, math explanations). I used LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning, training only ~30M parameters while keeping the full 3B base model frozen.
- Base Model: HuggingFaceTB/SmolLM3-3B
- Fine-Tuning Method: LoRA + Causal Language Modeling
- Dataset Subset: ~20,000 examples streamed from the FineWeb-Edu train split
- Hardware: 3x NVIDIA A6000 GPUs
- Precision: bfloat16 (bf16)
- Total Trainable Parameters: 30,228,480 (~1% of base model)
Training Data
I streamed a subset of the FineWeb-Edu train split using Hugging Face Datasets with streaming=True to avoid downloading the full dataset (~hundreds of GB across 2,410 shards).
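Below is a minimal sketch of that streaming setup. It assumes the public HuggingFaceFW/fineweb-edu repository and its default config; the exact arguments used for this model may have differed.

```python
# Minimal sketch: stream a FineWeb-Edu subset instead of downloading it all.
from itertools import islice
from datasets import load_dataset

stream = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="default",        # assumed config; smaller sample configs also exist
    split="train",
    streaming=True,        # iterate shard-by-shard, no full download
)
raw_examples = list(islice(stream, 26_000))  # roughly the raw count processed here
```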
Data Preparation
- Source: High-quality educational web texts (e.g., articles, tutorials from CommonCrawl).
- Filtering:
- Text length > 20 characters (discard very short snippets).
- Post-tokenization: 10 < sequence length < 1024 tokens (discard too short/long for efficient training).
- Subset Size: Processed ~26,000 raw examples, kept ~20,000 after filtering (retention rate ~77%).
- Tokenization: Using SmolLM3 tokenizer, truncated to max_length=1024, right-padded to fixed 1024 for batching.
- No Additional Preprocessing: Raw text is used as-is for the causal LM objective (labels = input_ids, with pad tokens masked). A short preprocessing sketch follows this list.
- Full dataset stats (for reference): ~10B documents, ~1.3T tokens, English-only.
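The sketch below mirrors the filtering and tokenization rules listed above; the helper name and structure are illustrative rather than the exact training script.

```python
# Sketch of the length filtering and tokenization described above,
# assuming the SmolLM3 tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

def prepare(example):
    text = example["text"]
    if len(text) <= 20:                       # drop very short snippets
        return None
    enc = tokenizer(text, truncation=True, max_length=1024,
                    padding="max_length")     # right-pad to a fixed 1024 tokens
    real_len = sum(enc["attention_mask"])     # token count before padding
    if not (10 < real_len < 1024):            # post-tokenization length filter
        return None
    enc["labels"] = enc["input_ids"].copy()   # causal LM: labels = input_ids
    return enc
```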
Training Procedure
LoRA Configuration
- Task Type: CAUSAL_LM
- Rank (r): 16
- Alpha: 32
- Dropout: 0.1
- Target Modules:
["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
Training Hyperparameters
| Parameter | Value | Notes |
|---|---|---|
| Optimizer | adamw_torch | |
| Learning Rate | 5e-5 | |
| Weight Decay | 0.01 | |
| LR Scheduler | cosine | |
| Warmup Steps | 100 | |
| Max Gradient Norm | 1.0 | Clipping |
| Num Epochs | 1 | Single pass over subset |
| Per-Device Batch Size | 4 | 12 sequences per step across 3 GPUs |
| Gradient Accumulation | 4 | Effective batch = 48 |
| Total Steps | ~1,250 | ≈ 20k examples / (per-device batch 4 × grad accumulation 4) |
| Logging Steps | 50 | |
| Save Steps | 1,000 | Max 2 checkpoints |
| Mixed Precision | bf16=True | fp16=False |
| Seed | 42 | Reproducibility |
| Dataloader Workers | 0 | |
| Remove Unused Columns | False | |
- Data Collator: `DataCollatorForLanguageModeling(mlm=False, pad_to_multiple_of=8)` for causal LM loss (pad tokens ignored via -100 labels).
- Loss Function: Cross-entropy on non-pad tokens.
- Training Time: ~3 hours (estimated for 3x A100/H100-class GPUs; actual time depends on hardware).
- Output Dir: `./outputs_simple_streaming` (checkpoints); the final merged model is saved to `./final_streaming_model`. A sketch of the full training setup follows below.
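The following sketch shows a Trainer setup matching the table and bullets above; `peft_model`, `tokenizer`, and `tokenized_subset` are assumed to come from the earlier steps, and the author's actual script may differ.

```python
from transformers import (DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False, pad_to_multiple_of=8
)

training_args = TrainingArguments(
    output_dir="./outputs_simple_streaming",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    optim="adamw_torch",
    learning_rate=5e-5,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    max_grad_norm=1.0,
    bf16=True,
    fp16=False,
    logging_steps=50,
    save_steps=1000,
    save_total_limit=2,
    seed=42,
    dataloader_num_workers=0,
    remove_unused_columns=False,
    eval_strategy="no",
)

trainer = Trainer(
    model=peft_model,            # LoRA-wrapped SmolLM3-3B from the config above
    args=training_args,
    train_dataset=tokenized_subset,
    data_collator=collator,
)
trainer.train()
```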
Evaluation
No formal evaluation was performed during training (eval_strategy="no"). For quick checks:
- Compute perplexity on held-out FineWeb-Edu samples (a sketch follows below).
- Human eval: compare generations on educational prompts against the base model.

Example perplexity (hypothetical; run your own):
- Base: ~10.5 ppl on a test set.
- Fine-tuned: ~9.8 ppl (slight improvement).
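A quick perplexity check could look like the sketch below. Since FineWeb-Edu publishes only a train split, the "held-out" samples here are simply freshly streamed documents; swap `model_id` for the base model to compare.

```python
import math
from itertools import islice

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "safouaneelg/SmolLM3-3B_SFT_subset-fineweb-edu"  # or HuggingFaceTB/SmolLM3-3B
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

stream = load_dataset("HuggingFaceFW/fineweb-edu", name="default",
                      split="train", streaming=True)

losses = []
with torch.no_grad():
    for example in islice(stream, 32):  # small sample for a quick check
        enc = tokenizer(example["text"], return_tensors="pt",
                        truncation=True, max_length=1024).to(model.device)
        loss = model(**enc, labels=enc["input_ids"]).loss
        losses.append(loss.item())

print(f"perplexity: {math.exp(sum(losses) / len(losses)):.2f}")
```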
How to Use
Installation
```bash
pip install transformers torch accelerate peft
```
Inference with Transformers
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_id = "safouaneelg/SmolLM3-3B_SFT_subset-fineweb-edu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Pipeline for easy generation (the model is already loaded in bf16 and dispatched above)
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

prompt = "The capital of France is"
outputs = generator(prompt, max_new_tokens=50, do_sample=True, temperature=0.7, top_p=0.9)
print(outputs[0]["generated_text"])
```
Example output (sampled; results will vary between runs): a city of some 3 million inhabitants, located on the banks of the Seine, 200 kilometres (125 miles) upstream from the sea. Paris has been the capital of France since 508 AD and is one of the most famous cities in the world. It is also one of the most important tourist attractions in the world. The city's main sights include the Eiffel Tower, Notre-Dame Cathedral, the Louvre Museum, the Arc de Triomphe, and Montmartre. Paris is home to many famous museums, including the Louvre, the Orsay, and the Rodin. It is also home to many famous art galleries and restaurants. Paris is also known for its fashion, film, and music industries. The city was founded by the Romans in 51 BC and grew to be one of the most important cities in the Roman Empire. The city was also an important centre of commerce and culture. The city was attacked by the Vikings in 845 AD and was burned down in 1081. It was rebuilt in the 12th century and became the centre of the French royal court. In the 16th century, the city was the site of the French Revolution. It was the centre of the French Empire from 1789 to
Fine-Tuning Script Reproduction
The fine-tuning script will be available soon on my GitHub: github.com/safouaneelg
Merging LoRA (if loading adapter separately)
This repo contains the merged full model (non-quantized, bf16). If you have the adapter:
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM3-3B", torch_dtype=torch.bfloat16)
peft_model = PeftModel.from_pretrained(base, "path/to/lora_adapter")  # placeholder adapter path
merged = peft_model.merge_and_unload()
```
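After merging, the result can be saved and reloaded like any standard checkpoint; the path below mirrors the output directory mentioned above and is purely illustrative:

```python
merged.save_pretrained("./final_streaming_model")  # illustrative path
```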
Citation
If using this model, please cite the base:
```bibtex
@misc{smollm3,
  title={SmolLM3: Improved Small Language Models},
  author={Hugging Face Team},
  year={2025},
  url={https://huggingface.co/HuggingFaceTB/SmolLM3-3B}
}
```
Last Updated: September 26, 2025
Author: Safouane El Ghazouali (based on SmolLM3 + FineWeb-Edu)