SmolLM3-3B_SFT_subset-fineweb-edu
Model Description
This model is a fine-tuned version of the SmolLM3-3B base model, trained with Supervised Fine-Tuning (SFT) on a curated subset of FineWeb-Edu, a high-quality, English-language educational web corpus derived from CommonCrawl and filtered for educational content using an LLM-based quality classifier. The data was streamed rather than downloaded in full. The fine-tuning focuses on improving the model's ability to generate coherent, informative responses on educational topics (e.g., science, history, math explanations). I used LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning, training only ~30M parameters while keeping the full 3B base model frozen.
- Base Model: HuggingFaceTB/SmolLM3-3B
- Fine-Tuning Method: LoRA + Causal Language Modeling
- Dataset Subset: ~20,000 examples streamed from the FineWeb-Edu train split
- Hardware: 3x NVIDIA A6000 GPUs
- Precision: bfloat16 (bf16)
- Total Trainable Parameters: 30,228,480 (~1% of base model)
Training Data
I streamed a subset of the FineWeb-Edu train split using Hugging Face Datasets with streaming=True to avoid downloading the full dataset (~hundreds of GB across 2,410 shards).
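Below is a minimal sketch of that streaming setup. It assumes the public HuggingFaceFW/fineweb-edu repository and its default config; the exact arguments used for this model may have differed.

```python
# Minimal sketch: stream a FineWeb-Edu subset instead of downloading it all.
from itertools import islice
from datasets import load_dataset

stream = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="default",        # assumed config; smaller sample configs also exist
    split="train",
    streaming=True,        # iterate shard-by-shard, no full download
)
raw_examples = list(islice(stream, 26_000))  # roughly the raw count processed here
```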
Data Preparation
- Source: High-quality educational web texts (e.g., articles, tutorials from CommonCrawl).
- Filtering:
- Text length > 20 characters (discard very short snippets).
- Post-tokenization: 10 < sequence length < 1024 tokens (discard too short/long for efficient training).
- Subset Size: Processed ~26,000 raw examples, kept ~20,000 after filtering (retention rate ~77%).
- Tokenization: Using SmolLM3 tokenizer, truncated to max_length=1024, right-padded to fixed 1024 for batching.
- No Additional Preprocessing: Raw text is used as-is for the causal LM objective (labels = input_ids, with pad tokens masked). A short preprocessing sketch follows this list.
- Full dataset stats (for reference): ~10B documents, ~1.3T tokens, English-only.
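The sketch below mirrors the filtering and tokenization rules listed above; the helper name and structure are illustrative rather than the exact training script.

```python
# Sketch of the length filtering and tokenization described above,
# assuming the SmolLM3 tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

def prepare(example):
    text = example["text"]
    if len(text) <= 20:                       # drop very short snippets
        return None
    enc = tokenizer(text, truncation=True, max_length=1024,
                    padding="max_length")     # right-pad to a fixed 1024 tokens
    real_len = sum(enc["attention_mask"])     # token count before padding
    if not (10 < real_len < 1024):            # post-tokenization length filter
        return None
    enc["labels"] = enc["input_ids"].copy()   # causal LM: labels = input_ids
    return enc
```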
Training Procedure
LoRA Configuration
- Task Type: CAUSAL_LM
- Rank (r): 16
- Alpha: 32
- Dropout: 0.1
- Target Modules:
["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
Training Hyperparameters
| Parameter | Value | Notes |
|---|---|---|
| Optimizer | adamw_torch | |
| Learning Rate | 5e-5 | |
| Weight Decay | 0.01 | |
| LR Scheduler | cosine | |
| Warmup Steps | 100 | |
| Max Gradient Norm | 1.0 | Clipping |
| Num Epochs | 1 | Single pass over subset |
| Per-Device Batch Size | 4 | 12 sequences per step across 3 GPUs |
| Gradient Accumulation | 4 | Effective batch = 48 |
| Total Steps | ~1,250 | ≈ 20k examples / (per-device batch 4 × grad accumulation 4) |
| Logging Steps | 50 | |
| Save Steps | 1,000 | Max 2 checkpoints |
| Mixed Precision | bf16=True | fp16=False |
| Seed | 42 | Reproducibility |
| Dataloader Workers | 0 | |
| Remove Unused Columns | False | |
- Data Collator: `DataCollatorForLanguageModeling(mlm=False, pad_to_multiple_of=8)` for causal LM loss (pad tokens ignored via -100 labels).
- Loss Function: Cross-entropy on non-pad tokens.
- Training Time: ~3 hours (estimated for 3x A100/H100-class GPUs; actual time depends on hardware).
- Output Dir: `./outputs_simple_streaming` (checkpoints); the final merged model is saved to `./final_streaming_model`. A sketch of the full training setup follows below.
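The following sketch shows a Trainer setup matching the table and bullets above; `peft_model`, `tokenizer`, and `tokenized_subset` are assumed to come from the earlier steps, and the author's actual script may differ.

```python
from transformers import (DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False, pad_to_multiple_of=8
)

training_args = TrainingArguments(
    output_dir="./outputs_simple_streaming",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    optim="adamw_torch",
    learning_rate=5e-5,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    max_grad_norm=1.0,
    bf16=True,
    fp16=False,
    logging_steps=50,
    save_steps=1000,
    save_total_limit=2,
    seed=42,
    dataloader_num_workers=0,
    remove_unused_columns=False,
    eval_strategy="no",
)

trainer = Trainer(
    model=peft_model,            # LoRA-wrapped SmolLM3-3B from the config above
    args=training_args,
    train_dataset=tokenized_subset,
    data_collator=collator,
)
trainer.train()
```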
Evaluation
No formal evaluation was performed during training (eval_strategy="no"). For quick checks:
- Compute perplexity on held-out FineWeb-Edu samples (a sketch follows below).
- Human eval: compare generations on educational prompts against the base model.

Example perplexity (hypothetical; run your own):
- Base: ~10.5 ppl on a test set.
- Fine-tuned: ~9.8 ppl (slight improvement).
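A quick perplexity check could look like the sketch below. Since FineWeb-Edu publishes only a train split, the "held-out" samples here are simply freshly streamed documents; swap `model_id` for the base model to compare.

```python
import math
from itertools import islice

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "safouaneelg/SmolLM3-3B_SFT_subset-fineweb-edu"  # or HuggingFaceTB/SmolLM3-3B
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

stream = load_dataset("HuggingFaceFW/fineweb-edu", name="default",
                      split="train", streaming=True)

losses = []
with torch.no_grad():
    for example in islice(stream, 32):  # small sample for a quick check
        enc = tokenizer(example["text"], return_tensors="pt",
                        truncation=True, max_length=1024).to(model.device)
        loss = model(**enc, labels=enc["input_ids"]).loss
        losses.append(loss.item())

print(f"perplexity: {math.exp(sum(losses) / len(losses)):.2f}")
```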
How to Use
Installation
```bash
pip install transformers torch accelerate peft
```
Inference with Transformers
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_id = "safouaneelg/SmolLM3-3B_SFT_subset-fineweb-edu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Pipeline for easy generation (the model is already loaded in bf16 and dispatched above)
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

prompt = "The capital of France is"
outputs = generator(prompt, max_new_tokens=50, do_sample=True, temperature=0.7, top_p=0.9)
print(outputs[0]["generated_text"])
```
Example output (sampled; results will vary between runs): a city of some 3 million inhabitants, located on the banks of the Seine, 200 kilometres (125 miles) upstream from the sea. Paris has been the capital of France since 508 AD and is one of the most famous cities in the world. It is also one of the most important tourist attractions in the world. The city's main sights include the Eiffel Tower, Notre-Dame Cathedral, the Louvre Museum, the Arc de Triomphe, and Montmartre. Paris is home to many famous museums, including the Louvre, the Orsay, and the Rodin. It is also home to many famous art galleries and restaurants. Paris is also known for its fashion, film, and music industries. The city was founded by the Romans in 51 BC and grew to be one of the most important cities in the Roman Empire. The city was also an important centre of commerce and culture. The city was attacked by the Vikings in 845 AD and was burned down in 1081. It was rebuilt in the 12th century and became the centre of the French royal court. In the 16th century, the city was the site of the French Revolution. It was the centre of the French Empire from 1789 to
Fine-Tuning Script Reproduction
The fine-tuning script will be available soon on my GitHub: github.com/safouaneelg
Merging LoRA (if loading adapter separately)
This repo contains the merged full model (non-quantized, bf16). If you have the adapter:
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM3-3B", torch_dtype=torch.bfloat16)
peft_model = PeftModel.from_pretrained(base, "path/to/lora_adapter")  # placeholder adapter path
merged = peft_model.merge_and_unload()
```
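After merging, the result can be saved and reloaded like any standard checkpoint; the path below mirrors the output directory mentioned above and is purely illustrative:

```python
merged.save_pretrained("./final_streaming_model")  # illustrative path
```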
Citation
If using this model, please cite the base:
```bibtex
@misc{smollm3,
  title={SmolLM3: Improved Small Language Models},
  author={Hugging Face Team},
  year={2025},
  url={https://huggingface.co/HuggingFaceTB/SmolLM3-3B}
}
```
Last Updated: September 26, 2025
Author: Safouane El Ghazouali (based on SmolLM3 + FineWeb-Edu)