nanochat-d10-filtered-500m

d10 model trained on 500M tokens of quality-filtered Common Crawl (score ≥ 0.7)

Model Description

This is a d10 model (~100M parameters) trained as part of a research project investigating the impact of training data quality on LLM performance.

  • Architecture: 10-layer transformer with 640 hidden dimensions
  • Training framework: nanochat
  • Base tokenizer: BPE with 65K vocab
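
For reference, the parameter count can be checked directly from the configuration. This is a quick sketch using the same nanochat classes shown in the Usage section below; the exact total depends on implementation details such as whether the token embedding and output head share weights.

from nanochat.gpt import GPT, GPTConfig

# Instantiate the d10 configuration listed on this card and count parameters
config = GPTConfig(sequence_len=512, vocab_size=65536, n_layer=10,
                   n_head=10, n_kv_head=10, n_embd=640)
model = GPT(config)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # the card cites ~100M for this config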

Training Details

Dataset

  • Source: Common Crawl (quality-filtered)
  • Size: 500M tokens
  • Quality filtering: quality score ≥ 0.7, as assigned by the nanochat-d32 auditor (see the sketch below)
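
Conceptually, the filtering step is a threshold over per-document quality scores. The sketch below is illustrative only: the JSONL layout and the quality_score field name are assumptions, not the actual Oren pipeline format.

import json

# Hypothetical input: one scored Common Crawl document per JSONL line,
# with a "quality_score" field produced by the nanochat-d32 auditor.
THRESHOLD = 0.7

def filter_corpus(in_path: str, out_path: str) -> None:
    kept = total = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            total += 1
            if json.loads(line)["quality_score"] >= THRESHOLD:  # keep score >= 0.7
                fout.write(line)
                kept += 1
    print(f"kept {kept}/{total} documents ({kept / total:.0%})")

# filter_corpus("cc_scored.jsonl", "cc_filtered.jsonl")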

Hyperparameters

  • Iterations: 10,000
  • Batch size: 32
  • Sequence length: 512
  • Learning rate: 6e-4
  • Training time: 2.7 hours
  • Hardware: 2× NVIDIA RTX 6000 Ada

Training Results

  • Final training loss: 4.4375
  • Average throughput: 73,692 tokens/sec

Research Context

This model is part of Phase 2 of the Oren project, which validates the hypothesis:

"Quality-filtered training data enables smaller, more efficient models with comparable performance."

Experiment Setup

I trained two models with identical architecture and hyperparameters:

  • Model A (companion model): Trained on raw, unfiltered data
  • Model B (this model): Trained on quality-filtered data (the top 70% by quality score)

Key Finding: Model B achieved comparable performance (4.44 vs. 4.38 final training loss) with 29% less training data and 29% less training time.

Usage

import torch
from nanochat.gpt import GPT, GPTConfig
from nanochat.tokenizer import get_tokenizer
from nanochat.engine import Engine

# Load the checkpoint (a plain state_dict) on CPU
checkpoint = torch.load("pytorch_model.bin", map_location="cpu")

# Model configuration matching the d10 architecture described above
config = GPTConfig(
    sequence_len=512,
    vocab_size=65536,
    n_layer=10,
    n_head=10,
    n_kv_head=10,
    n_embd=640,
)

model = GPT(config)
model.load_state_dict(checkpoint)
model.eval()

# Generate text
tokenizer = get_tokenizer()
engine = Engine(model, tokenizer)

prompt_tokens = tokenizer("The capital of France is", prepend="<|bos|>")
output, _ = engine.generate_batch(prompt_tokens, max_tokens=50, temperature=0.7)
print(tokenizer.decode(output[0]))

Limitations

  • Small model: ~100M parameters - not suitable for complex reasoning
  • Limited training: Only 500M tokens of training data
  • No instruction tuning: This is a base model, not aligned for chat
  • Research artifact: Trained to validate data quality hypothesis, not for production use

Ethical Considerations

  • Trained on Common Crawl data, which may contain biases
  • Should not be used for critical applications without further evaluation
  • May generate offensive or incorrect content

Citation

If you use this model, please cite:

@software{oren2025,
  title={Oren: Quality Auditing for LLM Training Data},
  author={Amir Valizadeh},
  year={2025},
  url={https://github.com/vitalune/Oren}
}

License

MIT License - See LICENSE file for details
