nanochat-d10-filtered-500m

d10 model trained on 500M tokens of quality-filtered Common Crawl (score ≥ 0.7)

Model Description

This is a d10 model (~100M parameters) trained as part of a research project investigating the impact of training data quality on LLM performance.

  • Architecture: 10-layer transformer with 640 hidden dimensions
  • Training framework: nanochat
  • Base tokenizer: BPE with 65K vocab
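
For reference, the parameter count can be checked directly from the configuration. This is a quick sketch using the same nanochat classes shown in the Usage section below; the exact total depends on implementation details such as whether the token embedding and output head share weights.

from nanochat.gpt import GPT, GPTConfig

# Instantiate the d10 configuration listed on this card and count parameters
config = GPTConfig(sequence_len=512, vocab_size=65536, n_layer=10,
                   n_head=10, n_kv_head=10, n_embd=640)
model = GPT(config)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # the card cites ~100M for this config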

Training Details

Dataset

  • Source: Common Crawl (quality-filtered)
  • Size: 500M tokens
  • Quality filtering: quality score ≥ 0.7, as assigned by the nanochat-d32 auditor (see the sketch below)
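
Conceptually, the filtering step is a threshold over per-document quality scores. The sketch below is illustrative only: the JSONL layout and the quality_score field name are assumptions, not the actual Oren pipeline format.

import json

# Hypothetical input: one scored Common Crawl document per JSONL line,
# with a "quality_score" field produced by the nanochat-d32 auditor.
THRESHOLD = 0.7

def filter_corpus(in_path: str, out_path: str) -> None:
    kept = total = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            total += 1
            if json.loads(line)["quality_score"] >= THRESHOLD:  # keep score >= 0.7
                fout.write(line)
                kept += 1
    print(f"kept {kept}/{total} documents ({kept / total:.0%})")

# filter_corpus("cc_scored.jsonl", "cc_filtered.jsonl")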

Hyperparameters

  • Iterations: 10,000
  • Batch size: 32
  • Sequence length: 512
  • Learning rate: 6e-4
  • Training time: 2.7 hours
  • Hardware: 2× NVIDIA RTX 6000 Ada

Training Results

  • Final training loss: 4.4375
  • Average throughput: 73,692 tokens/sec

Research Context

This model is part of Phase 2 of the Oren project, which validates the hypothesis:

"Quality-filtered training data enables smaller, more efficient models with comparable performance."

Experiment Setup

I trained two models with identical architecture and hyperparameters:

  • Model A (companion model): Trained on raw, unfiltered data
  • Model B (this model): Trained on quality-filtered data (the top 70% by quality score)

Key Finding: Model B achieved comparable performance (4.44 vs. 4.38 final training loss) with 29% less training data and 29% less training time.

Usage

import torch
from nanochat.gpt import GPT, GPTConfig
from nanochat.tokenizer import get_tokenizer
from nanochat.engine import Engine

# Load the checkpoint (a plain state_dict) on CPU
checkpoint = torch.load("pytorch_model.bin", map_location="cpu")

# Model configuration matching the d10 architecture described above
config = GPTConfig(
    sequence_len=512,
    vocab_size=65536,
    n_layer=10,
    n_head=10,
    n_kv_head=10,
    n_embd=640,
)

model = GPT(config)
model.load_state_dict(checkpoint)
model.eval()

# Generate text
tokenizer = get_tokenizer()
engine = Engine(model, tokenizer)

prompt_tokens = tokenizer("The capital of France is", prepend="<|bos|>")
output, _ = engine.generate_batch(prompt_tokens, max_tokens=50, temperature=0.7)
print(tokenizer.decode(output[0]))

Limitations

  • Small model: ~100M parameters - not suitable for complex reasoning
  • Limited training: Only 500M tokens of training data
  • No instruction tuning: This is a base model, not aligned for chat
  • Research artifact: Trained to validate data quality hypothesis, not for production use

Ethical Considerations

  • Trained on Common Crawl data, which may contain biases
  • Should not be used for critical applications without further evaluation
  • May generate offensive or incorrect content

Citation

If you use this model, please cite:

@software{oren2025,
  title={Oren: Quality Auditing for LLM Training Data},
  author={Amir Valizadeh},
  year={2025},
  url={https://github.com/vitalune/Oren}
}

License

MIT License - See LICENSE file for details
