tinymathstories-dense-130m
A dense (non-MoE) transformer language model trained on TinyMathStories — an experimental first step in exploring whether small language models can learn coherent storytelling alongside mathematical reasoning.
⚠️ Experimental Release: This model represents an initial proof-of-concept. It serves as a baseline for understanding how tiny models (≈100M parameters) handle the intersection of natural language and mathematical operations. More comprehensive experiments, scaled datasets, and rigorous evaluations are planned for future iterations.
Research Objective
This model investigates a core question: Can resource-constrained language models learn to produce coherent English narratives while simultaneously performing basic mathematical reasoning?
Inspired by the TinyStories work (Eldan & Li, 2023), which demonstrated that small models can generate surprisingly coherent text when trained on appropriately scoped data, this project extends that hypothesis to mathematical domains. Specifically, we aim to:
- Train models with ≈100M parameters to handle grade-school math (arithmetic, fractions, measurements, simple equations) embedded naturally in stories
- Evaluate whether narrative context aids mathematical reasoning in small models
- Establish baseline architectures and training procedures for future scaling experiments
- Understand the minimal model capacity needed for coherent math-in-language tasks
This 130M-parameter model is intentionally small-scale to validate the approach before investing in larger models and datasets.
Model Details
- Model Type: Causal Language Model (Dense Transformer)
- Architecture: Modern efficient transformer with GQA, RoPE, RMSNorm, SwiGLU
- Parameters: ~130M
- Training Dataset: TinyMathStories v1 (~4B tokens over 5 epochs)
- Tokenizer: openai/gpt-oss-20b (vocab_size=200019)
- Training Objective: Next-token prediction on stories with embedded mathematical reasoning
Architecture Details
- Layers: 8
- Model Dimension: 512
- Attention Heads: 8 (Query) / 4 (Key/Value)
- FFN Hidden Dim: 2048
- Context Length: 1024 tokens
- Vocab Size: 200019
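The reported parameter count is consistent with these dimensions. Below is a rough back-of-the-envelope estimate, assuming tied input/output embeddings, a three-projection SwiGLU FFN, and bias-free linear layers (none of which are confirmed details of this release):

```python
# Back-of-the-envelope parameter count from the listed dimensions.
# Assumptions (not stated in the release): tied input/output embeddings,
# three-projection SwiGLU FFN, bias-free linear layers, norms ignored.
vocab_size, d_model, n_layers = 200_019, 512, 8
n_q_heads, n_kv_heads, d_ffn = 8, 4, 2048
d_head = d_model // n_q_heads                 # 64

embedding = vocab_size * d_model              # ~102.4M, shared with the LM head if tied
attn = (d_model * d_model                     # query projection
        + 2 * d_model * n_kv_heads * d_head   # key/value projections (GQA)
        + d_model * d_model)                  # output projection
ffn = 3 * d_model * d_ffn                     # SwiGLU gate, up, and down projections

total = embedding + n_layers * (attn + ffn)
print(f"~{total / 1e6:.0f}M parameters")      # ≈ 134M
```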
Architectural Features
- ✅ Grouped Query Attention (GQA): Reduces memory and improves inference efficiency
- ✅ Rotary Position Embeddings (RoPE): Better handling of positional information
- ✅ RMSNorm: Faster and more stable normalization than LayerNorm
- ✅ SwiGLU Activation: Enhanced expressiveness in feedforward networks
- ✅ QK Normalization: Stabilized attention scores during training
- ✅ Parallel Residual Connections: Improved gradient flow and training speed
- ✅ Document-level Masking: Prevents cross-document attention, preserving story boundaries
These architectural choices reflect current best practices for training efficient small-scale transformers.
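As one concrete illustration, RMSNorm drops LayerNorm's mean subtraction and bias term and simply rescales activations by their root mean square. A minimal PyTorch sketch of the idea (not the model's actual implementation):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization: scale by 1/RMS(x); no mean-centering, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-channel gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * inv_rms)
```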
Training Configuration
- Training Epochs: 5
- Final Validation Loss: 1.600
- Total Training Tokens: ~4 Billion
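For context, if the reported loss is the standard mean next-token cross-entropy in nats, it corresponds to a per-token perplexity of roughly:

```python
import math
print(math.exp(1.600))  # ≈ 4.95 perplexity on the validation split
```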
Capabilities & Intended Use
This model is designed for:
- Generating short stories (TinyStories style) with integrated mathematical concepts
- Demonstrating basic arithmetic operations in narrative context
- Performing simple calculations (addition, subtraction, multiplication, division)
- Working with fractions, decimals, percentages, and unit conversions
- Maintaining story coherence while introducing mathematical elements
Primary use case: Research into small language models, mathematical reasoning in narrative contexts, and efficient model architectures.
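A minimal generation sketch using the Hugging Face transformers library, assuming the checkpoint loads through the standard AutoModelForCausalLM / AutoTokenizer interface (the prompt and sampling settings below are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "AlgoDriveAI/tinymathstories-dense-130m"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

prompt = "Mia had 3 apples. Her friend gave her 4 more, so now Mia had"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=128,   # stay well within the 1024-token context
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```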
Limitations & Known Issues
As an initial experimental release, this model has several important limitations:
- Narrow training distribution: Trained exclusively on TinyMathStories; performance degrades significantly on out-of-distribution tasks
- Limited mathematical scope: Handles only basic grade-school math; not suitable for advanced mathematics, multi-step word problems, or complex reasoning chains
- Small context window: 1024-token limit restricts handling of longer narratives or multi-part problems
- No instruction tuning: This is a base language model without safety training, instruction following, or alignment
- Calculation accuracy: May produce plausible-looking but incorrect arithmetic, especially for numbers outside the training distribution
- Evaluation pending: Comprehensive benchmarking on math reasoning tasks (GSM8K, MATH, etc.) has not yet been performed
- Limited generalization: May struggle with mathematical concepts presented differently than in training data
This model is not suitable for:
- Production applications requiring reliable mathematical computation
- Educational tools without human verification of outputs
- Any task requiring factual accuracy or safety guarantees
Evaluation & Benchmarks
Current Status: Initial validation loss reported; comprehensive evaluation in progress.
Planned Evaluations:
- Exact-match accuracy on math problems from training/validation splits
- Perplexity on held-out TinyMathStories samples
- Zero-shot performance on simplified versions of GSM8K, MATH datasets
- Story coherence metrics (automated and human evaluation)
- Ablation studies comparing architectural choices
Results will be published in future model iterations and accompanying research documentation.
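As a starting point for the exact-match evaluation listed above, answers embedded in generated stories can be scored by comparing the last number in the model's completion against a gold answer. The sketch below is hypothetical; the (prompt, answer) format and generate_fn are illustrative assumptions, not part of this release:

```python
import re

def exact_match_numeric(completion: str, gold_answer: str) -> bool:
    """True if the last number in the completion equals the gold numeric answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return bool(numbers) and float(numbers[-1]) == float(gold_answer)

def score(examples, generate_fn):
    """examples: iterable of (prompt, gold_answer) pairs; generate_fn: prompt -> completion."""
    examples = list(examples)
    hits = sum(exact_match_numeric(generate_fn(p), a) for p, a in examples)
    return hits / len(examples) if examples else 0.0
```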
Future Directions
This initial model serves as a foundation for more ambitious research:
- Scaled training: Larger models (300M, 1B+ parameters) and expanded datasets (100K+ stories)
- Improved data quality: More diverse mathematical concepts, better reasoning templates, validated correctness
- Rigorous evaluation: Standardized benchmarks, comparison with baseline models, human evaluation protocols
- Architectural experiments: MoE variants, different attention mechanisms, optimal layer/dimension ratios
- Curriculum learning: Progressive difficulty in mathematical concepts during training
- Multi-task training: Combining pure language modeling, math-in-stories, and formal problem-solving
- Instruction tuning: Fine-tuned versions for interactive problem-solving and tutoring
We welcome community feedback, ablation suggestions, and collaboration on next-generation models.
Reproducibility
Training code and detailed experimental configurations are available in the model repository. We are committed to making this research fully reproducible and will continue to document our training procedures and hyperparameters.
Citation
If you use this model in your research, please cite:
```bibtex
@misc{tinymathstories_dense_130m_2025,
  author       = {AlgoDriveAI},
  title        = {tinymathstories-dense-130m: An Experimental Small Language Model for Math Reasoning in Stories},
  year         = {2025},
  publisher    = {HuggingFace},
  journal      = {HuggingFace Model Hub},
  howpublished = {\url{https://huggingface.co/AlgoDriveAI/tinymathstories-dense-130m}},
  note         = {Initial experimental release}
}
```
Related Work:
```bibtex
@article{eldan2023tinystories,
  title   = {TinyStories: How Small Can Language Models Be and Still Speak Coherent English?},
  author  = {Eldan, Ronen and Li, Yuanzhi},
  journal = {arXiv preprint arXiv:2305.07759},
  year    = {2023}
}
```
```bibtex
@dataset{algodriveai_2025_tinymathstories,
  title  = {TinyMathStories: TinyStories-style Corpus with Math and Reasoning},
  author = {AlgoDriveAI},
  year   = {2025},
  note   = {Hugging Face dataset}
}
```
Maintainer & Contact
AlgoDriveAI — This model represents our first attempt at combining narrative coherence with mathematical reasoning in tiny language models. We're actively seeking:
- Feedback on model performance and failure modes
- Suggestions for evaluation methodologies
- Collaborators for scaling experiments
- Use cases and downstream applications
For questions, issues, or collaboration inquiries, please open an issue on the model repository or contact us through the HuggingFace platform.
Disclaimer: This is a research prototype. Use at your own risk. The model may produce incorrect mathematical calculations, biased content, or incoherent text. Always verify outputs, especially for educational or decision-critical applications.