tinymathstories-dense-130m
A dense (non-MoE) transformer language model trained on TinyMathStories — an experimental first step in exploring whether small language models can learn coherent storytelling alongside mathematical reasoning.
⚠️ Experimental Release: This model represents an initial proof-of-concept. It serves as a baseline for understanding how tiny models (≈100M parameters) handle the intersection of natural language and mathematical operations. More comprehensive experiments, scaled datasets, and rigorous evaluations are planned for future iterations.
Research Objective
This model investigates a core question: Can resource-constrained language models learn to produce coherent English narratives while simultaneously performing basic mathematical reasoning?
Inspired by the TinyStories work (Eldan & Li, 2023), which demonstrated that small models can generate surprisingly coherent text when trained on appropriately scoped data, this project extends that hypothesis to mathematical domains. Specifically, we aim to:
- Train models with ≈100M parameters to handle grade-school math (arithmetic, fractions, measurements, simple equations) embedded naturally in stories
- Evaluate whether narrative context aids mathematical reasoning in small models
- Establish baseline architectures and training procedures for future scaling experiments
- Understand the minimal model capacity needed for coherent math-in-language tasks
This 130M-parameter model is intentionally small-scale to validate the approach before investing in larger models and datasets.
Model Details
- Model Type: Causal Language Model (Dense Transformer)
- Architecture: Modern efficient transformer with GQA, RoPE, RMSNorm, SwiGLU
- Parameters: ~130M
- Training Dataset: TinyMathStories v1 (~4B tokens over 5 epochs)
- Tokenizer: openai/gpt-oss-20b (vocab_size=200019)
- Training Objective: Next-token prediction on stories with embedded mathematical reasoning
Architecture Details
- Layers: 8
- Model Dimension: 512
- Attention Heads: 8 (Query) / 4 (Key/Value)
- FFN Hidden Dim: 2048
- Context Length: 1024 tokens
- Vocab Size: 200019
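The reported parameter count is consistent with these dimensions. Below is a rough back-of-the-envelope estimate, assuming tied input/output embeddings, a three-projection SwiGLU FFN, and bias-free linear layers (none of which are confirmed details of this release):

```python
# Back-of-the-envelope parameter count from the listed dimensions.
# Assumptions (not stated in the release): tied input/output embeddings,
# three-projection SwiGLU FFN, bias-free linear layers, norms ignored.
vocab_size, d_model, n_layers = 200_019, 512, 8
n_q_heads, n_kv_heads, d_ffn = 8, 4, 2048
d_head = d_model // n_q_heads                 # 64

embedding = vocab_size * d_model              # ~102.4M, shared with the LM head if tied
attn = (d_model * d_model                     # query projection
        + 2 * d_model * n_kv_heads * d_head   # key/value projections (GQA)
        + d_model * d_model)                  # output projection
ffn = 3 * d_model * d_ffn                     # SwiGLU gate, up, and down projections

total = embedding + n_layers * (attn + ffn)
print(f"~{total / 1e6:.0f}M parameters")      # ≈ 134M
```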
Architectural Features
- ✅ Grouped Query Attention (GQA): Reduces memory and improves inference efficiency
- ✅ Rotary Position Embeddings (RoPE): Better handling of positional information
- ✅ RMSNorm: Faster and more stable normalization than LayerNorm
- ✅ SwiGLU Activation: Enhanced expressiveness in feedforward networks
- ✅ QK Normalization: Stabilized attention scores during training
- ✅ Parallel Residual Connections: Improved gradient flow and training speed
- ✅ Document-level Masking: Prevents cross-document attention, preserving story boundaries
These architectural choices reflect current best practices for training efficient small-scale transformers.
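As one concrete illustration, RMSNorm drops LayerNorm's mean subtraction and bias term and simply rescales activations by their root mean square. A minimal PyTorch sketch of the idea (not the model's actual implementation):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization: scale by 1/RMS(x); no mean-centering, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-channel gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * inv_rms)
```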
Training Configuration
- Training Epochs: 5
- Final Validation Loss: 1.600
- Total Training Tokens: ~4 Billion
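For context, if the reported loss is the standard mean next-token cross-entropy in nats, it corresponds to a per-token perplexity of roughly:

```python
import math
print(math.exp(1.600))  # ≈ 4.95 perplexity on the validation split
```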
Capabilities & Intended Use
This model is designed for:
- Generating short stories (TinyStories style) with integrated mathematical concepts
- Demonstrating basic arithmetic operations in narrative context
- Performing simple calculations (addition, subtraction, multiplication, division)
- Working with fractions, decimals, percentages, and unit conversions
- Maintaining story coherence while introducing mathematical elements
Primary use case: Research into small language models, mathematical reasoning in narrative contexts, and efficient model architectures.
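A minimal generation sketch using the Hugging Face transformers library, assuming the checkpoint loads through the standard AutoModelForCausalLM / AutoTokenizer interface (the prompt and sampling settings below are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "AlgoDriveAI/tinymathstories-dense-130m"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

prompt = "Mia had 3 apples. Her friend gave her 4 more, so now Mia had"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=128,   # stay well within the 1024-token context
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```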
Limitations & Known Issues
As an initial experimental release, this model has several important limitations:
- Narrow training distribution: Trained exclusively on TinyMathStories; performance degrades significantly on out-of-distribution tasks
- Limited mathematical scope: Handles only basic grade-school math; not suitable for advanced mathematics, multi-step word problems, or complex reasoning chains
- Small context window: 1024-token limit restricts handling of longer narratives or multi-part problems
- No instruction tuning: This is a base language model without safety training, instruction following, or alignment
- Calculation accuracy: May produce plausible-looking but incorrect arithmetic, especially for numbers outside the training distribution
- Evaluation pending: Comprehensive benchmarking on math reasoning tasks (GSM8K, MATH, etc.) has not yet been performed
- Limited generalization: May struggle with mathematical concepts presented differently than in training data
This model is not suitable for:
- Production applications requiring reliable mathematical computation
- Educational tools without human verification of outputs
- Any task requiring factual accuracy or safety guarantees
Evaluation & Benchmarks
Current Status: Initial validation loss reported; comprehensive evaluation in progress.
Planned Evaluations:
- Exact-match accuracy on math problems from training/validation splits
- Perplexity on held-out TinyMathStories samples
- Zero-shot performance on simplified versions of GSM8K, MATH datasets
- Story coherence metrics (automated and human evaluation)
- Ablation studies comparing architectural choices
Results will be published in future model iterations and accompanying research documentation.
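As a starting point for the exact-match evaluation listed above, answers embedded in generated stories can be scored by comparing the last number in the model's completion against a gold answer. The sketch below is hypothetical; the (prompt, answer) format and generate_fn are illustrative assumptions, not part of this release:

```python
import re

def exact_match_numeric(completion: str, gold_answer: str) -> bool:
    """True if the last number in the completion equals the gold numeric answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return bool(numbers) and float(numbers[-1]) == float(gold_answer)

def score(examples, generate_fn):
    """examples: iterable of (prompt, gold_answer) pairs; generate_fn: prompt -> completion."""
    examples = list(examples)
    hits = sum(exact_match_numeric(generate_fn(p), a) for p, a in examples)
    return hits / len(examples) if examples else 0.0
```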
Future Directions
This initial model serves as a foundation for more ambitious research:
- Scaled training: Larger models (300M, 1B+ parameters) and expanded datasets (100K+ stories)
- Improved data quality: More diverse mathematical concepts, better reasoning templates, validated correctness
- Rigorous evaluation: Standardized benchmarks, comparison with baseline models, human evaluation protocols
- Architectural experiments: MoE variants, different attention mechanisms, optimal layer/dimension ratios
- Curriculum learning: Progressive difficulty in mathematical concepts during training
- Multi-task training: Combining pure language modeling, math-in-stories, and formal problem-solving
- Instruction tuning: Fine-tuned versions for interactive problem-solving and tutoring
We welcome community feedback, ablation suggestions, and collaboration on next-generation models.
Reproducibility
Training code and detailed experimental configurations are available in the model repository. We are committed to making this research fully reproducible and will continue to document our training procedures and hyperparameters.
Citation
If you use this model in your research, please cite:
```bibtex
@misc{tinymathstories_dense_130m_2025,
  author       = {AlgoDriveAI},
  title        = {tinymathstories-dense-130m: An Experimental Small Language Model for Math Reasoning in Stories},
  year         = {2025},
  publisher    = {HuggingFace},
  journal      = {HuggingFace Model Hub},
  howpublished = {\url{https://huggingface.co/AlgoDriveAI/tinymathstories-dense-130m}},
  note         = {Initial experimental release}
}
```
Related Work:
```bibtex
@article{eldan2023tinystories,
  title   = {TinyStories: How Small Can Language Models Be and Still Speak Coherent English?},
  author  = {Eldan, Ronen and Li, Yuanzhi},
  journal = {arXiv preprint arXiv:2305.07759},
  year    = {2023}
}
```
```bibtex
@dataset{algodriveai_2025_tinymathstories,
  title  = {TinyMathStories: TinyStories-style Corpus with Math and Reasoning},
  author = {AlgoDriveAI},
  year   = {2025},
  note   = {Hugging Face dataset}
}
```
Maintainer & Contact
AlgoDriveAI — This model represents our first attempt at combining narrative coherence with mathematical reasoning in tiny language models. We're actively seeking:
- Feedback on model performance and failure modes
- Suggestions for evaluation methodologies
- Collaborators for scaling experiments
- Use cases and downstream applications
For questions, issues, or collaboration inquiries, please open an issue on the model repository or contact us through the HuggingFace platform.
Disclaimer: This is a research prototype. Use at your own risk. The model may produce incorrect mathematical calculations, biased content, or incoherent text. Always verify outputs, especially for educational or decision-critical applications.