---
language: en
license: mit
datasets:
- fineweb-edu
tags:
- llama
- sparse
- llm
- sparse-pretraining
metrics:
- perplexity
arxiv: 2501.12486
---

# tjingrant/sparsellm-1b-40p

This is a sparse language model based on LLaMA2-1B with 40% sparsity.

## Model Details

- **Model Type:** Sparse Causal Language Model
- **Base Model:** LLaMA2-1B
- **Sparsity Configuration:** 40% sparsity
- **Training Data:** Fineweb-Edu
- **Tokenizer:** Same as the original LLaMA2 model
- **Perplexity:** 19.93 (measured on Wikitext-103)
- **Parameter Counts** (see the *Verifying Sparsity* sketch below for one way to check these from the released weights):
  - Total Parameters: 1.20B
  - Total Linear Parameters: 1.14B
  - Non-zero Linear Parameters: 0.68B
  - Linear Layer Sparsity: 40.00%
  - Average Linear Parameters During Training: 0.87B (Average Density: 0.7651)

## Training Parameters

- **Training Steps:** 13050
- **Batch Size:** 8M tokens (4096 × 2048)
- **Learning Rate:** 0.0003
- **Total Training Tokens:** 104,400,000,000 (104.4B)
- **Final Training Loss:** 2.1374 ± 0.0134 (mean ± std over the last 1% of steps)
- **Pruning Start Step:** 2500
- **Pruning End Step:** 8875
- **Matching Dense Model:** [sparsellm-1b-40p-small-dense](https://huggingface.co/tjingrant/sparsellm-1b-40p-small-dense)

## Performance and Training Details

Performance and parameter information for all models in this series:

| Model | Total Params | Linear Params | Avg Linear Params | Non-Zero Linear | Sparsity | Batch Size | LR | Total Tokens | Final Train Loss | Perplexity |
|-------|--------------|---------------|-------------------|-----------------|----------|------------|-----|--------------|------------------|------------|
| [sparsellm-1b-20p](https://huggingface.co/tjingrant/sparsellm-1b-20p) | 1.20B | 1.14B | 1.02B | 0.91B | 20.00% | 8M | 3e-4 | 89.6B | 2.133 ± 0.022 | 19.58 |
| [sparsellm-1b-40p](https://huggingface.co/tjingrant/sparsellm-1b-40p) | 1.20B | 1.14B | 0.87B | 0.68B | 40.00% | 8M | 3e-4 | 104.4B | 2.137 ± 0.013 | 19.93 |
| [sparsellm-1b-60p](https://huggingface.co/tjingrant/sparsellm-1b-60p) | 1.20B | 1.14B | 0.69B | 0.46B | 60.00% | 8M | 3e-4 | 131.0B | 2.182 ± 0.017 | 20.80 |
| [sparsellm-1b-80p](https://huggingface.co/tjingrant/sparsellm-1b-80p) | 1.20B | 1.14B | 0.45B | 0.23B | 80.00% | 8M | 3e-4 | 200.4B | 2.228 ± 0.021 | 25.77 |
| [sparsellm-1b-20p-small-dense](https://huggingface.co/tjingrant/sparsellm-1b-20p-small-dense) | 1.07B | 1.01B | 1.01B | 1.01B | 0.00% | 8M | 3e-4 | 89.6B | 2.139 ± 0.022 | 19.49 |
| [sparsellm-1b-40p-small-dense](https://huggingface.co/tjingrant/sparsellm-1b-40p-small-dense) | 0.88B | 0.82B | 0.82B | 0.82B | 0.00% | 8M | 3e-4 | 104.4B | 2.161 ± 0.024 | 21.40 |
| [sparsellm-1b-60p-small-dense](https://huggingface.co/tjingrant/sparsellm-1b-60p-small-dense) | 0.70B | 0.65B | 0.65B | 0.65B | 0.00% | 8M | 3e-4 | 131.0B | 2.209 ± 0.021 | 22.58 |
| [sparsellm-1b-80p-small-dense](https://huggingface.co/tjingrant/sparsellm-1b-80p-small-dense) | 0.46B | 0.42B | 0.42B | 0.42B | 0.00% | 8M | 3e-4 | 200.4B | 2.237 ± 0.028 | 24.57 |

Notes:

- **Perplexity** is measured on Wikitext-103.
- **Batch Size** is given in tokens (samples × sequence length).
- **Total Tokens** = Training Steps × Batch Size.
- **Final Train Loss** is the mean ± std computed over the last 1% of training steps.
- **Avg Linear Params** is the average number of active linear parameters during training, computed from the pruning schedule.
- Rows 1-4 are sparse models; rows 5-8 are their dense counterparts, sized so that their average parameter count over pretraining approximately matches that of the corresponding sparse model.
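## Verifying Sparsity

The sparsity and parameter counts above can be checked directly from the released checkpoint. The snippet below is an illustrative sketch rather than part of the official release: it assumes the pruned weights are stored as explicit zeros in every `nn.Linear` module, and it does not try to reproduce exactly which layers the card counts toward the 1.14B linear parameters (for example, whether the output projection is included), so the totals it prints may differ slightly from the numbers reported here.

```python
import torch
from torch import nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("tjingrant/sparsellm-1b-40p")

total, nonzero = 0, 0
for module in model.modules():
    # Count weights stored in linear layers; biases and embeddings are ignored.
    if isinstance(module, nn.Linear):
        weight = module.weight
        total += weight.numel()
        nonzero += torch.count_nonzero(weight).item()

print(f"Linear parameters:          {total / 1e9:.2f}B")
print(f"Non-zero linear parameters: {nonzero / 1e9:.2f}B")
print(f"Linear layer sparsity:      {100 * (1 - nonzero / total):.2f}%")
```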
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tjingrant/sparsellm-1b-40p"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Citation

If you use this model in your research, please cite our paper:

**[The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws](https://arxiv.org/abs/2501.12486)**

```bibtex
@inproceedings{jin2025the,
  title={The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws},
  author={Tian Jin and Ahmed Imtiaz Humayun and Utku Evci and Suvinay Subramanian and Amir Yazdanbakhsh and Dan Alistarh and Gintare Karolina Dziugaite},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025}
}
```
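## Perplexity Evaluation (Sketch)

This card does not specify the exact protocol behind the reported Wikitext-103 perplexity of 19.93, so the following is only a rough sketch of a standard sliding-window evaluation on the Wikitext-103 test set. The dataset variant (`wikitext-103-raw-v1`), the 2048-token window, and the non-overlapping stride are assumptions chosen to match the 2048-token training sequence length; the result may therefore not match the reported number exactly.

```python
import math

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tjingrant/sparsellm-1b-40p"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device).eval()

# Concatenate the Wikitext-103 test split into one long token stream.
test = load_dataset("wikitext", "wikitext-103-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

window = 2048  # assumed context length, matching the training sequence length
nlls, n_predicted = [], 0

for begin in range(0, encodings.input_ids.size(1) - 1, window):
    input_ids = encodings.input_ids[:, begin:begin + window].to(device)
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean NLL over the
        # (length - 1) tokens it predicts inside this window.
        loss = model(input_ids, labels=input_ids).loss
    n = input_ids.size(1) - 1
    nlls.append(loss.double() * n)
    n_predicted += n

ppl = math.exp((torch.stack(nlls).sum() / n_predicted).item())
print(f"Wikitext-103 perplexity: {ppl:.2f}")
```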