---
language: en
license: mit
datasets:
- fineweb-edu
tags:
- llama
- sparse
- llm
- sparse-pretraining
metrics:
- perplexity
arxiv: 2501.12486
---
# tjingrant/sparsellm-1b-40p
This is a sparse causal language model based on the LLaMA2-1B architecture, with 40% sparsity in its linear layers.
## Model Details
- **Model Type:** Sparse Causal Language Model
- **Base Model:** LLaMA2-1B
- **Sparsity Configuration:** 40% sparsity
- **Training Data:** Trained on the Fineweb-Edu dataset
- **Tokenizer:** Same as the original LLaMA2 model
- **Perplexity:** 19.93 (measured on Wikitext-103)
- **Parameter Counts:**
- Total Parameters: 1.20B
- Total Linear Parameters: 1.14B
- Non-zero Linear Parameters: 0.68B
- Linear Layer Sparsity: 40.00%
- Average Linear Parameters During Training: 0.87B (Average Density: 0.7651)
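As a sanity check, the counts above can be approximately reproduced by counting zero entries in the model's linear layers. The sketch below is illustrative only: it assumes pruned weights are stored as exact zeros in the dense checkpoint tensors and that the card's "linear" count corresponds to `torch.nn.Linear` modules (e.g., whether the LM head is included may differ).

```python
# Minimal sketch: verify total / linear / non-zero parameter counts.
# Assumption: pruned weights are stored as exact zeros in dense tensors.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("tjingrant/sparsellm-1b-40p")

total_params = sum(p.numel() for p in model.parameters())

# Count parameters and non-zeros in nn.Linear weight matrices only.
linear_params, nonzero_linear = 0, 0
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        w = module.weight
        linear_params += w.numel()
        nonzero_linear += torch.count_nonzero(w).item()

sparsity = 1.0 - nonzero_linear / linear_params
print(f"Total params:           {total_params / 1e9:.2f}B")
print(f"Linear params:          {linear_params / 1e9:.2f}B")
print(f"Non-zero linear params: {nonzero_linear / 1e9:.2f}B")
print(f"Linear layer sparsity:  {sparsity:.2%}")
```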
## Training Parameters
- **Training Steps:** 13050
- **Batch Size:** 8M tokens (4096 sequences × 2048 tokens)
- **Learning Rate:** 0.0003
- **Total Training Tokens:** 104,400,000,000 (104.4B)
- **Final Training Loss:** 2.1374 ± 0.0134 (from last 1% of steps)
- **Pruning Start Step:** 2500
- **Pruning End Step:** 8875
- **Matching Dense Model:** [sparsellm-1b-40p-small-dense](https://huggingface.co/tjingrant/sparsellm-1b-40p-small-dense)
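The average linear parameter count reported in the model details follows from averaging the density implied by the pruning schedule over all training steps. The exact interpolation between the pruning start and end steps is not specified in this card, so the cubic ramp below (in the style of gradual magnitude pruning) is only an illustrative assumption and will not exactly reproduce the 0.7651 average density.

```python
# Minimal sketch: average linear parameters over training from a pruning schedule.
# The cubic ramp is an assumed interpolation, not the schedule used for this checkpoint.
TOTAL_STEPS = 13050
PRUNE_START = 2500
PRUNE_END = 8875
FINAL_SPARSITY = 0.40
LINEAR_PARAMS = 1.14e9  # linear parameters of the dense architecture

def sparsity_at(step: int) -> float:
    """Hypothetical cubic ramp from 0 to FINAL_SPARSITY between start and end."""
    if step < PRUNE_START:
        return 0.0
    if step >= PRUNE_END:
        return FINAL_SPARSITY
    frac = (step - PRUNE_START) / (PRUNE_END - PRUNE_START)
    return FINAL_SPARSITY * (1.0 - (1.0 - frac) ** 3)

avg_density = sum(1.0 - sparsity_at(s) for s in range(TOTAL_STEPS)) / TOTAL_STEPS
print(f"Average density:       {avg_density:.4f}")
print(f"Average linear params: {avg_density * LINEAR_PARAMS / 1e9:.2f}B")
```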
## Performance and Training Details
Here is the performance and parameter information for all models in this series:
| Model | Total Params | Linear Params | Avg Linear Params | Non-Zero Linear | Sparsity | Batch Size | LR | Total Tokens | Final Train Loss | Perplexity |
|-------|--------------|---------------|------------------|----------------|----------|------------|-----|-------------|------------------|------------|
| [sparsellm-1b-20p](https://huggingface.co/tjingrant/sparsellm-1b-20p) | 1.20B | 1.14B | 1.02B | 0.91B | 20.00% | 8M | 3e-4 | 89.6B | 2.133 ± 0.022 | 19.58 |
| [sparsellm-1b-40p](https://huggingface.co/tjingrant/sparsellm-1b-40p) | 1.20B | 1.14B | 0.87B | 0.68B | 40.00% | 8M | 3e-4 | 104.4B | 2.137 ± 0.013 | 19.93 |
| [sparsellm-1b-60p](https://huggingface.co/tjingrant/sparsellm-1b-60p) | 1.20B | 1.14B | 0.69B | 0.46B | 60.00% | 8M | 3e-4 | 131.0B | 2.182 ± 0.017 | 20.80 |
| [sparsellm-1b-80p](https://huggingface.co/tjingrant/sparsellm-1b-80p) | 1.20B | 1.14B | 0.45B | 0.23B | 80.00% | 8M | 3e-4 | 200.4B | 2.228 ± 0.021 | 25.77 |
| [sparsellm-1b-20p-small-dense](https://huggingface.co/tjingrant/sparsellm-1b-20p-small-dense) | 1.07B | 1.01B | 1.01B | 1.01B | 0.00% | 8M | 3e-4 | 89.6B | 2.139 ± 0.022 | 19.49 |
| [sparsellm-1b-40p-small-dense](https://huggingface.co/tjingrant/sparsellm-1b-40p-small-dense) | 0.88B | 0.82B | 0.82B | 0.82B | 0.00% | 8M | 3e-4 | 104.4B | 2.161 ± 0.024 | 21.40 |
| [sparsellm-1b-60p-small-dense](https://huggingface.co/tjingrant/sparsellm-1b-60p-small-dense) | 0.70B | 0.65B | 0.65B | 0.65B | 0.00% | 8M | 3e-4 | 131.0B | 2.209 ± 0.021 | 22.58 |
| [sparsellm-1b-80p-small-dense](https://huggingface.co/tjingrant/sparsellm-1b-80p-small-dense) | 0.46B | 0.42B | 0.42B | 0.42B | 0.00% | 8M | 3e-4 | 200.4B | 2.237 ± 0.028 | 24.57 |
Notes:
- **Perplexity** is measured on Wikitext-103
- **Batch Size** is given in tokens (samples × sequence length)
- **Total Tokens** = Training Steps × Batch Size
- **Final Train Loss** is computed from the last 1% of training steps (mean ± std)
- **Avg Linear Params** is the average number of active parameters during training, computed from the pruning schedule
- Rows 1-4 are the sparse models; rows 5-8 are the corresponding dense models, sized so that their parameter counts (approximately) match the sparse models' average parameter counts over pretraining.
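For reference, a sliding-window perplexity evaluation on Wikitext-103 can be run as sketched below. The context length and stride used for the numbers in the table are not specified here (2048/2048 is assumed), so treat the result as indicative rather than an exact reproduction.

```python
# Minimal sketch: sliding-window perplexity on the Wikitext-103 test split.
# Assumption: context length and stride of 2048; the card's evaluation setup may differ.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tjingrant/sparsellm-1b-40p"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

test = load_dataset("wikitext", "wikitext-103-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")
seq_len = encodings.input_ids.size(1)

max_length, stride = 2048, 2048  # assumed evaluation settings
nlls, n_tokens = [], 0
for begin in range(0, seq_len - 1, stride):
    end = min(begin + max_length, seq_len)
    input_ids = encodings.input_ids[:, begin:end]
    with torch.no_grad():
        out = model(input_ids, labels=input_ids)
    # out.loss is the mean NLL over the (end - begin - 1) predicted tokens.
    nlls.append(out.loss * (end - begin - 1))
    n_tokens += end - begin - 1
    if end == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).sum() / n_tokens)
print(f"Wikitext-103 perplexity: {ppl.item():.2f}")
```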
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("tjingrant/sparsellm-1b-40p")
model = AutoModelForCausalLM.from_pretrained("tjingrant/sparsellm-1b-40p")
inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Citation
If you use this model in your research, please cite our paper:
**[The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws](https://arxiv.org/abs/2501.12486)**
```bibtex
@inproceedings{jin2025the,
  title={The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws},
  author={Tian Jin and Ahmed Imtiaz Humayun and Utku Evci and Suvinay Subramanian and Amir Yazdanbakhsh and Dan Alistarh and Gintare Karolina Dziugaite},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025}
}
```