---
language: en
license: mit
datasets:
- fineweb-edu
tags:
- llama
- sparse
- llm
- sparse-pretraining
metrics:
- perplexity
arxiv: 2501.12486
---

# tjingrant/sparsellm-1b-40p

This is a sparse causal language model based on the LLaMA2-1B architecture, with 40% sparsity in its linear layers.

## Model Details

- **Model Type:** Sparse Causal Language Model
- **Base Model:** LLaMA2-1B
- **Sparsity Configuration:** 40% sparsity
- **Training Data:** FineWeb-Edu dataset
- **Tokenizer:** Same as the original LLaMA2 model
- **Perplexity:** 19.93 (measured on WikiText-103)
- **Parameter Counts:**
  - Total Parameters: 1.20B
  - Total Linear Parameters: 1.14B
  - Non-zero Linear Parameters: 0.68B
  - Linear Layer Sparsity: 40.00%
  - Average Linear Parameters During Training: 0.87B (Average Density: 0.7651)
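
The reported counts can be sanity-checked directly against the released weights. The snippet below is an illustrative sketch (not part of the original release); it assumes the checkpoint loads with the standard `transformers` LLaMA implementation and that pruned weights are stored as exact zeros.

```python
# Illustrative sketch: count zero entries in the nn.Linear weights of the
# checkpoint and report the resulting sparsity. Assumes pruned weights are
# stored as exact zeros; depending on whether the lm_head projection was
# pruned, the measured value may differ slightly from the reported 40%.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("tjingrant/sparsellm-1b-40p")

total, zeros = 0, 0
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        total += module.weight.numel()
        zeros += (module.weight == 0).sum().item()

print(f"Linear parameters: {total / 1e9:.2f}B")
print(f"Non-zero linear parameters: {(total - zeros) / 1e9:.2f}B")
print(f"Linear layer sparsity: {zeros / total:.2%}")
```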

## Training Parameters

- **Training Steps:** 13050
- **Batch Size:** 8M tokens (4096 sequences × 2048 tokens per sequence)
- **Learning Rate:** 0.0003
- **Total Training Tokens:** 104,400,000,000 (104.4B)
- **Final Training Loss:** 2.1374 ± 0.0134 (from last 1% of steps)
- **Pruning Start Step:** 2500
- **Pruning End Step:** 8875
- **Matching Dense Model:** [sparsellm-1b-40p-small-dense](https://huggingface.co/tjingrant/sparsellm-1b-40p-small-dense)
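
The hyperparameters above are related by simple bookkeeping: the token budget is Training Steps × Batch Size, and the average linear parameter count follows from the pruning schedule. The sketch below reproduces the token budget and illustrates the average-density calculation under a hypothetical linear density ramp between the pruning start and end steps; the authors' actual schedule is not specified here, so the result only approximates the reported 0.7651.

```python
# Illustrative sketch of the bookkeeping behind the hyperparameters above.
# The linear density ramp is an ASSUMPTION made for illustration only; the
# schedule actually used in training may differ.
steps        = 13_050
batch_tokens = 8_000_000      # "8M tokens" per step, as listed above
sparsity     = 0.40
prune_start  = 2_500
prune_end    = 8_875

total_tokens = steps * batch_tokens
print(f"Total training tokens: {total_tokens / 1e9:.1f}B")   # 104.4B

final_density = 1.0 - sparsity
dense_steps   = prune_start               # density stays at 1.0
ramp_steps    = prune_end - prune_start   # density ramps 1.0 -> 0.6 (assumed linear)
sparse_steps  = steps - prune_end         # density stays at 0.6

avg_density = (dense_steps * 1.0
               + ramp_steps * (1.0 + final_density) / 2
               + sparse_steps * final_density) / steps
print(f"Average linear density (linear-ramp assumption): {avg_density:.4f}")
# The card reports 0.7651, computed from the actual pruning schedule.
```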


## Performance and Training Details

The table below summarizes performance and parameter counts for all models in this series:

| Model | Total Params | Linear Params | Avg Linear Params | Non-Zero Linear | Sparsity | Batch Size | LR | Total Tokens | Final Train Loss | Perplexity |
|-------|--------------|---------------|------------------|----------------|----------|------------|-----|-------------|------------------|------------|
| [sparsellm-1b-20p](https://huggingface.co/tjingrant/sparsellm-1b-20p) | 1.20B | 1.14B | 1.02B | 0.91B | 20.00% | 8M | 3e-4 | 89.6B | 2.133 ± 0.022 | 19.58 |
| [sparsellm-1b-40p](https://huggingface.co/tjingrant/sparsellm-1b-40p) | 1.20B | 1.14B | 0.87B | 0.68B | 40.00% | 8M | 3e-4 | 104.4B | 2.137 ± 0.013 | 19.93 |
| [sparsellm-1b-60p](https://huggingface.co/tjingrant/sparsellm-1b-60p) | 1.20B | 1.14B | 0.69B | 0.46B | 60.00% | 8M | 3e-4 | 131.0B | 2.182 ± 0.017 | 20.80 |
| [sparsellm-1b-80p](https://huggingface.co/tjingrant/sparsellm-1b-80p) | 1.20B | 1.14B | 0.45B | 0.23B | 80.00% | 8M | 3e-4 | 200.4B | 2.228 ± 0.021 | 25.77 |
| [sparsellm-1b-20p-small-dense](https://huggingface.co/tjingrant/sparsellm-1b-20p-small-dense) | 1.07B | 1.01B | 1.01B | 1.01B | 0.00% | 8M | 3e-4 | 89.6B | 2.139 ± 0.022 | 19.49 |
| [sparsellm-1b-40p-small-dense](https://huggingface.co/tjingrant/sparsellm-1b-40p-small-dense) | 0.88B | 0.82B | 0.82B | 0.82B | 0.00% | 8M | 3e-4 | 104.4B | 2.161 ± 0.024 | 21.40 |
| [sparsellm-1b-60p-small-dense](https://huggingface.co/tjingrant/sparsellm-1b-60p-small-dense) | 0.70B | 0.65B | 0.65B | 0.65B | 0.00% | 8M | 3e-4 | 131.0B | 2.209 ± 0.021 | 22.58 |
| [sparsellm-1b-80p-small-dense](https://huggingface.co/tjingrant/sparsellm-1b-80p-small-dense) | 0.46B | 0.42B | 0.42B | 0.42B | 0.00% | 8M | 3e-4 | 200.4B | 2.237 ± 0.028 | 24.57 |

Notes:
- **Perplexity** is measured on WikiText-103
- **Batch Size** is given in tokens (samples × sequence length)
- **Total Tokens** = Training Steps × Batch Size
- **Final Train Loss** is computed from the last 1% of training steps (mean ± std)
- **Avg Linear Params** is the average number of active parameters during training, computed from the pruning schedule
- Rows 1-4 are sparse models; rows 5-8 are the matching dense models, whose parameter counts approximately match the sparse models' average parameter counts over pretraining
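
The exact evaluation protocol behind the perplexity column (context length, window stride, text preprocessing) is not given in this card. The snippet below is therefore only a rough sketch of a standard WikiText-103 perplexity measurement with non-overlapping windows, not the authors' exact setup.

```python
# Rough sketch of a WikiText-103 perplexity evaluation (not the authors' exact
# protocol): concatenate the test split, score it in fixed-length windows, and
# exponentiate the mean token-level loss.
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "tjingrant/sparsellm-1b-40p"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

test = load_dataset("wikitext", "wikitext-103-raw-v1", split="test")
ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids

window, losses = 2048, []   # 2048 matches the training sequence length above
with torch.no_grad():
    for start in range(0, ids.size(1) - window, window):
        chunk = ids[:, start:start + window]
        losses.append(model(chunk, labels=chunk).loss.item())

print(f"Perplexity: {math.exp(sum(losses) / len(losses)):.2f}")
```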


## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tjingrant/sparsellm-1b-40p")
model = AutoModelForCausalLM.from_pretrained("tjingrant/sparsellm-1b-40p")

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Citation

If you use this model in your research, please cite our paper:

**[The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws](https://arxiv.org/abs/2501.12486)**

```bibtex
@inproceedings{jin2025the,
  title={The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws},
  author={Tian Jin and Ahmed Imtiaz Humayun and Utku Evci and Suvinay Subramanian and Amir Yazdanbakhsh and Dan Alistarh and Gintare Karolina Dziugaite},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025}
}
```