The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws
Paper: [arXiv:2501.12486](https://arxiv.org/abs/2501.12486)
This is a small dense language model that serves as a quality reference for the sparsellm-1b model trained at 80% sparsity.
Here is the performance and parameter information for all models in this series:
| Model | Total Params | Linear Params | Avg Linear Params | Non-Zero Linear Params | Sparsity | Batch Size | LR | Total Tokens | Final Train Loss | Perplexity |
|---|---|---|---|---|---|---|---|---|---|---|
| sparsellm-1b-20p | 1.20B | 1.14B | 1.02B | 0.91B | 20.00% | 8M | 3e-4 | 89.6B | 2.133 ± 0.022 | 19.58 |
| sparsellm-1b-40p | 1.20B | 1.14B | 0.87B | 0.68B | 40.00% | 8M | 3e-4 | 104.4B | 2.137 ± 0.013 | 19.93 |
| sparsellm-1b-60p | 1.20B | 1.14B | 0.69B | 0.46B | 60.00% | 8M | 3e-4 | 131.0B | 2.182 ± 0.017 | 20.80 |
| sparsellm-1b-80p | 1.20B | 1.14B | 0.45B | 0.23B | 80.00% | 8M | 3e-4 | 200.4B | 2.228 ± 0.021 | 25.77 |
| sparsellm-1b-20p-small-dense | 1.07B | 1.01B | 1.01B | 1.01B | 0.00% | 8M | 3e-4 | 89.6B | 2.139 ± 0.022 | 19.49 |
| sparsellm-1b-40p-small-dense | 0.88B | 0.82B | 0.82B | 0.82B | 0.00% | 8M | 3e-4 | 104.4B | 2.161 ± 0.024 | 21.40 |
| sparsellm-1b-60p-small-dense | 0.70B | 0.65B | 0.65B | 0.65B | 0.00% | 8M | 3e-4 | 131.0B | 2.209 ± 0.021 | 22.58 |
| sparsellm-1b-80p-small-dense | 0.46B | 0.42B | 0.42B | 0.42B | 0.00% | 8M | 3e-4 | 200.4B | 2.237 ± 0.028 | 24.57 |
Notes:
- Avg Linear Params is the number of non-zero linear parameters averaged over the pre-training run; because sparsity is introduced gradually, it falls between the dense Linear Params count and the final Non-Zero Linear Params count (see the sketch below for intuition).
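As a rough illustration of how such an average arises, the sketch below averages the non-zero linear parameter count over a hypothetical cubic pruning schedule (the schedule parameters here are assumptions for illustration, not the schedule used to train these models):

```python
# Sketch: average non-zero linear parameter count over pre-training.
# Assumes a cubic ramp from 0% to the target sparsity between `start` and `end`
# of training -- a common choice, but hypothetical for these checkpoints.

def sparsity_at(t: float, target: float, start: float = 0.1, end: float = 0.8) -> float:
    """Fraction of linear weights pruned at training progress t in [0, 1]."""
    if t <= start:
        return 0.0
    if t >= end:
        return target
    frac = (t - start) / (end - start)
    return target * (1.0 - (1.0 - frac) ** 3)

def avg_linear_params(dense_linear: float, target: float, steps: int = 10_000) -> float:
    """Average non-zero linear parameters over the run (same units as dense_linear)."""
    counts = [dense_linear * (1.0 - sparsity_at((i + 0.5) / steps, target)) for i in range(steps)]
    return sum(counts) / steps

# Example: 1.14B dense linear params pruned to 80% sparsity.
print(f"{avg_linear_params(1.14, 0.80):.2f}B average linear params")  # roughly 0.4-0.5B
```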
Load the model and tokenizer with the Hugging Face Transformers library and generate text:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model weights from the Hub.
tokenizer = AutoTokenizer.from_pretrained("{model_name}")
model = AutoModelForCausalLM.from_pretrained("{model_name}")

# Tokenize a prompt and generate a continuation of up to 50 tokens (prompt included).
inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
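To sanity-check the Sparsity and Non-Zero Linear Params columns for a sparse checkpoint, you can count zero-valued weights in the linear layers after loading. A minimal sketch, assuming the pruned weights are stored as explicit zeros in the released checkpoints:

```python
import torch
from transformers import AutoModelForCausalLM

# Substitute the checkpoint you loaded above; "{model_name}" is a placeholder.
model = AutoModelForCausalLM.from_pretrained("{model_name}")

total, nonzero = 0, 0
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        w = module.weight
        total += w.numel()
        nonzero += torch.count_nonzero(w).item()

print(f"linear params: {total / 1e9:.2f}B, non-zero: {nonzero / 1e9:.2f}B, "
      f"sparsity: {1 - nonzero / total:.2%}")
```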
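The Perplexity column can in principle be reproduced by exponentiating the average token-level cross-entropy on an evaluation text. The evaluation corpus behind the reported numbers is not specified here, so the sketch below only shows the mechanics on an arbitrary placeholder string:

```python
import math
import torch

# Reuses `model` and `tokenizer` from the loading example above.
text = "Language models can be pruned during pre-training."  # placeholder evaluation text
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels provided, the model returns the mean cross-entropy (in nats)
    # over the predicted tokens.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"perplexity on this snippet: {math.exp(loss.item()):.2f}")
```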
If you use this model in your research, please cite our paper:
The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws
```bibtex
@inproceedings{
  jin2025the,
  title={The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws},
  author={Tian Jin and Ahmed Imtiaz Humayun and Utku Evci and Suvinay Subramanian and Amir Yazdanbakhsh and Dan Alistarh and Gintare Karolina Dziugaite},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025}
}
```