---
license: llama3.2
datasets:
- EleutherAI/the_pile_deduplicated
---
## Model Description
Boomerang distillation is a phenomenon in LLMs where a teacher model is distilled into a smaller student, and teacher layers can then be reincorporated into the student to create intermediate-sized models with no additional training. This is the Llama student model from [our paper](https://arxiv.org/abs/2510.05064).
## Training Procedure
This model was initialized from Llama-3.2-3B by copying the first two layers and every second subsequent layer. It was then distilled on 2.1B tokens of the deduplicated Pile with cross-entropy, KL-divergence, and per-layer cosine losses to match the activations of Llama-3.2-3B (see the sketch after the hyperparameter list). We used the following hyperparameters:
- Learning rate: 3e-4
- Learning rate scheduler: cosine
- Warmup ratio: 0.01
- Optimizer: AdamW
- Adam betas: (0.9, 0.95)
- Adam epsilon: 1e-8
- Weight decay: 0.1
- Max. gradient norm: 1.0
- Number of training steps: 500
- Max. sequence length: 2048
- Effective batch size: 2048
- Mixed precision: bf16
- KLDiv weight: 0.1
- Cosine distance weight per layer: 0.125
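For concreteness, here is a minimal sketch of how these loss terms might be combined with the weights above. This is an illustration only, not the training code from our repository; the tensor names, shapes, and the exact pairing of student and teacher layers are assumptions.
```python3
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      student_hidden, teacher_hidden,
                      kl_weight=0.1, cos_weight=0.125):
    # Hypothetical combination of the three losses listed above; the exact
    # formulation in the paper and repository may differ.
    vocab = student_logits.size(-1)
    # Standard next-token cross-entropy against the ground-truth tokens.
    ce = F.cross_entropy(student_logits.reshape(-1, vocab), labels.reshape(-1))
    # KL divergence between the student's and teacher's token distributions.
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1),
                  reduction="batchmean")
    # Cosine distance between paired student/teacher hidden states,
    # weighted at 0.125 per layer as in the hyperparameter list.
    cos = sum(1.0 - F.cosine_similarity(s, t, dim=-1).mean()
              for s, t in zip(student_hidden, teacher_hidden))
    return ce + kl_weight * kl + cos_weight * cos
```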
## Use
To interpolate between this model and Llama-3.2-3B, please use the `build_intermediate_model` function from [our GitHub repository](https://github.com/dcml-lab/boomerang-distillation):
```python3
import torch

# build_intermediate_model loads both checkpoints and patches teacher
# layers back into the student to form an intermediate-sized model.
from patching.patch import build_intermediate_model

intermediate_model = build_intermediate_model(
    teacher_name_or_path="meta-llama/Llama-3.2-3B",
    student_name_or_path="Harvard-DCML/boomerang-llama-3.2-1.9B",
    num_layers_to_patch=2,       # how many student layers to patch (see note 1)
    patch_first_k_layers=True,   # patch starting from the first layers (see note 2)
    dtype=torch.bfloat16,
)
```
Notes:
1. Changing `num_layers_to_patch` changes the size of the intermediate model by patching different numbers of student layers.
2. `patch_first_k_layers` should be set to `True` for this model for optimal interpolation performance.
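Continuing from the snippet above, and assuming `build_intermediate_model` returns a standard Hugging Face causal LM (a sketch, not verified against the repository), the intermediate model can then be used like any other checkpoint:
```python3
from transformers import AutoTokenizer

# Assumption: the intermediate model uses Llama-3.2-3B's tokenizer,
# since the student was initialized from that checkpoint.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")
inputs = tokenizer("Boomerang distillation is", return_tensors="pt")
outputs = intermediate_model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```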
## Citation
```
@article{kangaslahti2025boomerang,
  title={Boomerang Distillation Enables Zero-Shot Model Size Interpolation},
  author={Kangaslahti, Sara and Nayak, Nihal V and Geuter, Jonathan and Fumero, Marco and Locatello, Francesco and Alvarez-Melis, David},
  journal={arXiv preprint arXiv:2510.05064},
  year={2025},
  url={https://arxiv.org/abs/2510.05064}
}
```