---
license: llama3.2
datasets:
- EleutherAI/the_pile_deduplicated
---
## Model Description
Boomerang distillation is a phenomenon in LLMs where a teacher model is distilled into a smaller student, and teacher layers can then be reincorporated into the student to create intermediate-sized models with no additional training. This is the Llama student model from [our paper](https://arxiv.org/abs/2510.05064).
## Training Procedure
This model was initialized from Llama-3.2-3B by copying the first two layers and every second subsequent layer. It was then distilled on 2.1B tokens of the deduplicated Pile with cross-entropy, KL-divergence, and per-layer cosine losses to match the activations of Llama-3.2-3B (see the sketch after the hyperparameter list). We used the following hyperparameters:
- Learning rate: 3e-4
- Learning rate scheduler: cosine
- Warmup ratio: 0.01
- Optimizer: AdamW
- Adam betas: (0.9, 0.95)
- Adam epsilon: 1e-8
- Weight decay: 0.1
- Max. gradient norm: 1.0
- Number of training steps: 500
- Max. sequence length: 2048
- Effective batch size: 2048
- Mixed precision: bf16
- KLDiv weight: 0.1
- Cosine distance weight per layer: 0.125
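For concreteness, here is a minimal sketch of how these loss terms might be combined with the weights above. This is an illustration only, not the training code from our repository; the tensor names, shapes, and the exact pairing of student and teacher layers are assumptions.
```python3
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      student_hidden, teacher_hidden,
                      kl_weight=0.1, cos_weight=0.125):
    # Hypothetical combination of the three losses listed above; the exact
    # formulation in the paper and repository may differ.
    vocab = student_logits.size(-1)
    # Standard next-token cross-entropy against the ground-truth tokens.
    ce = F.cross_entropy(student_logits.reshape(-1, vocab), labels.reshape(-1))
    # KL divergence between the student's and teacher's token distributions.
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1),
                  reduction="batchmean")
    # Cosine distance between paired student/teacher hidden states,
    # weighted at 0.125 per layer as in the hyperparameter list.
    cos = sum(1.0 - F.cosine_similarity(s, t, dim=-1).mean()
              for s, t in zip(student_hidden, teacher_hidden))
    return ce + kl_weight * kl + cos_weight * cos
```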
## Use
To interpolate between this model and Llama-3.2-3B, please use the `build_intermediate_model` function from [our GitHub repository](https://github.com/dcml-lab/boomerang-distillation):
```python3
import torch

# build_intermediate_model loads both checkpoints and patches teacher
# layers back into the student to form an intermediate-sized model.
from patching.patch import build_intermediate_model

intermediate_model = build_intermediate_model(
    teacher_name_or_path="meta-llama/Llama-3.2-3B",
    student_name_or_path="Harvard-DCML/boomerang-llama-3.2-1.9B",
    num_layers_to_patch=2,       # how many student layers to patch (see note 1)
    patch_first_k_layers=True,   # patch starting from the first layers (see note 2)
    dtype=torch.bfloat16,
)
```
Notes:
1. Changing `num_layers_to_patch` changes the size of the intermediate model by patching different numbers of student layers.
2. `patch_first_k_layers` should be set to `True` for this model for optimal interpolation performance.
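Continuing from the snippet above, and assuming `build_intermediate_model` returns a standard Hugging Face causal LM (a sketch, not verified against the repository), the intermediate model can then be used like any other checkpoint:
```python3
from transformers import AutoTokenizer

# Assumption: the intermediate model uses Llama-3.2-3B's tokenizer,
# since the student was initialized from that checkpoint.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")
inputs = tokenizer("Boomerang distillation is", return_tensors="pt")
outputs = intermediate_model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```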
## Citation
```
@article{kangaslahti2025boomerang,
  title={Boomerang Distillation Enables Zero-Shot Model Size Interpolation},
  author={Kangaslahti, Sara and Nayak, Nihal V and Geuter, Jonathan and Fumero, Marco and Locatello, Francesco and Alvarez-Melis, David},
  journal={arXiv preprint arXiv:2510.05064},
  year={2025},
  url={https://arxiv.org/abs/2510.05064}
}
```