Boomerang Distillation Collection
Distilled models from the boomerang distillation paper (https://arxiv.org/abs/2510.05064). 6 items.
Boomerang distillation is a phenomenon in LLMs in which a teacher model is distilled into a smaller student, and teacher layers can then be reincorporated into the student to create intermediate-sized models with no additional training. This is the Llama student model from our paper.
This model was initialized from Llama-3.2-3B by copying the first two layers and every second layer thereafter. It was then distilled on 2.1B tokens of the deduplicated Pile with cross-entropy, KL, and cosine losses to match the activations of Llama-3.2-3B. We used the following hyperparameters:
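The three losses above are typically combined as a weighted sum over next-token predictions and matched hidden states. The sketch below is illustrative only; the loss weights, temperature, and hidden-state pairing are hypothetical placeholders, not the hyperparameters or exact objective from the paper.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, student_hidden, teacher_hidden, labels,
                      ce_weight=1.0, kl_weight=1.0, cos_weight=1.0, temperature=1.0):
    # Cross-entropy against the ground-truth next tokens.
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)), labels.view(-1))
    # KL divergence between the student's and teacher's next-token distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    )
    # Cosine loss pulling student hidden states toward the matched teacher activations.
    cos = 1.0 - F.cosine_similarity(student_hidden, teacher_hidden, dim=-1).mean()
    return ce_weight * ce + kl_weight * kl + cos_weight * cos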
To interpolate between this model and Llama-3.2-3B, please use the build_intermediate_model function from our GitHub repository:
import torch
from patching.patch import build_intermediate_model

# Build an intermediate-sized model by patching teacher layers back into the student
# (see the notes below on num_layers_to_patch and patch_first_k_layers).
intermediate_model = build_intermediate_model(
    teacher_name_or_path="meta-llama/Llama-3.2-3B",
    student_name_or_path="Harvard-DCML/boomerang-llama-3.2-1.9B",
    num_layers_to_patch=2,
    patch_first_k_layers=True,
    dtype=torch.bfloat16,
)
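Assuming the returned object behaves like a standard transformers causal LM (a hypothetical usage sketch, reusing the teacher's tokenizer):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")
inputs = tokenizer("Boomerang distillation enables", return_tensors="pt")
# Generate with the interpolated model exactly as with any other causal LM.
outputs = intermediate_model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))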
Notes:
num_layers_to_patch changes the size of the intermediate model by patching different numbers of student layers.
patch_first_k_layers should be set to True for this model for optimal interpolation performance.

Citation:

@article{kangaslahti2025boomerang,
title={Boomerang Distillation Enables Zero-Shot Model Size Interpolation},
author={Kangaslahti, Sara and Nayak, Nihal V and Geuter, Jonathan and Fumero, Marco and Locatello, Francesco and Alvarez-Melis, David},
journal={arXiv preprint arXiv:2510.05064},
year={2025},
url={https://arxiv.org/abs/2510.05064}
}