
Model Description

Boomerang distillation is a phenomenon in LLMs in which a teacher model is distilled into a smaller student, and teacher layers are then reincorporated into the student to create intermediate-sized models with no additional training. This is the Llama student model from our paper.

Training Procedure

This model was initialized from Llama-3.2-3B by copying the first two layers and every 2nd subsequent layer. It was distilled on 2.1B tokens of the deduplicated Pile using cross-entropy, KL-divergence, and cosine losses to match the activations of Llama-3.2-3B (a sketch of the combined objective follows the hyperparameter list below). We used the following hyperparameters:

  • Learning rate: 3e-4
  • Learning rate scheduler: cosine
  • Warmup ratio: 0.01
  • Optimizer: AdamW
  • Adam betas: (0.9, 0.95)
  • Adam epsilon: 1e-8
  • Weight decay: 0.1
  • Max. gradient norm: 1.0
  • Number of training steps: 500
  • Max. sequence length: 2048
  • Effective batch size: 2048
  • Mixed precision: bf16
  • KLDiv weight: 0.1
  • Cosine distance weight per layer: 0.125
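
For reference, below is a minimal sketch of how such a combined objective might be assembled from the weights listed above. The function name, tensor shapes, and layer pairing are illustrative assumptions, not the exact training code; see the repository for the actual implementation.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      student_hiddens, teacher_hiddens,
                      kl_weight=0.1, cos_weight=0.125):
    # Hypothetical helper for illustration only, not from the repository.
    # Next-token cross entropy against the ground-truth labels.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    # KL divergence between student and teacher token distributions (weight 0.1).
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    # Per-layer cosine distance between student hidden states and the matched
    # teacher hidden states (weight 0.125 per layer); the pairing of layers
    # here is an assumption.
    cos = sum(
        (1.0 - F.cosine_similarity(s, t, dim=-1)).mean()
        for s, t in zip(student_hiddens, teacher_hiddens)
    )
    return ce + kl_weight * kl + cos_weight * cos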

Use

To interpolate between this model and Llama-3.2-3B, please use the build_intermediate_model function from our GitHub repository:

import torch
from patching.patch import build_intermediate_model

# Build an intermediate-sized model by patching teacher layers from
# Llama-3.2-3B back into this distilled student.
intermediate_model = build_intermediate_model(
    teacher_name_or_path="meta-llama/Llama-3.2-3B",
    student_name_or_path="Harvard-DCML/boomerang-llama-3.2-1.9B",
    num_layers_to_patch=2,
    patch_first_k_layers=True,
    dtype=torch.bfloat16,
)

Notes:

  1. Changing num_layers_to_patch changes the size of the intermediate model by patching different numbers of student layers.
  2. patch_first_k_layers should be set to True for this model for optimal interpolation performance.
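
Once built, the intermediate model can be used for inference like an ordinary causal language model. The short example below is a sketch that assumes the returned object exposes the standard Hugging Face generate interface and is compatible with the teacher's tokenizer.

from transformers import AutoTokenizer

# Assumption: the intermediate model behaves like a standard Hugging Face
# causal LM and pairs with the teacher's tokenizer.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")
inputs = tokenizer("Boomerang distillation enables", return_tensors="pt")
outputs = intermediate_model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))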

Citation

@article{kangaslahti2025boomerang,
  title={Boomerang Distillation Enables Zero-Shot Model Size Interpolation},
  author={Kangaslahti, Sara and Nayak, Nihal V and Geuter, Jonathan and Fumero, Marco and Locatello, Francesco and Alvarez-Melis, David},
  journal={arXiv preprint arXiv:2510.05064},
  year={2025},
  url={https://arxiv.org/abs/2510.05064}
}