
Model Description

Boomerang distillation is a phenomenon in LLMs in which a teacher model is distilled into a smaller student, and teacher layers are then reincorporated into the student to create intermediate-sized models with no additional training. This is the Llama student model from our paper.

Training Procedure

This model was initialized from Llama-3.2-3B by copying the first two layers and every 2nd subsequent layer. It was distilled on 2.1B tokens of the deduplicated Pile using cross-entropy, KL-divergence, and cosine losses to match the activations of Llama-3.2-3B (a sketch of the combined objective follows the hyperparameter list below). We used the following hyperparameters:

  • Learning rate: 3e-4
  • Learning rate scheduler: cosine
  • Warmup ratio: 0.01
  • Optimizer: AdamW
  • Adam betas: (0.9, 0.95)
  • Adam epsilon: 1e-8
  • Weight decay: 0.1
  • Max. gradient norm: 1.0
  • Number of training steps: 500
  • Max. sequence length: 2048
  • Effective batch size: 2048
  • Mixed precision: bf16
  • KLDiv weight: 0.1
  • Cosine distance weight per layer: 0.125
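
For reference, below is a minimal sketch of how such a combined objective might be assembled from the weights listed above. The function name, tensor shapes, and layer pairing are illustrative assumptions, not the exact training code; see the repository for the actual implementation.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      student_hiddens, teacher_hiddens,
                      kl_weight=0.1, cos_weight=0.125):
    # Hypothetical helper for illustration only, not from the repository.
    # Next-token cross entropy against the ground-truth labels.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    # KL divergence between student and teacher token distributions (weight 0.1).
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    # Per-layer cosine distance between student hidden states and the matched
    # teacher hidden states (weight 0.125 per layer); the pairing of layers
    # here is an assumption.
    cos = sum(
        (1.0 - F.cosine_similarity(s, t, dim=-1)).mean()
        for s, t in zip(student_hiddens, teacher_hiddens)
    )
    return ce + kl_weight * kl + cos_weight * cos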

Use

To interpolate between this model and Llama-3.2-3B, please use the build_intermediate_model function from our GitHub repository:

import torch
from patching.patch import build_intermediate_model

# Build an intermediate-sized model by patching teacher layers from
# Llama-3.2-3B back into this distilled student.
intermediate_model = build_intermediate_model(
    teacher_name_or_path="meta-llama/Llama-3.2-3B",
    student_name_or_path="Harvard-DCML/boomerang-llama-3.2-1.9B",
    num_layers_to_patch=2,
    patch_first_k_layers=True,
    dtype=torch.bfloat16,
)

Notes:

  1. Changing num_layers_to_patch changes the size of the intermediate model by patching different numbers of student layers.
  2. patch_first_k_layers should be set to True for this model for optimal interpolation performance.
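
Once built, the intermediate model can be used for inference like an ordinary causal language model. The short example below is a sketch that assumes the returned object exposes the standard Hugging Face generate interface and is compatible with the teacher's tokenizer.

from transformers import AutoTokenizer

# Assumption: the intermediate model behaves like a standard Hugging Face
# causal LM and pairs with the teacher's tokenizer.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")
inputs = tokenizer("Boomerang distillation enables", return_tensors="pt")
outputs = intermediate_model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))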

Citation

@article{kangaslahti2025boomerang,
  title={Boomerang Distillation Enables Zero-Shot Model Size Interpolation},
  author={Kangaslahti, Sara and Nayak, Nihal V and Geuter, Jonathan and Fumero, Marco and Locatello, Francesco and Alvarez-Melis, David},
  journal={arXiv preprint arXiv:2510.05064},
  year={2025},
  url={https://arxiv.org/abs/2510.05064}
}