---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# [【CVPR 2025】MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders](https://arxiv.org/abs/2501.01709)

[Jiajun Cao](https://scholar.google.com.hk/citations?user=femNsd0AAAAJ&hl=zh-CN), [Yuan Zhang](https://scholar.google.com.hk/citations?hl=zh-CN&user=dXj1WskAAAAJ), [Tao Huang](https://scholar.google.com.hk/citations?user=jkcRdBgAAAAJ&hl=zh-CN), Ming Lu, Qizhe Zhang, Ruichuan An, Ningning MA, [Shanghang Zhang](https://scholar.google.com.hk/citations?user=voqw10cAAAAJ&hl=zh-CN)
## Overview

![MoVE-KD framework overview](framework.png)
Visual encoders are fundamental components in vision-language models (VLMs), each showcasing unique strengths derived from different pre-trained visual foundation models. To leverage the diverse capabilities of these encoders, recent studies incorporate multiple encoders within a single VLM, which leads to a considerable increase in computational cost. In this paper, we present Mixture-of-Visual-Encoder Knowledge Distillation (MoVE-KD), a novel framework that distills the unique proficiencies of multiple vision encoders into a single, efficient encoder model. Specifically, to mitigate conflicts and retain the unique characteristics of each teacher encoder, we employ low-rank adaptation (LoRA) and a mixture-of-experts (MoE) mechanism to selectively activate specialized knowledge based on input features, enhancing both adaptability and efficiency. To regularize the KD process and enhance performance, we propose an attention-based distillation strategy that adaptively weights the different visual encoders and emphasizes valuable visual tokens, reducing the burden of replicating comprehensive but distinct features from multiple teachers.
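
As a rough illustration of the idea above, the sketch below wraps a single frozen projection with several LoRA adapters (one per teacher encoder) and a router that mixes them per visual token, together with an attention-weighted per-token distillation loss. All names (`LoRAExpert`, `MoLoRALinear`, `attention_weighted_kd_loss`) and sizes are hypothetical placeholders for exposition; this is not the released MoVE-KD implementation (see the code link below).

```python
# Illustrative sketch only (PyTorch): a frozen projection augmented with a
# mixture of LoRA experts, plus an attention-weighted per-token distillation
# loss. Class/function names and sizes are hypothetical, not the official code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAExpert(nn.Module):
    """One low-rank adapter: x -> B(A(x)), initialized as a no-op."""

    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


class MoLoRALinear(nn.Module):
    """Frozen base projection plus a router-weighted sum of LoRA experts."""

    def __init__(self, dim: int, num_experts: int = 4, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        for p in self.base.parameters():            # keep the pre-trained weights frozen
            p.requires_grad_(False)
        self.experts = nn.ModuleList(LoRAExpert(dim, rank) for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts)   # input-dependent expert weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gates = F.softmax(self.router(x), dim=-1)                        # (B, N, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)   # (B, N, D, E)
        mixed = (expert_out * gates.unsqueeze(-2)).sum(dim=-1)           # (B, N, D)
        return self.base(x) + mixed


def attention_weighted_kd_loss(student: torch.Tensor,
                               teacher: torch.Tensor,
                               attn: torch.Tensor) -> torch.Tensor:
    """Per-token MSE weighted by attention scores, so salient tokens dominate."""
    weights = attn / attn.sum(dim=-1, keepdim=True)                      # (B, N)
    return (weights.unsqueeze(-1) * (student - teacher) ** 2).mean()


# Smoke test on random visual tokens (batch=2, tokens=576, dim=1024).
tokens = torch.randn(2, 576, 1024)
layer = MoLoRALinear(dim=1024, num_experts=4, rank=8)
student_feat = layer(tokens)
teacher_feat = torch.randn_like(student_feat)
loss = attention_weighted_kd_loss(student_feat, teacher_feat, torch.rand(2, 576))
print(student_feat.shape, loss.item())
```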

Code: https://github.com/hey-cjj/MoVE-KD

## MoVE-KD Weights

| **Method** | **LLM** | **VQAv2** | **GQA** | **TextVQA** | **VizWiz** | **POPE** | **SQA** | **MME** | **MMB** |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| LLaVA-v1.5 | Vicuna-7B | 78.5 | 62.0 | 58.2 | 50.0 | 85.9 | 66.8 | 1510.7 | 64.3 |
| [MoVE-KD-v1.0](https://huggingface.co/jiaojuncao/MoVE-KD-7b-v1.0) | Vicuna-7B | 79.5 | 63.2 | 58.3 | 52.3 | 86.9 | 69.3 | 1524.5 | 66.3 |
| [MoVE-KD-v1.1](https://huggingface.co/jiaojuncao/MoVE-KD-7b-v1.1) | Vicuna-7B | 79.9 | 63.9 | 59.6 | 52.7 | 86.3 | 69.8 | 1509.1 | 67.4 |
| LLaVA-v1.5 | Vicuna-13B | 80.0 | 63.3 | 61.3 | 53.6 | 85.9 | 71.6 | 1531.3 | 67.7 |
| [MoVE-KD-v1.0](https://huggingface.co/jiaojuncao/MoVE-KD-13b-v1.0) | Vicuna-13B | 80.6 | 64.2 | 59.7 | 55.7 | 85.7 | 73.2 | 1568.1 | 70.2 |
| [MoVE-KD-v1.1](https://huggingface.co/jiaojuncao/MoVE-KD-13b-v1.1) | Vicuna-13B | 80.8 | 63.9 | 61.1 | 57.5 | 86.3 | 71.8 | 1568.3 | 69.7 |
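
For reference, a minimal loading sketch is given below. It assumes these checkpoints follow the original LLaVA-v1.5 layout and that the repository linked above exposes the same `load_pretrained_model` builder as upstream LLaVA; consult the GitHub README for the exact, supported entry point.

```python
# Hypothetical usage sketch, assuming a LLaVA-v1.5-style builder is provided by
# the MoVE-KD repo (https://github.com/hey-cjj/MoVE-KD); not verified against it.
from llava.mm_utils import get_model_name_from_path
from llava.model.builder import load_pretrained_model

model_path = "jiaojuncao/MoVE-KD-7b-v1.1"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
)
print(type(model).__name__, context_len)
```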