jiaojuncao committed (verified) · Commit da8c6ad · Parent: 0d93bab

Update README.md

Files changed (1): README.md (+5, −3)
README.md CHANGED
@@ -9,8 +9,12 @@ library_name: transformers
 
 ## Overview
 
+![overview](./pipeline.png)
+
 Visual encoders are fundamental components in vision-language models (VLMs), each showcasing unique strengths derived from various pre-trained visual foundation models. To leverage the complementary capabilities of these encoders, recent studies incorporate multiple encoders within a single VLM, leading to a considerable increase in computational cost. In this paper, we present Mixture-of-Visual-Encoder Knowledge Distillation (MoVE-KD), a novel framework that distills the unique proficiencies of multiple vision encoders into a single, efficient encoder model. Specifically, to mitigate conflicts and retain the unique characteristics of each teacher encoder, we employ low-rank adaptation (LoRA) and mixture-of-experts (MoE) to selectively activate specialized knowledge based on input features, enhancing both adaptability and efficiency. To regularize the KD process and enhance performance, we propose an attention-based distillation strategy that adaptively weights the different visual encoders and emphasizes valuable visual tokens, reducing the burden of replicating comprehensive but distinct features from multiple teachers.
 
+Code: https://github.com/hey-cjj/MoVE-KD
+
 ## MoVE-KD Weights
 | **Method** | **LLM** | **VQAv2** | **GQA** | **TextVQA** | **VizWiz** | **POPE** | **SQA** | **MME** | **MMB** |
 | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
@@ -19,6 +23,4 @@ Visual encoders are fundamental components in vision-language models (VLMs), eac
 | [MoVE-KD-v1.1](https://huggingface.co/jiaojuncao/MoVE-KD-7b-v1.1) | Vicuna-7B | 79.9 | 63.9 | 59.6 | 52.7 | 86.3 | 69.8 | 1509.1 | 67.4 |
 | LLaVA-v1.5 | Vicuna-13B | 80.0 | 63.3 | 61.3 | 53.6 | 85.9 | 71.6 | 1531.3 | 67.7 |
 | [MoVE-KD-v1.0](https://huggingface.co/jiaojuncao/MoVE-KD-13b-v1.0) | Vicuna-13B | 80.6 | 64.2 | 59.7 | 55.7 | 85.7 | 73.2 | 1568.1 | 70.2 |
 | [MoVE-KD-v1.1](https://huggingface.co/jiaojuncao/MoVE-KD-13b-v1.1) | Vicuna-13B | 80.8 | 63.9 | 61.1 | 57.5 | 86.3 | 71.8 | 1568.3 | 69.7 |
-
-Code: https://github.com/hey-cjj/MoVE-KD
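To make the overview above concrete, here is a minimal NumPy sketch of the two mechanisms the abstract describes: a set of LoRA "experts" (one per teacher encoder) gated by a router over input features, and an attention-weighted distillation loss that emphasizes salient visual tokens. All names, shapes, and the routing/weighting details are illustrative assumptions for this sketch, not the released MoVE-KD code (see the GitHub repository for the actual implementation).

```python
# Hypothetical sketch of MoE-of-LoRA adaptation and attention-weighted KD.
# All shapes and module names are toy illustrations, not the official code.
import numpy as np

rng = np.random.default_rng(0)
d, rank, n_experts, n_tokens = 16, 4, 3, 8  # toy dimensions

# Frozen base projection of one student-encoder layer.
W_base = rng.normal(size=(d, d))
# One low-rank (A_i, B_i) adapter pair per teacher encoder.
A = rng.normal(size=(n_experts, d, rank))
B = rng.normal(size=(n_experts, rank, d))
# Router scoring the experts from the mean token feature.
W_router = rng.normal(size=(d, n_experts))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def move_layer(x):
    """x: (n_tokens, d) -> features adapted by input-gated LoRA experts."""
    gate = softmax(x.mean(axis=0) @ W_router)                  # (n_experts,)
    delta = sum(g * (x @ A[i] @ B[i]) for i, g in enumerate(gate))
    return x @ W_base + delta, gate

x = rng.normal(size=(n_tokens, d))
student_feats, gate = move_layer(x)

# Attention-based distillation: weight the per-token feature loss by a
# teacher attention distribution so valuable tokens dominate the loss.
teacher_feats = rng.normal(size=(n_tokens, d))
token_attn = softmax(rng.normal(size=n_tokens))                # (n_tokens,)
per_token = ((student_feats - teacher_feats) ** 2).mean(axis=1)
kd_loss = float((token_attn * per_token).sum())
```

In the paper's setting the gate would mix knowledge distilled from each teacher, and the attention weights would come from the teachers themselves; here both are random placeholders so the sketch stays self-contained.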