Update README.md
README.md CHANGED

@@ -12,6 +12,7 @@ tags:
 - vision
 - ocr
 - custom_code
+- moe
 ---
 
 # Mono-InternVL-2B
@@ -30,7 +31,7 @@ tags:
 
 ## Introduction
 
-We release Mono-InternVL, a **monolithic** multimodal large language model (MLLM) that integrates visual encoding and textual decoding into a single LLM. In Mono-InternVL, a set of visual experts is embedded into the pre-trained LLM via a mixture-of-experts mechanism. By freezing the LLM, Mono-InternVL ensures that visual capabilities are optimized without compromising the pre-trained language knowledge. Based on this structure, an innovative Endogenous Visual Pretraining (EViP) is introduced to realize coarse-to-fine visual learning.
+We release Mono-InternVL, a **monolithic** multimodal large language model (MLLM) that integrates visual encoding and textual decoding into a single LLM. In Mono-InternVL, a set of visual experts is embedded into the pre-trained LLM via a mixture-of-experts (MoE) mechanism. By freezing the LLM, Mono-InternVL ensures that visual capabilities are optimized without compromising the pre-trained language knowledge. Based on this structure, an innovative Endogenous Visual Pretraining (EViP) is introduced to realize coarse-to-fine visual learning.
@@ -38,7 +39,7 @@ Mono-InternVL achieves superior performance compared to state-of-the-art MLLM Mi
 
-This repository contains the instruction-tuned Mono-InternVL-2B model. It is built upon [internlm2-chat-1_8b](https://huggingface.co/internlm/internlm2-chat-1_8b). For more details, please refer to our [paper](https://arxiv.org/abs/2410.08202).
+This repository contains the instruction-tuned Mono-InternVL-2B model, which has 1.8B activated parameters (3B in total). It is built upon [internlm2-chat-1_8b](https://huggingface.co/internlm/internlm2-chat-1_8b). For more details, please refer to our [paper](https://arxiv.org/abs/2410.08202).
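The changed introduction describes visual experts embedded into a frozen pre-trained LLM via a mixture-of-experts mechanism. The sketch below illustrates the general idea of routing tokens by modality to a frozen text FFN or a trainable visual-expert FFN; the class name, dimensions, and routing-by-mask interface are hypothetical, not the model's actual custom code.

```python
import torch
import torch.nn as nn

class ModalityMoEFFN(nn.Module):
    """Illustrative modality-routed MoE feed-forward layer (not the real model):
    text tokens pass through the frozen pre-trained text FFN, while visual
    tokens pass through a newly added, trainable visual-expert FFN."""
    def __init__(self, d_model=64, d_ff=256):
        super().__init__()
        self.text_ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.visual_ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        # Freeze the text expert so pre-trained language knowledge is preserved.
        for p in self.text_ffn.parameters():
            p.requires_grad = False

    def forward(self, hidden, is_visual):
        # hidden: [batch, seq, d_model]; is_visual: bool mask [batch, seq]
        # that statically routes each token to one expert by its modality.
        out = torch.empty_like(hidden)
        out[~is_visual] = self.text_ffn(hidden[~is_visual])
        out[is_visual] = self.visual_ffn(hidden[is_visual])
        return out

layer = ModalityMoEFFN()
h = torch.randn(2, 8, 64)
mask = torch.zeros(2, 8, dtype=torch.bool)
mask[:, :4] = True  # pretend the first 4 tokens are visual patches
out = layer(h, mask)
print(out.shape)  # torch.Size([2, 8, 64])
```

Because only `visual_ffn` receives gradients, training on image-text data optimizes visual capability without touching the language weights, which is the property the paragraph attributes to freezing the LLM.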