# InternVL Documentation

Release 2.0, LMDeploy Contributors

Last updated: May 29, 2025

## Get Started

### Installation

- Clone the repository:

  ```
  git clone https://github.com/OpenGVLab/InternVL.git
  cd InternVL
  ```

- Create and activate a conda environment:

  ```
  conda create -n internvl python=3.10
  conda activate internvl
  ```

- Install core dependencies:

  ```
  pip install -r requirements.txt
  ```

  By default, `requirements.txt` includes:

  - `-r requirements/internvl_chat.txt`
  - `-r requirements/streamlit_demo.txt`
  - `-r requirements/classification.txt`
  - `-r requirements/segmentation.txt`

  `clip_benchmark.txt` is not included by default. If you need zero-shot classification/retrieval evaluation, install it separately:

  ```
  pip install -r requirements/clip_benchmark.txt
  ```

### Optional Dependencies

- flash-attn 2.3.6 (for training chat models):

  ```
  pip install flash-attn==2.3.6 --no-build-isolation
  ```

  Or build from source:

  ```
  git clone https://github.com/Dao-AILab/flash-attention.git
  cd flash-attention
  git checkout v2.3.6
  python setup.py install
  ```

- mmcv-full 1.6.2 (for segmentation):

  ```
  pip install -U openmim
  mim install mmcv-full==1.6.2
  ```

- NVIDIA Apex (for segmentation):

  ```
  git clone https://github.com/NVIDIA/apex.git
  cd apex
  git checkout 2386a912164b0c5cfcd8be7a2b890fbac5607c82
  pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
      --config-settings "--build-option=--cpp_ext" \
      --config-settings "--build-option=--cuda_ext" .
  ```

  Note: If you encounter `ModuleNotFoundError: No module named 'fused_layer_norm_cuda'`, the Apex CUDA extensions were not installed. You can either uninstall Apex to fall back to PyTorch RMSNorm, or patch `setup.py` and rebuild.

## Chat Data Format

### Dataset Configuration

In InternVL 2.0 and 2.5, the organization of the training data is controlled by several key parameters that balance the distribution of datasets during training.

- Data augmentation: JPEG compression is applied conditionally. It is enabled for image datasets to enhance robustness and disabled for video datasets to maintain consistent frame quality.
- Maximum tile number: the parameter `n_max` controls the maximum number of tiles per dataset. Higher values (24–36) are used for multi-image or high-resolution data, lower values (6–12) for standard images, and 1 for videos.
- Repeat factor: the repeat factor `r` adjusts the sampling frequency of each dataset. Values below 1 reduce a dataset's weight, while values above 1 increase it. This balances training across tasks and prevents overfitting or underfitting.

### Meta File

In this document, we will detail the organization format of our conversation data. Currently, we use a JSON file to manage the meta information of all datasets. The format is as follows:

```json
{
  "your-custom-dataset-1": {
    "root": "path/to/the/image/",
    "annotation": "path/to/the/jsonl/annotation",
    "data_augment": false,
    "max_dynamic_patch": 12,
    "repeat_time": 1,
    "length": "number of samples in the dataset"
  },
  ...
}
```

Here, `root` is the root directory of the dataset, `annotation` is the path to the annotation file, `data_augment` indicates whether data augmentation is applied, `max_dynamic_patch` caps the number of image tiles for the dataset (the tile-number parameter described above), `repeat_time` is the number of times the dataset is repeated, and `length` is the number of samples in the dataset.
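To illustrate how these fields fit together, the sketch below loads a meta file and verifies one dataset's annotation against it. This is a minimal illustration, not part of the InternVL training code: the file name `meta.json` and the existence/length checks are assumptions made for this example.

```python
import json
import os

# Minimal sketch (hypothetical file name "meta.json"): check that each dataset's
# JSONL annotation matches the declared "length", resolving image paths via "root".
with open("meta.json") as f:
    meta = json.load(f)

for name, cfg in meta.items():
    num_samples = 0
    with open(cfg["annotation"]) as ann:
        for line in ann:
            sample = json.loads(line)
            # Single-image entries store a path relative to the dataset root.
            if isinstance(sample.get("image"), str):
                image_path = os.path.join(cfg["root"], sample["image"])
                assert os.path.exists(image_path), f"{name}: missing {image_path}"
            num_samples += 1
    # "length" in the meta file should equal the number of JSONL lines.
    assert num_samples == cfg["length"], f"{name}: length mismatch"
```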
For example, the meta entry for the ShareGPT4V dataset looks like this:

```json
{
  "sharegpt4v_instruct_gpt4-vision_cap100k": {
    "root": "playground/data/",
    "annotation": "playground/opensource/sharegpt4v_instruct_gpt4-vision_cap100k.jsonl",
    "data_augment": false,
    "max_dynamic_patch": 12,
    "repeat_time": 1,
    "length": 102025
  },
  ...
}
```

You can add multiple datasets to this JSON file in the same way. We currently support four types of datasets: pure text data, single-image data, multi-image (interleaved) data, and video data. We do not require all entries in a JSONL file to be of the same type, meaning your JSONL file can contain different types of data.

### Pure Text Data

For pure text data, we use a JSONL file to store the data. Each entry is a dictionary organized in the following format. Note that entries for pure text data must not contain an `image` field.

```json
{
  "id": 0,
  "conversations": [
    {"from": "human", "value": "user input"},
    {"from": "gpt", "value": "assistant output"},
    {"from": "human", "value": "user input"},
    {"from": "gpt", "value": "assistant output"}
  ]
}
```

Here, `id` is the unique identifier for the data, and `conversations` is a list containing multiple conversations. Each conversation is a dictionary with two key-value pairs: `from` and `value`. `from` indicates the speaker's identity, and `value` indicates the speaker's input. This data format is consistent with LLaVA 1.5, the difference being that LLaVA 1.5 uses JSON, while we use JSONL.

Example:

```json
{
  "id": 0,
  "conversations": [
    {
      "from": "human",
      "value": "Find the length of the hypotenuse given two sides\n\"side_length1 = 5, side_length2 = 6\""
    },
    {
      "from": "gpt",
      "value": "To find the length of the hypotenuse given the two sides of a right triangle, you can use the Pythagorean theorem. The theorem states that the square of the length of the hypotenuse is equal to the sum of the squares of the lengths of the two other sides.\n\nIn this case, the length of one side is 5, and the length of the other side is 6. Plugging these values into the formula, we get:\n\nhypotenuse^2 = 5^2 + 6^2\n             = 25 + 36\n             = 61\n\nTo get the length of the hypotenuse, take the square root of 61:\n\nhypotenuse = √61\n           = 7.81 (rounded to two decimal places)\n\nTherefore, the length of the hypotenuse is approximately 7.81 units."
    }
  ]
}
```

### Single-Image Data

For single-image data, we use a JSONL file to store the data. Each entry is a dictionary organized in the following format. Note that each entry for single-image data must contain an `image` field, which is a string. The path in the `image` field is relative to the `root` field; concatenating the `root` field and the `image` field gives the complete path to the image. It is recommended to include `width` and `height` information for each data sample for future use.

```json
{
  "id": 0,
  "image": "path/to/image.jpg",
  "width": 111,
  "height": 222,
  "conversations": [
    {"from": "human", "value": "<image>\nuser input"},
    {"from": "gpt", "value": "assistant output"},
    {"from": "human", "value": "user input"},
    {"from": "gpt", "value": "assistant output"}
  ]
}
```

Here, `<image>` indicates the position where the image is inserted, and the number of `<image>` placeholders should match the number of images. In single-image data, the `<image>` placeholder should appear only once across all conversations.
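These placeholder rules can be checked mechanically before training. The snippet below is a small sketch assuming an annotation file named `data.jsonl` in the format described above; the helper and its assertions are illustrative, not part of the InternVL codebase.

```python
import json

def check_sample(sample):
    # Illustrative consistency check for pure-text and single-image entries.
    n_placeholders = sum(
        turn["value"].count("<image>") for turn in sample["conversations"]
    )
    if "image" not in sample:
        # Pure text: no image field and no <image> placeholder.
        assert n_placeholders == 0, f"id {sample['id']}: unexpected <image> placeholder"
    elif isinstance(sample["image"], str):
        # Single image: exactly one <image> placeholder across all turns.
        assert n_placeholders == 1, (
            f"id {sample['id']}: expected 1 <image> placeholder, found {n_placeholders}"
        )

# "data.jsonl" is a hypothetical annotation file.
with open("data.jsonl") as f:
    for line in f:
        check_sample(json.loads(line))
```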
An example of single-image data:

```json
{
  "id": 0,
  "image": "images/00000000.jpg",
  "width": 897,
  "height": 1152,
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nCan you extract any readable text from the image?"
    },
    {
      "from": "gpt",
      "value": "Dares Wins Vol. 5 Tommy's Heroes Vol. 6: For Tomorrow Vol. 7: Closing Time miniseries. Clark Kent is being interviewed about Superman's connection to notorious killer Tommy Monaghan. Taking the conversation..."
    }
  ]
}
```

### Grounding / Detection Data

For grounding or detection data, we normalize each coordinate box to the range [0, 1000] using the width and height of the image, and round the normalized coordinates to integers:

```python
def normalize_coordinates(box, image_width, image_height):
    x1, y1, x2, y2 = box
    normalized_box = [
        round((x1 / image_width) * 1000),
        round((y1 / image_height) * 1000),
        round((x2 / image_width) * 1000),
        round((y2 / image_height) * 1000),
    ]
    return normalized_box
```

We organize the model's responses in the following format:

```
<ref>class name</ref><box>[[x1, y1, x2, y2], ...]</box>
```
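As a quick sanity check of this convention, the sketch below reuses `normalize_coordinates` from above on a made-up 427x640 image. The pixel box is invented for illustration, and `denormalize_coordinates` is a hypothetical helper that inverts the mapping for visualization; neither is part of the InternVL code.

```python
# Usage sketch with a made-up pixel-space box on a 427x640 image.
box = [104, 300, 238, 477]                      # hypothetical x1, y1, x2, y2 in pixels
norm = normalize_coordinates(box, 427, 640)
print(norm)                                     # [244, 469, 557, 745]

def denormalize_coordinates(norm_box, image_width, image_height):
    # Hypothetical inverse: map a [0, 1000]-normalized box back to pixel space.
    x1, y1, x2, y2 = norm_box
    return [
        round(x1 / 1000 * image_width),
        round(y1 / 1000 * image_height),
        round(x2 / 1000 * image_width),
        round(y2 / 1000 * image_height),
    ]

print(denormalize_coordinates(norm, 427, 640))  # [104, 300, 238, 477]
```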
An example of the RefCOCO dataset:

```json
{
  "id": 2324,
  "image": "COCO_train2014_000000581857.jpg",
  "width": 427,
  "height": 640,
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nPlease provide the bounding box coordinate of the region this sentence describes: the lady with the blue shirt"
    },
    {
      "from": "gpt",
      "value": "<ref>the lady with the blue shirt</ref><box>[[243, 469, 558, 746]]</box>"
    },
    {
      "from": "human",
      "value": "Please provide the bounding box coordinate of the region this sentence describes: woman in gray shirt facing camera on right"
    },
    {
      "from": "gpt",
      "value": "<ref>woman in gray shirt facing camera on right</ref><box>[[507, 409, 698, 740]]</box>"
    },
    {
      "from": "human",
      "value": "Please provide the bounding box coordinate of the region this sentence describes: a woman in glasses shops in an open air fruit market"
    },
    {
      "from": "gpt",
      "value": "<ref>a woman in glasses shops in an open air fruit market</ref><box>[[196, 422, 395, 708]]</box>"
    }
  ]
}
```

An example of the V3Det dataset:

```json
{"id": 78281, "image": "images/x00001541/000106464.jpg", "width": 800, "height": 800, "conversations": [{"from": "human", "value": "<image>\nPlease detect and label all objects in the following image and mark their positions."}, {"from": "gpt", "value": "Sure, I will detect and label all objects in the image and mark their positions.\n\n<ref>Bulletproof Helmet</ref><box>[[1, 2, 269, 235]]</box>\n<ref>Bulletproof Clothing</ref><box>[[650, 619, 990, 993]]</box>\n<ref>Gun Model</ref><box>[[32, 231, 977, 662]]</box>\n<ref>screw</ref><box>[[754, 376, 851, 429]]</box>\n<ref>handcuff</ref><box>[[698, 228, 931, 386]]</box>\n<ref>95 Type Assault Rifle</ref><box>[[39, 229, 983, 667]]</box>\n<ref>shield</ref><box>[[30, 597, 273, 993]]</box>\n<ref>telescope</ref><box>[[666, 38, 890, 217]]</box>\n<ref>Wireless Walkie-Talkie</ref><box>[[295, 2, 370, 226], [374, 0, 447, 226]]</box>\n<ref>bomb</ref><box>[[473, 61, 552, 181], [569, 61, 648, 183]]</box>\n<ref>weapon</ref><box>[[302, 617, 342, 993]]</box>\n<ref>vessel</ref><box>[[355, 653, 644, 991]]</box>\n<ref>artifact</ref><box>[[915, 0, 981, 294]]</box>\n"}]}
```

Here is the more readable version:

```json
{
  "id": 78281,
  "image": "images/x00001541/000106464.jpg",
  "width": 800,
  "height": 800,
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nPlease detect and label all objects in the following image and mark their positions."
    },
    {
      "from": "gpt",
      "value": "Sure, I will detect and label all objects in the image and mark their positions.\n\n<ref>Bulletproof Helmet</ref><box>[[1, 2, 269, 235]]</box>\n<ref>Bulletproof Clothing</ref><box>[[650, 619, 990, 993]]</box>\n<ref>Gun Model</ref><box>[[32, 231, 977, 662]]</box>\n<ref>screw</ref><box>[[754, 376, 851, 429]]</box>\n<ref>handcuff</ref><box>[[698, 228, 931, 386]]</box>\n<ref>95 Type Assault Rifle</ref><box>[[39, 229, 983, 667]]</box>\n<ref>shield</ref><box>[[30, 597, 273, 993]]</box>\n<ref>telescope</ref><box>[[666, 38, 890, 217]]</box>\n<ref>Wireless Walkie-Talkie</ref><box>[[295, 2, 370, 226], [374, 0, 447, 226]]</box>\n<ref>bomb</ref><box>[[473, 61, 552, 181], [569, 61, 648, 183]]</box>\n<ref>weapon</ref><box>[[302, 617, 342, 993]]</box>\n<ref>vessel</ref><box>[[355, 653, 644, 991]]</box>\n<ref>artifact</ref><box>[[915, 0, 981, 294]]</box>\n"
    }
  ]
}
```

### Multi-Image Data

For multi-image data, we use a JSONL file to store the data. Each entry is a dictionary organized in the following format. Note that each entry for multi-image data must contain an `image` field, which is a list of strings. Each element in the list is a path relative to the `root` field; concatenating the `root` field and each element gives the complete path to the images. It is recommended to include `width_list` and `height_list` information for each data sample for future use.

```json
{
  "id": 0,
  "image": ["path/to/image1.jpg", "path/to/image2.jpg", "path/to/image3.jpg"],
  "width_list": [111, 222, 333],
  "height_list": [111, 222, 333],
  "conversations": [
    {"from": "human", "value": "<image>\nuser input <image>\nuser input"},
    {"from": "gpt", "value": "assistant output"},
    {"from": "human", "value": "<image>\nuser input"},
    {"from": "gpt", "value": "assistant output"}
  ]
}
```

Here, `<image>` indicates the position where the images are inserted, and the number of `<image>` placeholders should match the number of images. In this example, the `image` field list contains three elements, so the `<image>` placeholder also needs to appear three times.

An example of multi-image data:

```json
{"id": 0, "image": ["cimages/multimages/16/5pc.png", "cimages/multimages/16/5pd.png", "cimages/multimages/16/1602207874_p5b.png", "cimages/multimages/16/5pe.png", "cimages/multimages/16/1473016381_p5a.png"], "height_list": [23, 22, 23, 41, 52], "width_list": [240, 240, 240, 240, 240], "conversations": [{"from": "human", "value": "Let F = {2, 5, 7, 9}\n\nLet G = {1, 4, 6, 8}\n\nWhich of the following is true?\nA. <image>\n\n\nB. <image>\n\n\nC. <image>\n\n\nD. <image>\n\n\nE. <image>\n\n\nAnswer with the option's letter from the given choices directly."}, {"from": "gpt", "value": "A"}]}
```

Here is the more readable version:

```json
{
  "id": 0,
  "image": [
    "cimages/multimages/16/5pc.png",
    "cimages/multimages/16/5pd.png",
    "cimages/multimages/16/1602207874_p5b.png",
    "cimages/multimages/16/5pe.png",
    "cimages/multimages/16/1473016381_p5a.png"
  ],
  "height_list": [23, 22, 23, 41, 52],
  "width_list": [240, 240, 240, 240, 240],
  "conversations": [
    {
      "from": "human",
      "value": "Let F = {2, 5, 7, 9}\n\nLet G = {1, 4, 6, 8}\n\nWhich of the following is true?\nA. <image>\n\n\nB. <image>\n\n\nC. <image>\n\n\nD. <image>\n\n\nE. <image>\n\n\nAnswer with the option's letter from the given choices directly."
    },
    {
      "from": "gpt",
      "value": "A"
    }
  ]
}
```

### Video Data

For video data, we use a JSONL file to store the data. Each entry is a dictionary organized in the following format. Note that each entry for video data must contain a `video` field, which is a string. The path in the `video` field is relative to the `root` field; concatenating the `root` field and the `video` field gives the complete path to the video.

{ "id": 0, "video": "path/to/video.mp4", "conversations": [ {"from": "human", "value": "