# InternVL Documentation

Release 2.0, LMDeploy Contributors

Last updated: May 29, 2025

## Get Started

### Installation

- Clone the repository:

  ```
  git clone https://github.com/OpenGVLab/InternVL.git
  cd InternVL
  ```

- Create and activate a conda environment:

  ```
  conda create -n internvl python=3.10
  conda activate internvl
  ```

- Install core dependencies:

  ```
  pip install -r requirements.txt
  ```

  By default, `requirements.txt` includes:

  - `-r requirements/internvl_chat.txt`
  - `-r requirements/streamlit_demo.txt`
  - `-r requirements/classification.txt`
  - `-r requirements/segmentation.txt`

  `clip_benchmark.txt` is not included by default. If you need zero-shot classification/retrieval evaluation, install it separately:

  ```
  pip install -r requirements/clip_benchmark.txt
  ```

### Optional Dependencies

- flash-attn 2.3.6 (for training chat models):

  ```
  pip install flash-attn==2.3.6 --no-build-isolation
  ```

  Or build from source:

  ```
  git clone https://github.com/Dao-AILab/flash-attention.git
  cd flash-attention
  git checkout v2.3.6
  python setup.py install
  ```

- mmcv-full 1.6.2 (for segmentation):

  ```
  pip install -U openmim
  mim install mmcv-full==1.6.2
  ```

- NVIDIA Apex (for segmentation):

  ```
  git clone https://github.com/NVIDIA/apex.git
  cd apex
  git checkout 2386a912164b0c5cfcd8be7a2b890fbac5607c82
  pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
      --config-settings "--build-option=--cpp_ext" \
      --config-settings "--build-option=--cuda_ext" .
  ```

  Note: If you encounter `ModuleNotFoundError: No module named 'fused_layer_norm_cuda'`, the Apex CUDA extensions were not installed. You can either uninstall Apex to fall back to PyTorch RMSNorm, or patch `setup.py` and rebuild.

## Chat Data Format

### Dataset Configuration

In InternVL 2.0 and 2.5, the organization of the training data is controlled by several key parameters that balance the distribution of datasets during training.

- Data augmentation: JPEG compression is applied conditionally. It is enabled for image datasets to enhance robustness and disabled for video datasets to maintain consistent frame quality.
- Maximum tile number: the parameter `n_max` controls the maximum number of tiles per dataset. Higher values (24–36) are used for multi-image or high-resolution data, lower values (6–12) for standard images, and 1 for videos.
- Repeat factor: the repeat factor `r` adjusts the sampling frequency of each dataset. Values below 1 reduce a dataset's weight, while values above 1 increase it. This balances training across tasks and prevents overfitting or underfitting.

### Meta File

In this document, we will detail the organization format of our conversation data. Currently, we use a JSON file to manage the meta information of all datasets. The format is as follows:

```json
{
  "your-custom-dataset-1": {
    "root": "path/to/the/image/",
    "annotation": "path/to/the/jsonl/annotation",
    "data_augment": false,
    "max_dynamic_patch": 12,
    "repeat_time": 1,
    "length": "number of samples in the dataset"
  },
  ...
}
```

Here, `root` is the root directory of the dataset, `annotation` is the path to the annotation file, `data_augment` indicates whether data augmentation is applied, `max_dynamic_patch` caps the number of image tiles for the dataset (the tile-number parameter described above), `repeat_time` is the number of times the dataset is repeated, and `length` is the number of samples in the dataset.
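To illustrate how these fields fit together, the sketch below loads a meta file and verifies one dataset's annotation against it. This is a minimal illustration, not part of the InternVL training code: the file name `meta.json` and the existence/length checks are assumptions made for this example.

```python
import json
import os

# Minimal sketch (hypothetical file name "meta.json"): check that each dataset's
# JSONL annotation matches the declared "length", resolving image paths via "root".
with open("meta.json") as f:
    meta = json.load(f)

for name, cfg in meta.items():
    num_samples = 0
    with open(cfg["annotation"]) as ann:
        for line in ann:
            sample = json.loads(line)
            # Single-image entries store a path relative to the dataset root.
            if isinstance(sample.get("image"), str):
                image_path = os.path.join(cfg["root"], sample["image"])
                assert os.path.exists(image_path), f"{name}: missing {image_path}"
            num_samples += 1
    # "length" in the meta file should equal the number of JSONL lines.
    assert num_samples == cfg["length"], f"{name}: length mismatch"
```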
For example, the meta entry for the ShareGPT4V dataset looks like this:

```json
{
  "sharegpt4v_instruct_gpt4-vision_cap100k": {
    "root": "playground/data/",
    "annotation": "playground/opensource/sharegpt4v_instruct_gpt4-vision_cap100k.jsonl",
    "data_augment": false,
    "max_dynamic_patch": 12,
    "repeat_time": 1,
    "length": 102025
  },
  ...
}
```

You can add multiple datasets to this JSON file in the same way. We currently support four types of datasets: pure text data, single-image data, multi-image (interleaved) data, and video data. We do not require all entries in a JSONL file to be of the same type, meaning your JSONL file can contain different types of data.

### Pure Text Data

For pure text data, we use a JSONL file to store the data. Each entry is a dictionary organized in the following format. Note that entries for pure text data must not contain an `image` field.

```json
{
  "id": 0,
  "conversations": [
    {"from": "human", "value": "user input"},
    {"from": "gpt", "value": "assistant output"},
    {"from": "human", "value": "user input"},
    {"from": "gpt", "value": "assistant output"}
  ]
}
```

Here, `id` is the unique identifier for the data, and `conversations` is a list containing multiple conversations. Each conversation is a dictionary with two key-value pairs: `from` and `value`. `from` indicates the speaker's identity, and `value` indicates the speaker's input. This data format is consistent with LLaVA 1.5, the difference being that LLaVA 1.5 uses JSON, while we use JSONL.

Example:

```json
{
  "id": 0,
  "conversations": [
    {
      "from": "human",
      "value": "Find the length of the hypotenuse given two sides\n\"side_length1 = 5, side_length2 = 6\""
    },
    {
      "from": "gpt",
      "value": "To find the length of the hypotenuse given the two sides of a right triangle, you can use the Pythagorean theorem. The theorem states that the square of the length of the hypotenuse is equal to the sum of the squares of the lengths of the two other sides.\n\nIn this case, the length of one side is 5, and the length of the other side is 6. Plugging these values into the formula, we get:\n\nhypotenuse^2 = 5^2 + 6^2\n             = 25 + 36\n             = 61\n\nTo get the length of the hypotenuse, take the square root of 61:\n\nhypotenuse = √61\n           = 7.81 (rounded to two decimal places)\n\nTherefore, the length of the hypotenuse is approximately 7.81 units."
    }
  ]
}
```

### Single-Image Data

For single-image data, we use a JSONL file to store the data. Each entry is a dictionary organized in the following format. Note that each entry for single-image data must contain an `image` field, which is a string. The path in the `image` field is relative to the `root` field; concatenating the `root` field and the `image` field gives the complete path to the image. It is recommended to include `width` and `height` information for each data sample for future use.

```json
{
  "id": 0,
  "image": "path/to/image.jpg",
  "width": 111,
  "height": 222,
  "conversations": [
    {"from": "human", "value": "<image>\nuser input"},
    {"from": "gpt", "value": "assistant output"},
    {"from": "human", "value": "user input"},
    {"from": "gpt", "value": "assistant output"}
  ]
}
```

Here, `<image>` indicates the position where the image is inserted, and the number of `<image>` placeholders should match the number of images. In single-image data, the `<image>` placeholder should appear only once across all conversations.
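These placeholder rules can be checked mechanically before training. The snippet below is a small sketch assuming an annotation file named `data.jsonl` in the format described above; the helper and its assertions are illustrative, not part of the InternVL codebase.

```python
import json

def check_sample(sample):
    # Illustrative consistency check for pure-text and single-image entries.
    n_placeholders = sum(
        turn["value"].count("<image>") for turn in sample["conversations"]
    )
    if "image" not in sample:
        # Pure text: no image field and no <image> placeholder.
        assert n_placeholders == 0, f"id {sample['id']}: unexpected <image> placeholder"
    elif isinstance(sample["image"], str):
        # Single image: exactly one <image> placeholder across all turns.
        assert n_placeholders == 1, (
            f"id {sample['id']}: expected 1 <image> placeholder, found {n_placeholders}"
        )

# "data.jsonl" is a hypothetical annotation file.
with open("data.jsonl") as f:
    for line in f:
        check_sample(json.loads(line))
```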
An example of single-image data:

```json
{
  "id": 0,
  "image": "images/00000000.jpg",
  "width": 897,
  "height": 1152,
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nCan you extract any readable text from the image?"
    },
    {
      "from": "gpt",
      "value": "Dares Wins Vol. 5 Tommy's Heroes Vol. 6: For Tomorrow Vol. 7: Closing Time miniseries. Clark Kent is being interviewed about Superman's connection to notorious killer Tommy Monaghan. Taking the conversation..."
    }
  ]
}
```

### Grounding / Detection Data

For grounding or detection data, we normalize each coordinate box to the range [0, 1000] using the width and height of the image, and round the normalized coordinates to integers:

```python
def normalize_coordinates(box, image_width, image_height):
    x1, y1, x2, y2 = box
    normalized_box = [
        round((x1 / image_width) * 1000),
        round((y1 / image_height) * 1000),
        round((x2 / image_width) * 1000),
        round((y2 / image_height) * 1000),
    ]
    return normalized_box
```

We organize the model's responses in the following format:

```
<ref>class name</ref><box>[[x1, y1, x2, y2], ...]</box>
```
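As a quick sanity check of this convention, the sketch below reuses `normalize_coordinates` from above on a made-up 427x640 image. The pixel box is invented for illustration, and `denormalize_coordinates` is a hypothetical helper that inverts the mapping for visualization; neither is part of the InternVL code.

```python
# Usage sketch with a made-up pixel-space box on a 427x640 image.
box = [104, 300, 238, 477]                      # hypothetical x1, y1, x2, y2 in pixels
norm = normalize_coordinates(box, 427, 640)
print(norm)                                     # [244, 469, 557, 745]

def denormalize_coordinates(norm_box, image_width, image_height):
    # Hypothetical inverse: map a [0, 1000]-normalized box back to pixel space.
    x1, y1, x2, y2 = norm_box
    return [
        round(x1 / 1000 * image_width),
        round(y1 / 1000 * image_height),
        round(x2 / 1000 * image_width),
        round(y2 / 1000 * image_height),
    ]

print(denormalize_coordinates(norm, 427, 640))  # [104, 300, 238, 477]
```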
An example of the RefCOCO dataset:

```json
{
  "id": 2324,
  "image": "COCO_train2014_000000581857.jpg",
  "width": 427,
  "height": 640,
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nPlease provide the bounding box coordinate of the region this sentence describes: the lady with the blue shirt"
    },
    {
      "from": "gpt",
      "value": "<ref>the lady with the blue shirt</ref><box>[[243, 469, 558, 746]]</box>"
    },
    {
      "from": "human",
      "value": "Please provide the bounding box coordinate of the region this sentence describes: woman in gray shirt facing camera on right"
    },
    {
      "from": "gpt",
      "value": "<ref>woman in gray shirt facing camera on right</ref><box>[[507, 409, 698, 740]]</box>"
    },
    {
      "from": "human",
      "value": "Please provide the bounding box coordinate of the region this sentence describes: a woman in glasses shops in an open air fruit market"
    },
    {
      "from": "gpt",
      "value": "<ref>a woman in glasses shops in an open air fruit market</ref><box>[[196, 422, 395, 708]]</box>"
    }
  ]
}
```

An example of the V3Det dataset:

```json
{"id": 78281, "image": "images/x00001541/000106464.jpg", "width": 800, "height": 800, "conversations": [{"from": "human", "value": "<image>\nPlease detect and label all objects in the following image and mark their positions."}, {"from": "gpt", "value": "Sure, I will detect and label all objects in the image and mark their positions.\n\n<ref>Bulletproof Helmet</ref><box>[[1, 2, 269, 235]]</box>\n<ref>Bulletproof Clothing</ref><box>[[650, 619, 990, 993]]</box>\n<ref>Gun Model</ref><box>[[32, 231, 977, 662]]</box>\n<ref>screw</ref><box>[[754, 376, 851, 429]]</box>\n<ref>handcuff</ref><box>[[698, 228, 931, 386]]</box>\n<ref>95 Type Assault Rifle</ref><box>[[39, 229, 983, 667]]</box>\n<ref>shield</ref><box>[[30, 597, 273, 993]]</box>\n<ref>telescope</ref><box>[[666, 38, 890, 217]]</box>\n<ref>Wireless Walkie-Talkie</ref><box>[[295, 2, 370, 226], [374, 0, 447, 226]]</box>\n<ref>bomb</ref><box>[[473, 61, 552, 181], [569, 61, 648, 183]]</box>\n<ref>weapon</ref><box>[[302, 617, 342, 993]]</box>\n<ref>vessel</ref><box>[[355, 653, 644, 991]]</box>\n<ref>artifact</ref><box>[[915, 0, 981, 294]]</box>\n"}]}
```

Here is the more readable version:

```json
{
  "id": 78281,
  "image": "images/x00001541/000106464.jpg",
  "width": 800,
  "height": 800,
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nPlease detect and label all objects in the following image and mark their positions."
    },
    {
      "from": "gpt",
      "value": "Sure, I will detect and label all objects in the image and mark their positions.\n\n<ref>Bulletproof Helmet</ref><box>[[1, 2, 269, 235]]</box>\n<ref>Bulletproof Clothing</ref><box>[[650, 619, 990, 993]]</box>\n<ref>Gun Model</ref><box>[[32, 231, 977, 662]]</box>\n<ref>screw</ref><box>[[754, 376, 851, 429]]</box>\n<ref>handcuff</ref><box>[[698, 228, 931, 386]]</box>\n<ref>95 Type Assault Rifle</ref><box>[[39, 229, 983, 667]]</box>\n<ref>shield</ref><box>[[30, 597, 273, 993]]</box>\n<ref>telescope</ref><box>[[666, 38, 890, 217]]</box>\n<ref>Wireless Walkie-Talkie</ref><box>[[295, 2, 370, 226], [374, 0, 447, 226]]</box>\n<ref>bomb</ref><box>[[473, 61, 552, 181], [569, 61, 648, 183]]</box>\n<ref>weapon</ref><box>[[302, 617, 342, 993]]</box>\n<ref>vessel</ref><box>[[355, 653, 644, 991]]</box>\n<ref>artifact</ref><box>[[915, 0, 981, 294]]</box>\n"
    }
  ]
}
```

### Multi-Image Data

For multi-image data, we use a JSONL file to store the data. Each entry is a dictionary organized in the following format. Note that each entry for multi-image data must contain an `image` field, which is a list of strings. Each element in the list is a path relative to the `root` field; concatenating the `root` field and each element gives the complete path to the images. It is recommended to include `width_list` and `height_list` information for each data sample for future use.

```json
{
  "id": 0,
  "image": ["path/to/image1.jpg", "path/to/image2.jpg", "path/to/image3.jpg"],
  "width_list": [111, 222, 333],
  "height_list": [111, 222, 333],
  "conversations": [
    {"from": "human", "value": "<image>\nuser input <image>\nuser input"},
    {"from": "gpt", "value": "assistant output"},
    {"from": "human", "value": "<image>\nuser input"},
    {"from": "gpt", "value": "assistant output"}
  ]
}
```

Here, `<image>` indicates the position where the images are inserted, and the number of `<image>` placeholders should match the number of images. In this example, the `image` field list contains three elements, so the `<image>` placeholder also needs to appear three times.

An example of multi-image data:

```json
{"id": 0, "image": ["cimages/multimages/16/5pc.png", "cimages/multimages/16/5pd.png", "cimages/multimages/16/1602207874_p5b.png", "cimages/multimages/16/5pe.png", "cimages/multimages/16/1473016381_p5a.png"], "height_list": [23, 22, 23, 41, 52], "width_list": [240, 240, 240, 240, 240], "conversations": [{"from": "human", "value": "Let F = {2, 5, 7, 9}\n\nLet G = {1, 4, 6, 8}\n\nWhich of the following is true?\nA. <image>\n\n\nB. <image>\n\n\nC. <image>\n\n\nD. <image>\n\n\nE. <image>\n\n\nAnswer with the option's letter from the given choices directly."}, {"from": "gpt", "value": "A"}]}
```

Here is the more readable version:

```json
{
  "id": 0,
  "image": [
    "cimages/multimages/16/5pc.png",
    "cimages/multimages/16/5pd.png",
    "cimages/multimages/16/1602207874_p5b.png",
    "cimages/multimages/16/5pe.png",
    "cimages/multimages/16/1473016381_p5a.png"
  ],
  "height_list": [23, 22, 23, 41, 52],
  "width_list": [240, 240, 240, 240, 240],
  "conversations": [
    {
      "from": "human",
      "value": "Let F = {2, 5, 7, 9}\n\nLet G = {1, 4, 6, 8}\n\nWhich of the following is true?\nA. <image>\n\n\nB. <image>\n\n\nC. <image>\n\n\nD. <image>\n\n\nE. <image>\n\n\nAnswer with the option's letter from the given choices directly."
    },
    {
      "from": "gpt",
      "value": "A"
    }
  ]
}
```

### Video Data

For video data, we use a JSONL file to store the data. Each entry is a dictionary organized in the following format. Note that each entry for video data must contain a `video` field, which is a string. The path in the `video` field is relative to the `root` field; concatenating the `root` field and the `video` field gives the complete path to the video.

{ "id": 0, "video": "path/to/video.mp4", "conversations": [ {"from": "human", "value": "