# VLM
We follow [InternVL2](https://internvl.readthedocs.io/en/latest/internvl2.0/evaluation.html) to evaluate performance on MME, MMBench, MMMU, MM-Vet, MathVista, and MMVP.
## Data preparation
Please follow the [InternVL2 guide](https://internvl.readthedocs.io/en/latest/get_started/eval_data_preparation.html) to prepare the corresponding data, then link the data under `vlm` (a linking sketch follows the directory tree below).
The final directory structure is:
```shell
data
├── MathVista
├── mmbench
├── mme
├── MMMU
├── mm-vet
└── MMVP
```
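A minimal linking sketch, assuming the prepared data lives at a path of your choosing and that the link target is `vlm/data`, matching the tree above; both paths are placeholders:
```shell
# Placeholder source path; substitute wherever you prepared the
# InternVL2 evaluation data.
ln -s /path/to/internvl2_eval_data ./vlm/data
```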
## Evaluation
Directly run `scripts/eval/run_eval_vlm.sh` to evaluate the different benchmarks. The output will be saved in `$output_path`. A sketch of a full invocation follows the list.
- Set `$model_path` and `$output_path` to the checkpoint path and the log directory, respectively.
- Increase `GPUS` to run the evaluation faster.
- For MMBench, please use the official [evaluation server](https://mmbench.opencompass.org.cn/mmbench-submission).
- For MM-Vet, please use the official [evaluation server](https://huggingface.co/spaces/whyu/MM-Vet_Evaluator).
- For MathVista, please set `$openai_api_key` in `scripts/eval/run_eval_vlm.sh` and `your_api_url` in `eval/vlm/eval/mathvista/utilities.py`. The default GPT version is `gpt-4o-2024-11-20`.
- For MMMU, we use CoT in the report, which improves accuracy by about 2%. For evaluating open-ended answers, we use GPT-4o as the judge.
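A minimal sketch of a full run, assuming the script picks these variables up from the environment (they may instead be defined at the top of `scripts/eval/run_eval_vlm.sh`; the names follow the list above):
```shell
export model_path=/path/to/checkpoint   # placeholder paths
export output_path=/path/to/logs
export GPUS=8                           # more GPUs, faster evaluation
export openai_api_key=sk-...            # needed for the MathVista GPT judge
bash scripts/eval/run_eval_vlm.sh
```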
# GenEval
We modify the code in [GenEval](https://github.com/djghosh13/geneval/tree/main) for faster evaluation.
## Setup
Install the following dependencies:
```shell
pip install open-clip-torch
pip install clip-benchmark
pip install --upgrade setuptools
sudo pip install -U openmim
sudo mim install mmengine mmcv-full==1.7.2
git clone https://github.com/open-mmlab/mmdetection.git
cd mmdetection; git checkout 2.x
pip install -v -e .
```
Download Detector:
```shell
cd ./eval/gen/geneval
mkdir model
bash ./evaluation/download_models.sh ./model
```
## Evaluation
Directly run `scripts/eval/run_geneval.sh` to evaluate GenEval. The output will be saved in `$output_path`. A sketch of a full invocation follows the list.
- Set `$model_path` and `$output_path` to the checkpoint path and the log directory, respectively.
- Set `metadata_file` to `./eval/gen/geneval/prompts/evaluation_metadata.jsonl` to use the original GenEval prompts.
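A minimal sketch under the same assumption as above (environment variables vs. in-script variables; check the script for the exact names):
```shell
export model_path=/path/to/checkpoint   # placeholder paths
export output_path=/path/to/logs
# Use the original GenEval prompts.
export metadata_file=./eval/gen/geneval/prompts/evaluation_metadata.jsonl
bash scripts/eval/run_geneval.sh
```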
# WISE
We modify the code in [WISE](https://github.com/PKU-YuanGroup/WISE/tree/main) for faster evaluation.
## Evaluation
Directly run `scripts/eval/run_wise.sh` to evaluate WISE. The output will be saved in `$output_path`. A sketch of a full invocation follows the list.
- Set `$model_path` and `$output_path` to the checkpoint path and the log directory, respectively.
- Set `$openai_api_key` in `scripts/eval/run_wise.sh` and `your_api_url` in `eval/gen/wise/gpt_eval_mp.py`. The default GPT version is `gpt-4o-2024-11-20`.
- Use `think` to enable thinking mode.
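A minimal sketch; the `sed` line assumes `your_api_url` appears literally as a placeholder string in the file, and the endpoint shown is illustrative:
```shell
# Patch in your GPT endpoint (illustrative URL) and key.
sed -i 's|your_api_url|https://api.openai.com/v1|' eval/gen/wise/gpt_eval_mp.py
export openai_api_key=sk-...            # placeholder key
export model_path=/path/to/checkpoint
export output_path=/path/to/logs
bash scripts/eval/run_wise.sh           # pass `think` as the script expects to enable thinking mode
```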
# GEdit-Bench
We adopt the code in [GEdit-Bench](https://github.com/stepfun-ai/Step1X-Edit/blob/main/GEdit-Bench/EVAL.md) for evaluation.
## Evaluation
Modify the model path, the output path, the API key, and the API URL in `scripts/eval/run_gedit.sh`. Then run:
```shell
bash scripts/eval/run_gedit.sh
```
The GPT version for evaluation is `gpt-4.1-2025-04-14`.
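The edits boil down to four values at the top of the script; the names below are illustrative, so match them to the actual variables in `scripts/eval/run_gedit.sh`:
```shell
# Illustrative variable names and placeholder values.
model_path=/path/to/checkpoint
output_path=/path/to/logs
openai_api_key=sk-...
api_url=https://api.openai.com/v1   # your GPT endpoint
```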
# IntelligentBench
TBD
# KRIS
We modify the code in [KRIS-Bench](https://github.com/mercurystraw/Kris_Bench) for faster evaluation.
## Data preparation
Please download the benchmark data from [KRIS-Bench](https://huggingface.co/datasets/Liang0223/KRIS_Bench) and place it in the `KRIS_Bench` directory (a download sketch follows the directory tree below).
The final directory structure is:
```shell
KRIS_Bench
├── abstract_reasoning
├── anomaly_correction
├── biology
├── chemistry
├── color_change
├── count_change
├── geography
├── humanities
├── mathematics
├── medicine
├── multi-element_composition
├── multi-instruction_execution
├── part_completion
├── physics
├── position_movement
├── practical_knowledge
├── rule-based_reasoning
├── size_adjustment
├── temporal_prediction
└── viewpoint_change
```
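One way to fetch the data is the Hugging Face CLI (assuming `huggingface_hub` is installed); the target directory matches the tree above:
```shell
# Download the dataset repo into ./KRIS_Bench.
huggingface-cli download Liang0223/KRIS_Bench --repo-type dataset --local-dir ./KRIS_Bench
```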
## Evaluation
Directly run `scripts/eval/run_kris.sh` to evaluate KRIS-Bench. The output will be saved in `$output_path`. A sketch of a full invocation follows the list.
- Set `$model_path` and `$output_path` to the checkpoint path and the log directory, respectively.
- Set `$openai_api_key` in `scripts/eval/run_kris.sh` and `your_api_url` in `eval/gen/kris/metrics_xx.py`. The default GPT version is `gpt-4o-2024-11-20`.
- Use `think` to enable thinking mode.
- We set `cfg_text_scale=4` and `cfg_img_scale=1.5` by default. Additionally, `cfg_renorm_min=0` is specified for CFG Renorm.
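A minimal sketch, assuming the CFG settings are exposed as variables alongside the paths (check `scripts/eval/run_kris.sh` for the exact names):
```shell
export model_path=/path/to/checkpoint
export output_path=/path/to/logs
export openai_api_key=sk-...
export cfg_text_scale=4     # defaults from the list above
export cfg_img_scale=1.5
export cfg_renorm_min=0     # CFG Renorm
bash scripts/eval/run_kris.sh
```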
## Results
Category, meta-category, and overall average scores (100-point scale). VC, VQ, IF, and KP denote Visual Consistency, Visual Quality, Instruction Following, and Knowledge Plausibility; a dash marks metrics not reported for that category.

| Category | VC | VQ | IF | KP | AVG |
| --- | --- | --- | --- | --- | --- |
| Attribute Perception | 76.64 | 74.45 | 41.73 | – | 64.27 |
| Spatial Perception | 70.25 | 80.00 | 37.00 | – | 62.42 |
| Temporal Prediction | 36.49 | 61.82 | 29.05 | – | 42.45 |
| Social Science | 76.20 | 78.80 | 37.00 | 29.60 | 55.40 |
| Natural Science | 69.59 | 84.03 | 40.27 | 30.15 | 56.01 |
| Logical Reasoning | 80.17 | 85.67 | 26.33 | 18.00 | 52.54 |
| Instruction Decomposition | 40.17 | 69.50 | 42.00 | – | 50.56 |

| Meta-category | AVG |
| --- | --- |
| Factual Knowledge | 60.26 |
| Conceptual Knowledge | 55.86 |
| Procedural Knowledge | 51.69 |
| Overall | 56.21 |
## Results w/ CoT
Category, meta-category, and overall average scores (100-point scale):

| Category | VC | VQ | IF | KP | AVG |
| --- | --- | --- | --- | --- | --- |
| Attribute Perception | 75.09 | 74.00 | 53.18 | – | 67.42 |
| Spatial Perception | 78.75 | 87.25 | 39.00 | – | 68.33 |
| Temporal Prediction | 48.31 | 81.08 | 46.62 | – | 58.67 |
| Social Science | 80.40 | 79.40 | 51.60 | 42.80 | 63.55 |
| Natural Science | 67.68 | 82.95 | 52.10 | 42.88 | 61.40 |
| Logical Reasoning | 62.83 | 79.67 | 28.33 | 21.67 | 48.12 |
| Instruction Decomposition | 47.83 | 66.83 | 36.00 | – | 50.22 |

| Meta-category | AVG |
| --- | --- |
| Factual Knowledge | 66.18 |
| Conceptual Knowledge | 61.92 |
| Procedural Knowledge | 49.02 |
| Overall | 60.18 |
# RISE
We modify the code in [RISEBench](https://github.com/PhoenixZ810/RISEBench) for faster evaluation.
## Data preparation
Please download the benchmark data from [RISEBench](https://huggingface.co/datasets/PhoenixZ/RISEBench) and place it in the `data` directory (a download sketch follows the directory tree below).
The final directory structure is:
```shell
data
├── datav2_total_w_subtask.json
├── causal_reasoning_images
├── logical_reasoning_images
├── spatial_reasoning_images
└── temporal_reasoning_images
```
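As with KRIS-Bench, the Hugging Face CLI is one way to fetch the data:
```shell
# Download the dataset repo into ./data.
huggingface-cli download PhoenixZ/RISEBench --repo-type dataset --local-dir ./data
```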
## Evaluation
Directly run `scripts/eval/run_rise.sh` to evaluate RISEBench. The output will be saved in `$output_path`. A note on the CFG scale used for the results tables follows the list.
- Set `$model_path` and `$output_path` to the checkpoint path and the log directory, respectively.
- Set `$openai_api_key` in `scripts/eval/run_rise.sh` and `your_api_url` in `eval/gen/rise/gpt_eval.py`. The default GPT version is `gpt-4.1-2025-04-14`.
- Use `think` to enable thinking mode.
- We set `cfg_text_scale=4` and `cfg_img_scale=2.0` by default. Additionally, `cfg_renorm_min=0` is specified for CFG Renorm.
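Note that the first results table below was produced with `cfg_img_scale=1.5` rather than the default `2.0`. A sketch of overriding it, assuming the script reads the variable from the environment:
```shell
export cfg_img_scale=1.5   # reproduce the first table below; the script default is 2.0
bash scripts/eval/run_rise.sh
```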
## Results (`cfg_img_scale=1.5`)
A dash marks rows without an accuracy score.

| Category | Score-Origin | Score-Percentage | Accuracy |
| --- | --- | --- | --- |
| Overall | 2.537778 | 38.444444 | 0.061111 |
| Temporal | 2.654118 | 41.352941 | 0.023529 |
| Causal | 2.788889 | 44.722222 | 0.055556 |
| Spatial | 3.452000 | 61.300000 | 0.140000 |
| Logical | 1.080000 | 2.000000 | 0.011765 |
| Overall_Reasoning | 2.458333 | 36.458333 | – |
| Overall_ApprConsistency | 3.141643 | 53.541076 | – |
| Overall_VisualPlausibility_total | 3.920000 | 73.000000 | – |
| Temporal_Reasoning | 2.588235 | 39.705882 | – |
| Temporal_Consistency | 3.250000 | 56.250000 | – |
| Temporal_Quality | 3.505882 | 62.647059 | – |
| Causal_Reasoning | 2.733333 | 43.333333 | – |
| Causal_Consistency | 3.579545 | 64.488636 | – |
| Causal_Quality | 3.688889 | 67.222222 | – |
| Spatial_Reasoning | 3.300000 | 57.500000 | – |
| Spatial_Consistency | 3.330000 | 58.250000 | – |
| Spatial_Quality | 4.480000 | 87.000000 | – |
| Logical_Reasoning | 1.047059 | 1.176471 | – |
| Logical_Consistency | 2.364706 | 34.117647 | – |
| Temp-Life Progression | 2.757895 | 43.947368 | 0.000000 |
| Temp-Material Progression | 2.500000 | 37.500000 | 0.021739 |
| Temp-Environmental Cycles | 3.061538 | 51.538462 | 0.076923 |
| Temp-Societal Transformation | 2.628571 | 40.714286 | 0.000000 |
| Causal-Structural Deformation | 2.766667 | 44.166667 | 0.055556 |
| Causal-State Transition | 3.112000 | 52.800000 | 0.080000 |
| Causal-Chemical and Biological Transformation | 2.325000 | 33.125000 | 0.062500 |
| Causal-Physics Manifestation | 2.800000 | 45.000000 | 0.000000 |
| Spa-Component Assembly | 3.434783 | 60.869565 | 0.043478 |
| Spa-Object Arrangement | 2.733333 | 43.333333 | 0.000000 |
| Spa-Viewpoint Generation | 3.629630 | 65.740741 | 0.222222 |
| Spa-Structural Inference | 4.066667 | 76.666667 | 0.133333 |
| Spa-Layout Reasoning | 3.234783 | 55.869565 | 0.217391 |
| Logic-Pattern Prediction | 1.035484 | 0.887097 | 0.000000 |
| Logic-Mathematical Derivation | 1.350000 | 8.750000 | 0.071429 |
| Logic-Puzzle Solving | 1.020000 | 0.500000 | 0.000000 |
## Results w/ CoT

| Category | Score-Origin | Score-Percentage | Accuracy |
| --- | --- | --- | --- |
| Overall | 2.933333 | 48.333333 | 0.119444 |
| Temporal | 3.336471 | 58.411765 | 0.058824 |
| Causal | 3.608889 | 65.222222 | 0.177778 |
| Spatial | 3.492000 | 62.300000 | 0.210000 |
| Logical | 1.157647 | 3.941176 | 0.011765 |
| Overall_Reasoning | 2.836111 | 45.902778 | – |
| Overall_ApprConsistency | 3.951841 | 73.796034 | – |
| Overall_VisualPlausibility_total | 4.203636 | 80.090909 | – |
| Temporal_Reasoning | 3.188235 | 54.705882 | – |
| Temporal_Consistency | 4.225000 | 80.625000 | – |
| Temporal_Quality | 4.200000 | 80.000000 | – |
| Causal_Reasoning | 3.533333 | 63.333333 | – |
| Causal_Consistency | 4.386364 | 84.659091 | – |
| Causal_Quality | 4.100000 | 77.500000 | – |
| Spatial_Reasoning | 3.350000 | 58.750000 | – |
| Spatial_Consistency | 4.300000 | 82.500000 | – |
| Spatial_Quality | 4.300000 | 82.500000 | – |
| Logical_Reasoning | 1.141176 | 3.529412 | – |
| Logical_Consistency | 2.835294 | 45.882353 | – |
| Temp-Life Progression | 3.526316 | 63.157895 | 0.052632 |
| Temp-Material Progression | 3.208696 | 55.217391 | 0.086957 |
| Temp-Environmental Cycles | 3.584615 | 64.615385 | 0.000000 |
| Temp-Societal Transformation | 3.200000 | 55.000000 | 0.000000 |
| Causal-Structural Deformation | 3.750000 | 68.750000 | 0.138889 |
| Causal-State Transition | 3.792000 | 69.800000 | 0.320000 |
| Causal-Chemical and Biological Transformation | 3.512500 | 62.812500 | 0.062500 |
| Causal-Physics Manifestation | 2.984615 | 49.615385 | 0.153846 |
| Spa-Component Assembly | 3.652174 | 66.304348 | 0.304348 |
| Spa-Object Arrangement | 2.700000 | 42.500000 | 0.000000 |
| Spa-Viewpoint Generation | 3.800000 | 70.000000 | 0.259259 |
| Spa-Structural Inference | 3.680000 | 67.000000 | 0.266667 |
| Spa-Layout Reasoning | 3.260870 | 56.521739 | 0.130435 |
| Logic-Pattern Prediction | 1.064516 | 1.612903 | 0.000000 |
| Logic-Mathematical Derivation | 1.707143 | 17.678571 | 0.071429 |
| Logic-Puzzle Solving | 1.037500 | 0.937500 | 0.000000 |
# ImgEdit
We modify the code in [ImgEdit](https://github.com/PKU-YuanGroup/ImgEdit) for faster evaluation.
## Data preparation
Please download the benchmark data from [ImgEdit-Bench](https://huggingface.co/datasets/sysuyy/ImgEdit/blob/main/Benchmark.tar) and place it in the `Benchmark` directory (a download sketch follows the directory tree below).
The final directory structure is:
```shell
Benchmark
├── hard
├── multiturn
└── singleturn
├── judge_prompt.json
├── singleturn.json
├── animal
├── architecture
├── clothes
├── compose
├── daily object
├── for_add
├── human
├── style
└── transport
```
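The benchmark ships as a single tarball; one way to fetch and unpack it (assuming `huggingface_hub` is installed):
```shell
# Download Benchmark.tar from the dataset repo and unpack it in place.
huggingface-cli download sysuyy/ImgEdit Benchmark.tar --repo-type dataset --local-dir .
tar -xf Benchmark.tar
```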
## Evaluation
Directly run `scripts/eval/run_imgedit.sh` to evaluate ImgEdit-Bench. The output will be saved in `$output_path`. A sketch of a full invocation follows the list.
- Set `$model_path` and `$output_path` to the checkpoint path and the log directory, respectively.
- Set `$openai_api_key` in `scripts/eval/run_imgedit.sh` and `your_api_url` in `eval/gen/imgedit/basic_bench.py`. The default GPT version is `gpt-4o-2024-11-20`.
- We set `cfg_text_scale=4` and `cfg_img_scale=1.5` by default. Additionally, `cfg_renorm_min=0` is specified for CFG Renorm.
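The invocation mirrors the KRIS-Bench sketch above; only the script and the file containing the `your_api_url` placeholder change (the endpoint URL is illustrative):
```shell
sed -i 's|your_api_url|https://api.openai.com/v1|' eval/gen/imgedit/basic_bench.py
export openai_api_key=sk-...
export model_path=/path/to/checkpoint
export output_path=/path/to/logs
bash scripts/eval/run_imgedit.sh
```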
## Results

| Category | Score |
| --- | --- |
| background | 3.28 |
| adjust | 3.23 |
| style | 4.26 |
| extract | 1.48 |
| remove | 2.99 |
| add | 3.45 |
| replace | 3.76 |
| compose | 3.18 |
| action | 4.38 |
| overall | 3.28 |