---
license: mit
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: visual-question-answering
---

This repository contains the model presented in [UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning](https://huggingface.co/papers/2503.21620).

Project page: https://github.com/lll6gg/UI-R1

New version: [UI-R1-E-3B](https://huggingface.co/LZXzju/Qwen2.5-VL-3B-UI-R1-E)

## Benchmark 1: ScreenSpotV2

| ScreenSpotV2   | inference mode | Mobile-T | Mobile-I | Desktop-T | Desktop-I | Web-T    | Web-I    | Avg↑ / Len↓       |
| -------------- | -------------- | -------- | -------- | --------- | --------- | -------- | -------- | ----------------- |
| OS-ATLAS-7B    | w/o thinking   | 95.2     | 75.8     | 90.7      | 63.6      | 90.6     | 77.3     | 84.1 / -          |
| UI-TARS-7B     | w/o thinking   | 95.2     | 79.1     | 90.7      | 68.6      | 90.6     | 78.3     | 84.7 / -          |
| UI-R1-3B (v1)  | w/ thinking    | 96.2     | **84.3** | 92.3      | 63.6      | 89.2     | 75.4     | 85.4 / 67         |
| GUI-R1-3B      | w/ thinking    | 97.6     | 78.2     | 94.3      | 64.3      | 91.0     | 72.4     | 85.0 / 80         |
| UI-R1-3B (v2)  | w/ thinking    | 97.6     | 79.6     | 92.3      | 67.9      | 88.9     | 77.8     | 85.8 / 60         |
| **UI-R1-E-3B** | w/o thinking   | **98.2** | 83.9     | **94.8**  | **75.0**  | **93.2** | **83.7** | **89.5** / **28** |

## Benchmark 2: ScreenSpot-Pro

| ScreenSpot-Pro | inference mode | Average Length↓ | Average Accuracy↑ |
| -------------- | -------------- | --------------- | ----------------- |
| UGround-7B     | w/o thinking   | -               | 16.5              |
| OS-ATLAS-7B    | w/o thinking   | -               | 18.9              |
| UI-R1-3B (v1)  | w/ thinking    | 102             | 17.8              |
| GUI-R1-3B      | w/ thinking    | 114             | 26.6              |
| UI-R1-3B (v2)  | w/ thinking    | 129             | 29.8              |
| **UI-R1-E-3B** | w/o thinking   | **28**          | **33.5**          |

## Leaderboard: UI-I2E-Bench

| Model          | ScreenSpot | UI-I2E-Bench Avg | ScreenSpot-Pro | Avg  |
| :------------: | :--------: | :--------------: | :------------: | :--: |
| UI-TARS-1.5-7B | 88.1       | 73.2             | 42.2           | 67.8 |
| Uground-V1-72B | 89.7       | 76.3             | 34.3           | 66.8 |
| UI-TARS-72B    | 88.4       | 73.7             | 38.1           | 66.7 |
| **UI-R1-E-3B** | 89.2       | 69.1             | 33.5           | 63.9 |
| Uground-V1-7B  | 87.1       | 70.3             | 31.1           | 62.8 |
| InfiGUI-R1     | 87.5       | 69.7             | 29.6           | 62.3 |
| UI-TARS-7B     | 89.5       | 61.4             | 35.7           | 62.2 |
| Qwen2.5-VL-72B | 87.1       | 51.4             | 43.6           | 60.7 |
| UI-I2E-VLM-7B  | 82.5       | 69.5             | 23.6           | 58.5 |
| UI-TARS-2B     | 82.3       | 62               | 27.7           | 57.3 |
| Qwen2.5-VL-7B  | 84.7       | 53.8             | 29             | 55.8 |
| OmniParser-V2  | 72         | 54.8             | 39.6           | 55.5 |
| Uground-V1-2B  | 78.8       | 57.4             | 26.6           | 54.3 |
| OS-Atlas-7B    | 82.5       | 58.6             | 18.9           | 53.3 |
| **UI-R1-3B**   | 83.3       | 58.5             | 17.8           | 53.2 |
| UGround-7B     | 74.1       | 54.2             | 16.5           | 48.3 |
| UI-I2E-VLM-4B  | 70.4       | 53.4             | 12.2           | 45.3 |
| OmniParser     | 73.9       | 53.1             | 8.3            | 45.1 |
| ShowUI-2B      | 76.8       | 41.5             | 7.7            | 42   |
| Qwen2.5-VL-3B  | 55.5       | 41.7             | 23.9           | 41.3 |
| Aguvis-7B      | 84.4       | 53.2             | 22.9           | 40.4 |
| OS-Atlas-4B    | 70.1       | 44.3             | 3.7            | 39.4 |
| Qwen2-VL-7B    | 42.6       | 48.7             | 1.6            | 31   |
| Seeclick       | 55.8       | 26.4             | 1.1            | 27.8 |
| InternVL2-4B   | 4.2        | 0.9              | 0.3            | 1.8  |

## Evaluation Code for GUI Grounding
1. Generation for UI-R1-E-3B:

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

# args.model_path, ori_processor_path, task_prompt, image_path and rank
# are provided by the surrounding evaluation script.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    args.model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cpu",
)
model = model.to(torch.device(rank))  # move to the GPU assigned to this process
model = model.eval()
processor = AutoProcessor.from_pretrained(ori_processor_path)

question_template = (
    f"In this UI screenshot, I want to perform the command '{task_prompt}'.\n"
    "Please provide the action to perform (enumerate in ['click', 'scroll']) and the coordinate where the cursor is moved to (integer) if click is performed.\n"
    "Output the thinking process in <think> </think> and final answer in <answer> </answer> tags."
    "The output answer format should be as follows:\n"
    "<think> ... </think> <answer>[{'action': enum['click', 'scroll'], 'coordinate': [x, y]}]</answer>\n"
    "Please strictly follow the format."
)
query = '\n' + question_template

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": query},
        ],
    }
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)  # inputs must be on the same device as the model

generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
response = response[0]
pred_coord, _ = extract_coord(response)
```

2. Rescale the predicted coordinate back to the original resolution. The model predicts on the image after `smart_resize` (in particular, images with more than 12845056 pixels are downscaled), so the predicted pixel coordinate must be mapped back to the original image:

```python
from PIL import Image

image = Image.open(image_path)
origin_width, origin_height = image.size
resized_height, resized_width = smart_resize(origin_height, origin_width, max_pixels=12845056)
scale_x = origin_width / resized_width
scale_y = origin_height / resized_height
pred_coord[0] = int(pred_coord[0] * scale_x)
pred_coord[1] = int(pred_coord[1] * scale_y)
```

The function `smart_resize` is from Qwen2-VL:

```python
import math


def smart_resize(
    height: int, width: int, factor: int = 28, min_pixels: int = 56 * 56, max_pixels: int = 14 * 14 * 4 * 1280
):
    """Rescales the image so that the following conditions are met:

    1. Both dimensions (height and width) are divisible by 'factor'.
    2. The total number of pixels is within the range ['min_pixels', 'max_pixels'].
    3. The aspect ratio of the image is maintained as closely as possible.
    """
    if height < factor or width < factor:
        raise ValueError(f"height:{height} or width:{width} must be larger than factor:{factor}")
    elif max(height, width) / min(height, width) > 200:
        raise ValueError(
            f"absolute aspect ratio must be smaller than 200, got {max(height, width) / min(height, width)}"
        )
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar
```
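The snippet in step 1 calls an `extract_coord` helper that is not defined in this card. Below is a minimal sketch of such a parser, assuming the model follows the `<answer>[{'action': ..., 'coordinate': [x, y]}]</answer>` format requested in the prompt; the regex and the `(coordinate, found)` return convention are assumptions inferred from the call `pred_coord, _ = extract_coord(response)`, not the authors' implementation.

```python
import re
from typing import List, Optional, Tuple


def extract_coord(response: str) -> Tuple[Optional[List[int]], bool]:
    """Hypothetical parser for the predicted [x, y] coordinate.

    Looks for a fragment such as "'coordinate': [123, 456]" in the model
    response and returns ([x, y], True) on success, or (None, False) if no
    coordinate is found.
    """
    match = re.search(r"'coordinate':\s*\[\s*(\d+)\s*,\s*(\d+)\s*\]", response)
    if match:
        return [int(match.group(1)), int(match.group(2))], True
    return None, False
```

After rescaling in step 2, the grounding metric is typically computed by checking whether `pred_coord` falls inside the ground-truth bounding box of the target element.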