Qwen2.5-VL-7B GUI Agent for OSWorld-Verified & GDPval

A fine-tuned Qwen2.5-VL-7B-Instruct model optimized for GUI grounding and desktop computer use, targeting high scores on OSWorld-Verified and GDPval benchmarks.

Training Recipe

Based on published SOTA approaches:

| Paper | Key Finding | Our Implementation |
|---|---|---|
| Jedi (XLANG, 2025) | SFT on 4M GUI grounding examples → 7B beats UI-TARS-72B | Same base model (Qwen2.5-VL-7B), same output format |
| Gelato (MLFoundations, 2025) | Click-100k curated dataset → SOTA grounding | Click-100k as the primary dataset |
| GDPval (OpenAI, 2025) | Instruction following and output formatting are critical | Tool-calling conversations preserve these skills |

Training Details

  • Base model: Qwen/Qwen2.5-VL-7B-Instruct (8.3B params)
  • Method: SFT with LoRA (r=16, alpha=32)
  • Datasets: Click-100k as the primary grounding dataset, plus tool-calling conversations to preserve instruction following (see the recipe above)
  • Hyperparameters: 2 epochs, lr=2e-5, cosine schedule, 3% warmup, batch_size=8 (effective)
  • Resolution: 1080p max (max_pixels=2116800, matching Jedi-7B-1080p)
  • Quantization: 4-bit NF4 (QLoRA) for training efficiency
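
For reference, these settings map onto standard peft/bitsandbytes/trl configuration roughly as below. This is a sketch under stated assumptions, not the actual training script (see train_gui_grounding.py); in particular, the LoRA target modules and the batch-size split are assumptions:

import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NF4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    num_train_epochs=2,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    per_device_train_batch_size=1,          # assumed split of the
    gradient_accumulation_steps=8,          # effective batch size of 8
    bf16=True,
)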

Output Format

The model predicts click coordinates in normalized (x, y) format:

<point>(0.3456, 0.7890)</point>
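
Downstream code has to scale these normalized coordinates to the screenshot's pixel size. A minimal sketch, assuming the exact <point> format shown above (the helper name and regex are illustrative, not part of the model's API):

import re

def point_to_pixels(text: str, width: int, height: int) -> tuple[int, int]:
    """Parse '<point>(x, y)</point>' output and scale to pixel coordinates."""
    match = re.search(r"<point>\(([\d.]+),\s*([\d.]+)\)</point>", text)
    if match is None:
        raise ValueError(f"no <point> found in {text!r}")
    x, y = float(match.group(1)), float(match.group(2))
    return round(x * width), round(y * height)

# e.g. on a 1920x1080 screenshot:
# point_to_pixels("<point>(0.3456, 0.7890)</point>", 1920, 1080) -> (664, 852)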

Usage

import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import PeftModel
from PIL import Image

# Load base model + LoRA adapter
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "legolasyiu/Qwen2.5-VL-7B-GUI-OSWorld-GDPval")

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    min_pixels=200704,
    max_pixels=2116800,
)

# Predict click location
image = Image.open("screenshot.png")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Click on the search bar"},
    ]}
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)

# Decode only the newly generated tokens, not the echoed prompt
generated = output[0][inputs["input_ids"].shape[1]:]
print(processor.decode(generated, skip_special_tokens=True))
# Output: <point>(0.5234, 0.0891)</point>

For OSWorld Evaluation

Use this model as the grounding component in a planner-grounder architecture:

  1. Planner: GPT-4o/o3/Claude generates natural-language action descriptions
  2. Grounder: this model converts those descriptions to pixel coordinates on screenshots

This two-stage setup follows the Jedi paper, which reports 51% on OSWorld with an o3 planner; a minimal sketch of the loop follows.
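
The sketch below assumes hypothetical planner_step() and predict_click() helpers (neither ships with this repo: planner_step() would call the planner LLM, predict_click() would run the Usage code above and parse the <point> output) and uses pyautogui for executing clicks:

import pyautogui

def run_task(instruction: str, max_steps: int = 15) -> None:
    for _ in range(max_steps):
        screenshot = pyautogui.screenshot()  # PIL image of the current screen
        # Planner: goal + screenshot -> one action description,
        # e.g. "Click on the 'File' menu in the top-left corner", or "DONE".
        action = planner_step(instruction, screenshot)  # hypothetical
        if action == "DONE":
            break
        # Grounder: this model maps the description to normalized (x, y).
        x, y = predict_click(screenshot, action)  # hypothetical
        pyautogui.click(x * screenshot.width, y * screenshot.height)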

For GDPval

The model retains Qwen2.5-VL's strong instruction-following and knowledge-work capabilities, and the tool-calling patterns in the grounding data are intended to reinforce, not degrade, structured output quality.

Training Script

See train_gui_grounding.py for the full training script.

To run training:

pip install transformers trl torch datasets trackio accelerate peft bitsandbytes qwen-vl-utils
python train_gui_grounding.py

Or via HuggingFace Jobs:

# Requires pre-paid credits
huggingface-cli jobs run train_gui_grounding.py --hardware a100-large --timeout 6h

Benchmark Targets

| Benchmark | Metric | SOTA | Target |
|---|---|---|---|
| OSWorld-Verified | Task success rate | 48.9% (ComputerRL) | Competitive with SOTA 7B models |
| OSWorld-G | Grounding accuracy | 54.1% (Jedi-7B) | Match or exceed |
| ScreenSpot-v2 | Click accuracy | 91.7% (Jedi-7B) | Match or exceed |
| GDPval | Win/tie rate vs. expert | 47.6% (Claude Opus) | Leverage preserved Qwen2.5-VL capabilities |

Citation

If you use this model, please cite the underlying works:

@article{jedi2025,
  title={Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis},
  author={XLANG Lab},
  journal={arXiv preprint arXiv:2505.13227},
  year={2025}
}