Paper: GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks (arXiv:2510.04374)
A fine-tuned Qwen2.5-VL-7B-Instruct model optimized for GUI grounding and desktop computer use, targeting high scores on OSWorld-Verified and GDPval benchmarks.
Based on published SOTA approaches:
| Paper | Key Finding | Our Implementation |
|---|---|---|
| Jedi (XLANG, 2025) | SFT on 4M GUI grounding → 7B beats UI-TARS-72B | Same base model (Qwen2.5-VL-7B), same output format |
| Gelato (MLFoundations, 2025) | Click-100k curated dataset → SOTA grounding | Using Click-100k as primary dataset |
| GDPval (OpenAI, 2025) | Instruction-following + formatting critical | Tool-calling conversations preserve these skills |
The model predicts click coordinates in normalized (x, y) format:

```
<point>(0.3456, 0.7890)</point>
```
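To act on a prediction, the normalized coordinates need to be scaled to the screenshot's pixel size. A minimal sketch is shown below; the `parse_point` helper and the 1920×1080 resolution are illustrative, not part of the released code:

```python
import re

def parse_point(text: str) -> tuple[float, float]:
    """Extract normalized (x, y) from a '<point>(x, y)</point>' string."""
    m = re.search(r"<point>\(([\d.]+),\s*([\d.]+)\)</point>", text)
    if m is None:
        raise ValueError(f"No <point> tag found in: {text!r}")
    return float(m.group(1)), float(m.group(2))

# Convert to pixel coordinates for an actual click
x_norm, y_norm = parse_point("<point>(0.3456, 0.7890)</point>")
width, height = 1920, 1080  # screenshot resolution
click_x, click_y = int(x_norm * width), int(y_norm * height)
print(click_x, click_y)  # 663 852
```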
```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import PeftModel
from PIL import Image

# Load base model + LoRA adapter
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "legolasyiu/Qwen2.5-VL-7B-GUI-OSWorld-GDPval")

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    min_pixels=200704,
    max_pixels=2116800,
)

# Predict click location
image = Image.open("screenshot.png")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Click on the search bar"},
    ]}
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)

# Decode only the newly generated tokens, not the echoed prompt
new_tokens = output[0][inputs.input_ids.shape[1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))
# Output: <point>(0.5234, 0.0891)</point>
```
Use this model as the grounding component in a planner-grounder architecture:
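A minimal sketch of that loop follows, reusing `model`, `processor`, and `parse_point` from the snippets above. The fixed `plan` list and the `pyautogui` calls stand in for a real planner model and automation backend, so treat this as illustrative only:

```python
import pyautogui  # desktop automation backend (any equivalent works)
from PIL import Image

def ground(instruction: str, screenshot: Image.Image) -> tuple[int, int]:
    """Grounder step: natural-language instruction + screenshot -> pixel click position."""
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction},
    ]}]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[screenshot], return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=50)
    reply = processor.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    x_norm, y_norm = parse_point(reply)
    return int(x_norm * screenshot.width), int(y_norm * screenshot.height)

# In a real system a planner model would emit these instructions step by step;
# a fixed plan is used here purely for illustration.
plan = ["Click on the File menu", "Click on Save As"]
for instruction in plan:
    screenshot = pyautogui.screenshot()  # returns a PIL Image of the current screen
    x, y = ground(instruction, screenshot)
    pyautogui.click(x, y)
```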
The model retains Qwen2.5-VL's strong instruction-following and knowledge-work capabilities. Because the training examples are formatted as tool-calling conversations, the GUI grounding training improves structured output quality rather than degrading it (an illustrative sample format is sketched below).
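The exact schema of those tool-calling conversations is not documented here; the example below is only an assumed illustration of the general shape (a screenshot plus an instruction, answered with a structured click action), not the released training format.

```python
# Purely illustrative sample layout (assumed, not the actual training schema)
sample = {
    "messages": [
        {"role": "user", "content": [
            {"type": "image"},  # the screenshot
            {"type": "text", "text": "Click on the search bar"},
        ]},
        {"role": "assistant", "tool_calls": [{
            "type": "function",
            "function": {
                "name": "click",
                "arguments": '{"x": 0.5234, "y": 0.0891}',  # normalized coordinates
            },
        }]},
    ]
}
```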
See train_gui_grounding.py for the full training script.
To run training:

```bash
pip install transformers trl torch datasets trackio accelerate peft bitsandbytes qwen-vl-utils
python train_gui_grounding.py
```
Or via HuggingFace Jobs:

```bash
# Requires pre-paid credits
huggingface-cli jobs run train_gui_grounding.py --hardware a100-large --timeout 6h
```
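The training script itself is not reproduced here. As a rough sketch of the kind of setup it implies (LoRA SFT with TRL on a click-grounding dataset), the snippet below may be useful; the dataset id, column layout, LoRA target modules, and hyperparameters are all assumptions, not values taken from train_gui_grounding.py.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from trl import SFTConfig, SFTTrainer

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(model_id, min_pixels=200704, max_pixels=2116800)

# Placeholder dataset id; assumed column layout: {"images": [...], "messages": [...]}
dataset = load_dataset("your-org/click-grounding-data", split="train")

def collate_fn(examples):
    # Render each conversation to text and batch it with its screenshot(s)
    texts = [processor.apply_chat_template(ex["messages"], tokenize=False) for ex in examples]
    images = [ex["images"] for ex in examples]
    batch = processor(text=texts, images=images, padding=True, return_tensors="pt")
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100  # ignore padding in the loss
    # A full script would also mask image placeholder tokens out of the loss
    batch["labels"] = labels
    return batch

peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,                    # assumed values
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # assumed
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="qwen2.5-vl-7b-gui-grounding",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_train_epochs=1,
    learning_rate=1e-4,
    bf16=True,
    gradient_checkpointing=True,
    remove_unused_columns=False,                    # keep image columns for collate_fn
    dataset_kwargs={"skip_prepare_dataset": True},  # tokenization happens in collate_fn
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=collate_fn,
    peft_config=peft_config,
)
trainer.train()
```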
| Benchmark | Metric | SOTA | Target |
|---|---|---|---|
| OSWorld-Verified | Task success rate | 48.9% (ComputerRL) | Competitive with SOTA 7B models |
| OSWorld-G | Grounding accuracy | 54.1% (Jedi-7B) | Match or exceed |
| ScreenSpot-v2 | Click accuracy | 91.7% (Jedi-7B) | Match or exceed |
| GDPval | Win/tie vs expert | 47.6% (Claude Opus 4.1) | Leverage preserved Qwen2.5 capabilities |
If you use this model, please cite the underlying works:
```bibtex
@article{jedi2025,
  title={Scaling Computer-Use Grounding via UI Decomposition and Synthesis},
  author={XLANG Lab},
  journal={arXiv:2505.13227},
  year={2025}
}
```
Base model: Qwen/Qwen2.5-VL-7B-Instruct