Qwen2.5-VL-7B GUI Agent for OSWorld-Verified & GDPval

A fine-tuned Qwen2.5-VL-7B-Instruct model optimized for GUI grounding and desktop computer use, targeting high scores on OSWorld-Verified and GDPval benchmarks.

Training Recipe

Based on published SOTA approaches:

| Paper | Key Finding | Our Implementation |
|---|---|---|
| Jedi (XLANG, 2025) | SFT on 4M GUI grounding examples → 7B beats UI-TARS-72B | Same base model (Qwen2.5-VL-7B), same output format |
| Gelato (MLFoundations, 2025) | Click-100k curated dataset → SOTA grounding | Click-100k as the primary dataset |
| GDPval (OpenAI, 2025) | Instruction following and output formatting are critical | Tool-calling conversations preserve these skills |

Training Details

  • Base model: Qwen/Qwen2.5-VL-7B-Instruct (8.3B params)
  • Method: SFT with LoRA (r=16, alpha=32)
  • Datasets: Click-100k as the primary grounding dataset, plus tool-calling conversations to preserve instruction following (see the recipe above)
  • Hyperparameters: 2 epochs, lr=2e-5, cosine schedule, 3% warmup, batch_size=8 (effective)
  • Resolution: 1080p max (max_pixels=2116800, matching Jedi-7B-1080p)
  • Quantization: 4-bit NF4 (QLoRA) for training efficiency
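
For reference, these settings map onto standard peft/bitsandbytes/trl configuration roughly as below. This is a sketch under stated assumptions, not the actual training script (see train_gui_grounding.py); in particular, the LoRA target modules and the batch-size split are assumptions:

import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NF4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    num_train_epochs=2,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    per_device_train_batch_size=1,          # assumed split of the
    gradient_accumulation_steps=8,          # effective batch size of 8
    bf16=True,
)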

Output Format

The model predicts click coordinates in normalized (x, y) format:

<point>(0.3456, 0.7890)</point>
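
Downstream code has to scale these normalized coordinates to the screenshot's pixel size. A minimal sketch, assuming the exact <point> format shown above (the helper name and regex are illustrative, not part of the model's API):

import re

def point_to_pixels(text: str, width: int, height: int) -> tuple[int, int]:
    """Parse '<point>(x, y)</point>' output and scale to pixel coordinates."""
    match = re.search(r"<point>\(([\d.]+),\s*([\d.]+)\)</point>", text)
    if match is None:
        raise ValueError(f"no <point> found in {text!r}")
    x, y = float(match.group(1)), float(match.group(2))
    return round(x * width), round(y * height)

# e.g. on a 1920x1080 screenshot:
# point_to_pixels("<point>(0.3456, 0.7890)</point>", 1920, 1080) -> (664, 852)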

Usage

import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import PeftModel
from PIL import Image

# Load base model + LoRA adapter
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "legolasyiu/Qwen2.5-VL-7B-GUI-OSWorld-GDPval")

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    min_pixels=200704,
    max_pixels=2116800,
)

# Predict click location
image = Image.open("screenshot.png")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Click on the search bar"},
    ]}
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)

# Decode only the newly generated tokens, not the echoed prompt
generated = output[0][inputs["input_ids"].shape[1]:]
print(processor.decode(generated, skip_special_tokens=True))
# Output: <point>(0.5234, 0.0891)</point>

For OSWorld Evaluation

Use this model as the grounding component in a planner-grounder architecture:

  1. Planner: GPT-4o/o3/Claude generates natural-language action descriptions
  2. Grounder: this model converts those descriptions to pixel coordinates on screenshots

This two-stage setup follows the Jedi paper, which reports 51% on OSWorld with an o3 planner; a minimal sketch of the loop follows.
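
The sketch below assumes hypothetical planner_step() and predict_click() helpers (neither ships with this repo: planner_step() would call the planner LLM, predict_click() would run the Usage code above and parse the <point> output) and uses pyautogui for executing clicks:

import pyautogui

def run_task(instruction: str, max_steps: int = 15) -> None:
    for _ in range(max_steps):
        screenshot = pyautogui.screenshot()  # PIL image of the current screen
        # Planner: goal + screenshot -> one action description,
        # e.g. "Click on the 'File' menu in the top-left corner", or "DONE".
        action = planner_step(instruction, screenshot)  # hypothetical
        if action == "DONE":
            break
        # Grounder: this model maps the description to normalized (x, y).
        x, y = predict_click(screenshot, action)  # hypothetical
        pyautogui.click(x * screenshot.width, y * screenshot.height)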

For GDPval

The model retains Qwen2.5-VL's strong instruction-following and knowledge-work capabilities, and the tool-calling patterns in the grounding data are intended to reinforce, not degrade, structured output quality.

Training Script

See train_gui_grounding.py for the full training script.

To run training:

pip install transformers trl torch datasets trackio accelerate peft bitsandbytes qwen-vl-utils
python train_gui_grounding.py

Or via HuggingFace Jobs:

# Requires pre-paid credits
huggingface-cli jobs run train_gui_grounding.py --hardware a100-large --timeout 6h

Benchmark Targets

| Benchmark | Metric | SOTA | Target |
|---|---|---|---|
| OSWorld-Verified | Task success rate | 48.9% (ComputerRL) | Competitive with SOTA 7B models |
| OSWorld-G | Grounding accuracy | 54.1% (Jedi-7B) | Match or exceed |
| ScreenSpot-v2 | Click accuracy | 91.7% (Jedi-7B) | Match or exceed |
| GDPval | Win/tie rate vs. expert | 47.6% (Claude Opus) | Leverage preserved Qwen2.5-VL capabilities |

Citation

If you use this model, please cite the underlying works:

@article{jedi2025,
  title={Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis},
  author={XLANG Lab},
  journal={arXiv preprint arXiv:2505.13227},
  year={2025}
}