Paper: Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation (arXiv:2509.23866)
DART-GUI-7B is a vision-language model fine-tuned from UI-TARS-1.5-7B for GUI (Graphical User Interface) understanding and interaction tasks. It is built on the Qwen2.5-VL architecture, so it takes image and text input together, such as a screenshot paired with an instruction.
pip install transformers torch accelerate pillow
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import torch
# Load model and processor
processor = AutoProcessor.from_pretrained("your-org/dart-gui-7b")
model = AutoModelForVision2Seq.from_pretrained(
    "your-org/dart-gui-7b",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
# Prepare input
image = Image.open("your_image.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this interface"}
        ]
    }
]
# Process input and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# The processor returns a single batch object holding input_ids, attention_mask and pixel values
inputs = processor(
    text=[text],
    images=[image],
    videos=None,
    padding=True,
    return_tensors="pt"
).to(model.device)
# Generate response
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)
print(output_text[0])
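The slicing step in the example above is needed because `generate` returns each prompt followed by its newly generated tokens, so the prompt has to be cut off before decoding. As a minimal sketch of that step alone, the helper below uses plain Python lists in place of tensors. The name `trim_prompt_tokens` and the toy token IDs are illustrative assumptions, not part of the model or the `transformers` API.

```python
from typing import List, Sequence


def trim_prompt_tokens(
    input_ids: Sequence[Sequence[int]],
    generated_ids: Sequence[Sequence[int]],
) -> List[List[int]]:
    """Drop the echoed prompt from each generated sequence (hypothetical helper)."""
    trimmed = []
    for in_ids, out_ids in zip(input_ids, generated_ids):
        # Each output begins with its prompt tokens, so slice past them
        trimmed.append(list(out_ids[len(in_ids):]))
    return trimmed


# Toy token IDs standing in for real tensors (illustrative values only)
prompts = [[101, 7, 8], [101, 9]]
outputs = [[101, 7, 8, 42, 43], [101, 9, 55]]
print(trim_prompt_tokens(prompts, outputs))
```

Only the tokens after each prompt remain, which is what gets passed to `processor.batch_decode` in the example above.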
If you use this model in your research, please cite:
@article{li2025efficient,
  title={Efficient Multi-turn {RL} for {GUI} Agents via Decoupled Training and Adaptive Data Curation},
  author={Li, Pengxiang and Hu, Zechen and Shang, Zirui and Wu, Jingrong and Liu, Yang and Liu, Hui and Gao, Zhi and Shi, Chenrui and Zhang, Bofei and Zhang, Zihao and others},
  journal={arXiv preprint arXiv:2509.23866},
  year={2025}
}
This model was jointly developed by BIGAI (Beijing Institute for General Artificial Intelligence) and DataCanvas (九章云极).
This model is licensed under Apache 2.0.
For questions or suggestions, please submit an Issue through the Hugging Face repository.
Base model: ByteDance-Seed/UI-TARS-1.5-7B