RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation
Abstract
Visual identity prompting enhances manipulation data augmentation for robot policies by providing explicit visual guidance to diffusion models, improving policy performance in both simulation and real-world settings.
The diversity, quantity, and quality of manipulation data are critical for training effective robot policies. However, hardware and physical setup constraints make collecting large-scale real-world manipulation data across diverse environments difficult. Recent work augments manipulation data with text-prompt-conditioned image diffusion models that alter the backgrounds and tabletop objects in the visual observations. These approaches, however, often overlook the multi-view, temporally coherent observations that state-of-the-art policy models require, and text prompts alone cannot reliably specify the scene setup. To provide the diffusion model with explicit visual guidance, we introduce visual identity prompting, which supplies exemplar images as conditioning inputs to guide generation toward the desired scene setup. To this end, we also build a scalable pipeline that curates a visual identity pool from large robotics datasets. Training downstream vision-language-action and visuomotor policy models on our augmented manipulation data yields consistent performance gains in both simulation and real-robot settings.
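To make the idea of visual identity prompting more concrete, the sketch below shows one way exemplar "identity" images could be encoded into conditioning tokens and concatenated with text-prompt embeddings before being fed to a diffusion model's cross-attention layers. All module names, dimensions, and design choices here (e.g. `IdentityPromptEncoder`, a pooled feature per exemplar, four tokens per image) are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of visual identity prompting: exemplar images of the
# desired scene setup are projected into conditioning tokens and concatenated
# with text-prompt tokens. The combined sequence would serve as the
# cross-attention conditioning of a (multi-view) video diffusion model.
import torch
import torch.nn as nn


class IdentityPromptEncoder(nn.Module):
    """Encodes pooled exemplar-image features into conditioning tokens (assumed design)."""

    def __init__(self, image_dim: int = 768, cond_dim: int = 1024, tokens_per_image: int = 4):
        super().__init__()
        self.tokens_per_image = tokens_per_image
        # Project each pooled image feature into several conditioning tokens.
        self.proj = nn.Linear(image_dim, cond_dim * tokens_per_image)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_exemplars, image_dim), e.g. from a frozen
        # vision backbone (an assumption; the paper does not specify one here).
        b, n, _ = image_features.shape
        tokens = self.proj(image_features)                     # (b, n, cond_dim * k)
        return tokens.view(b, n * self.tokens_per_image, -1)   # (b, n * k, cond_dim)


def build_conditioning(text_tokens: torch.Tensor,
                       exemplar_features: torch.Tensor,
                       encoder: IdentityPromptEncoder) -> torch.Tensor:
    """Concatenate text-prompt tokens with visual-identity tokens.

    The resulting sequence guides generation with both the text prompt and
    the exemplar images, rather than relying on text alone.
    """
    identity_tokens = encoder(exemplar_features)               # (b, n * k, cond_dim)
    return torch.cat([text_tokens, identity_tokens], dim=1)


if __name__ == "__main__":
    encoder = IdentityPromptEncoder()
    text = torch.randn(2, 77, 1024)      # placeholder text embeddings
    exemplars = torch.randn(2, 3, 768)   # 3 exemplar images per sample
    cond = build_conditioning(text, exemplars, encoder)
    print(cond.shape)                    # torch.Size([2, 89, 1024])
```

In this sketch the exemplars are drawn from a curated visual identity pool, so the same mechanism could steer the generated backgrounds and tabletop objects toward a specific setup while the text prompt describes the task.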
Community
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- Clutter-Resistant Vision-Language-Action Models through Object-Centric and Geometry Grounding (2025)
- IGen: Scalable Data Generation for Robot Learning from Open-World Images (2025)
- AnchorDream: Repurposing Video Diffusion for Embodiment-Aware Robot Data Synthesis (2025)
- Real2Edit2Real: Generating Robotic Demonstrations via a 3D Control Interface (2025)
- Towards High-Consistency Embodied World Model with Multi-View Trajectory Videos (2025)
- Image Generation as a Visual Planner for Robotic Manipulation (2025)
- H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos (2025)