How Physics and Background Attributes Impact Video Transformers in Robotic Manipulation: A Case Study on Planar Pushing


Bibliographic Details
Published in: Proceedings of the ... IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 7391-7398
Main Authors: Jin, Shutong; Wang, Ruiyu; Zahid, Muhammad; Pokorny, Florian T.
Format: Conference Proceeding
Language: English
Published: IEEE, 14.10.2024
ISSN: 2153-0866
DOI: 10.1109/IROS58592.2024.10802583

Summary: As model and dataset sizes continue to scale in robot learning, understanding how the composition and properties of a dataset affect model performance becomes increasingly important for cost-effective data collection. In this work, we empirically investigate how physics attributes (color, friction coefficient, shape) and scene background characteristics, such as the complexity and dynamics of interactions with background objects, influence the performance of Video Transformers in predicting planar pushing trajectories. We investigate three primary questions: How do physics attributes and background scene characteristics influence model performance? What kinds of changes in attributes are most detrimental to model generalization? What proportion of fine-tuning data is required to adapt models to novel scenarios? To facilitate this research, we present CloudGripper-Push-1K, a large real-world vision-based robot pushing dataset comprising 1278 hours and 460,000 videos of planar pushing interactions with objects of different physics and background attributes. We also propose Video Occlusion Transformer (VOT), a generic, modular video-transformer-based trajectory prediction framework that offers three choices of 2D spatial encoder, which serve as the subjects of our case study. The dataset and source code are available at https://cloudgripper.org.