Abstract
Acquiring a multi-task imitation policy in 3D manipulation poses challenges in terms of scene understanding and action prediction. Current methods employ both 3D representation and multi-view 2D representation to predict the poses of the robot's end-effector. However, they still require a considerable amount of high-quality robot trajectories, and suffer from limited generalization in unseen tasks and inefficient execution in long-horizon reasoning. In this paper, we propose SAM-E, a novel architecture for robot manipulation by leveraging a vision-foundation model for generalizable scene understanding and sequence imitation for long-term action reasoning. Specifically, we adopt Segment Anything (SAM) pre-trained on a huge number of images and promptable masks as the foundation model for extracting task-relevant features, and employ parameter-efficient fine-tuning on robot data for a better understanding of embodied scenarios. To address long-horizon reasoning, we develop a novel multichannel heatmap that enables the prediction of the action sequence in a single pass, notably enhancing execution efficiency. Experimental results from various instruction-following tasks demonstrate that SAM-E achieves superior performance with higher execution efficiency compared to the baselines, and also significantly improves generalization in few-shot adaptation to new tasks.
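The abstract mentions two technical ingredients: parameter-efficient fine-tuning of a frozen vision foundation model, and a multichannel heatmap that predicts a whole action sequence in a single forward pass. The PyTorch sketch below is only a rough illustration of these two ideas under stated assumptions, not the authors' implementation: the toy encoder stands in for the SAM image encoder, the LoRA-style adapter is one common choice of parameter-efficient fine-tuning (the paper does not specify its method here), and the class names `LoRALinear`, `ToyEncoder`, `SequenceHeatmapHead`, the adapter rank, and the action horizon are all hypothetical.

```python
# Minimal sketch (not the authors' code): a frozen toy backbone stands in for the
# SAM image encoder, LoRA-style adapters provide parameter-efficient fine-tuning,
# and a head predicts one heatmap channel per future action step in a single pass.
# All module names and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a low-rank trainable update (LoRA-style)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pre-trained weights frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # adapter starts as a zero update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))


class ToyEncoder(nn.Module):
    """Placeholder for a SAM-like image encoder producing patch features."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        for p in self.patchify.parameters():
            p.requires_grad = False  # pre-trained backbone stays frozen
        self.proj = LoRALinear(nn.Linear(dim, dim))  # only the adapters train

    def forward(self, images):                       # (B, 3, H, W)
        feats = self.patchify(images)                # (B, C, H/16, W/16)
        feats = feats.permute(0, 2, 3, 1)            # (B, h, w, C)
        return self.proj(feats).permute(0, 3, 1, 2)  # (B, C, h, w)


class SequenceHeatmapHead(nn.Module):
    """Predict a multi-channel heatmap: one 2D channel per future action step."""

    def __init__(self, dim: int = 256, horizon: int = 4):
        super().__init__()
        self.head = nn.Conv2d(dim, horizon, kernel_size=1)

    def forward(self, feats):                        # (B, C, h, w)
        logits = self.head(feats)                    # (B, horizon, h, w)
        b, t, h, w = logits.shape
        return logits.flatten(2).softmax(-1).view(b, t, h, w)


encoder, head = ToyEncoder(), SequenceHeatmapHead(horizon=4)
heatmaps = head(encoder(torch.randn(2, 3, 128, 128)))
print(heatmaps.shape)  # torch.Size([2, 4, 8, 8]): one spatial heatmap per step
```

In this sketch, training updates only the small adapter and head parameters while the backbone stays frozen, and the head emits the full horizon of heatmaps in one pass, which is the execution-efficiency point the abstract makes about single-pass action-sequence prediction.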
| Original language | English |
| --- | --- |
| Pages (from-to) | 58579-58598 |
| Number of pages | 20 |
| Journal | Proceedings of Machine Learning Research |
| Volume | 235 |
| State | Published - 2024 |
| Externally published | Yes |
| Event | 41st International Conference on Machine Learning, ICML 2024, Vienna, Austria, 21 Jul 2024 – 27 Jul 2024 |