SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation

Junjie Zhang, Chenjia Bai, Haoran He, Zhigang Wang, Bin Zhao, Xiu Li, Xuelong Li

Research output: Contribution to journalConference articlepeer-review

1 Scopus citations

Abstract

Acquiring a multi-task imitation policy in 3D manipulation poses challenges in terms of scene understanding and action prediction.Current methods employ both 3D representation and multi-view 2D representation to predict the poses of the robot's end-effector.However, they still require a considerable amount of high-quality robot trajectories, and suffer from limited generalization in unseen tasks and inefficient execution in long-horizon reasoning.In this paper, we propose SAM-E, a novel architecture for robot manipulation by leveraging a vision-foundation model for generalizable scene understanding and sequence imitation for long-term action reasoning.Specifically, we adopt Segment Anything (SAM) pre-trained on a huge number of images and promptable masks as the foundation model for extracting task-relevant features, and employ parameter-efficient fine-tuning on robot data for a better understanding of embodied scenarios.To address long-horizon reasoning, we develop a novel multichannel heatmap that enables the prediction of the action sequence in a single pass, notably enhancing execution efficiency.Experimental results from various instruction-following tasks demonstrate that SAM-E achieves superior performance with higher execution efficiency compared to the baselines, and also significantly improves generalization in few-shot adaptation to new tasks.

Original languageEnglish
Pages (from-to)58579-58598
Number of pages20
JournalProceedings of Machine Learning Research
Volume235
StatePublished - 2024
Externally publishedYes
Event41st International Conference on Machine Learning, ICML 2024 - Vienna, Austria
Duration: 21 Jul 202427 Jul 2024

Fingerprint

Dive into the research topics of 'SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation'. Together they form a unique fingerprint.

Cite this