Abstract
Acquiring a multi-task imitation policy in 3D manipulation poses challenges in terms of scene understanding and action prediction. Current methods employ both 3D representation and multi-view 2D representation to predict the poses of the robot's end-effector. However, they still require a considerable amount of high-quality robot trajectories, and suffer from limited generalization in unseen tasks and inefficient execution in long-horizon reasoning. In this paper, we propose SAM-E, a novel architecture for robot manipulation by leveraging a vision-foundation model for generalizable scene understanding and sequence imitation for long-term action reasoning. Specifically, we adopt Segment Anything (SAM) pre-trained on a huge number of images and promptable masks as the foundation model for extracting task-relevant features, and employ parameter-efficient fine-tuning on robot data for a better understanding of embodied scenarios. To address long-horizon reasoning, we develop a novel multichannel heatmap that enables the prediction of the action sequence in a single pass, notably enhancing execution efficiency. Experimental results from various instruction-following tasks demonstrate that SAM-E achieves superior performance with higher execution efficiency compared to the baselines, and also significantly improves generalization in few-shot adaptation to new tasks.
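The abstract mentions two technical ingredients: parameter-efficient fine-tuning of a frozen vision foundation model, and a multichannel heatmap that predicts a whole action sequence in a single forward pass. The PyTorch sketch below is only a rough illustration of these two ideas under stated assumptions, not the authors' implementation: the toy encoder stands in for the SAM image encoder, the LoRA-style adapter is one common choice of parameter-efficient fine-tuning (the paper does not specify its method here), and the class names `LoRALinear`, `ToyEncoder`, `SequenceHeatmapHead`, the adapter rank, and the action horizon are all hypothetical.

```python
# Minimal sketch (not the authors' code): a frozen toy backbone stands in for the
# SAM image encoder, LoRA-style adapters provide parameter-efficient fine-tuning,
# and a head predicts one heatmap channel per future action step in a single pass.
# All module names and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a low-rank trainable update (LoRA-style)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pre-trained weights frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # adapter starts as a zero update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))


class ToyEncoder(nn.Module):
    """Placeholder for a SAM-like image encoder producing patch features."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        for p in self.patchify.parameters():
            p.requires_grad = False  # pre-trained backbone stays frozen
        self.proj = LoRALinear(nn.Linear(dim, dim))  # only the adapters train

    def forward(self, images):                       # (B, 3, H, W)
        feats = self.patchify(images)                # (B, C, H/16, W/16)
        feats = feats.permute(0, 2, 3, 1)            # (B, h, w, C)
        return self.proj(feats).permute(0, 3, 1, 2)  # (B, C, h, w)


class SequenceHeatmapHead(nn.Module):
    """Predict a multi-channel heatmap: one 2D channel per future action step."""

    def __init__(self, dim: int = 256, horizon: int = 4):
        super().__init__()
        self.head = nn.Conv2d(dim, horizon, kernel_size=1)

    def forward(self, feats):                        # (B, C, h, w)
        logits = self.head(feats)                    # (B, horizon, h, w)
        b, t, h, w = logits.shape
        return logits.flatten(2).softmax(-1).view(b, t, h, w)


encoder, head = ToyEncoder(), SequenceHeatmapHead(horizon=4)
heatmaps = head(encoder(torch.randn(2, 3, 128, 128)))
print(heatmaps.shape)  # torch.Size([2, 4, 8, 8]): one spatial heatmap per step
```

In this sketch, training updates only the small adapter and head parameters while the backbone stays frozen, and the head emits the full horizon of heatmaps in one pass, which is the execution-efficiency point the abstract makes about single-pass action-sequence prediction.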
| Original language | English |
| --- | --- |
| Pages (from-to) | 58579-58598 |
| Number of pages | 20 |
| Journal | Proceedings of Machine Learning Research |
| Volume | 235 |
| State | Published - 2024 |
| Externally published | Yes |
| Event | 41st International Conference on Machine Learning, ICML 2024, Vienna, Austria, 21 Jul 2024 – 27 Jul 2024 |