TY - JOUR
T1 - Large language models for robotics
T2 - Opportunities, challenges, and perspectives
AU - Wang, Jiaqi
AU - Shi, Enze
AU - Hu, Huawen
AU - Ma, Chong
AU - Liu, Yiheng
AU - Wang, Xuhui
AU - Yao, Yincheng
AU - Liu, Xuan
AU - Ge, Bao
AU - Zhang, Shu
N1 - Publisher Copyright:
© 2024 The Authors
PY - 2025/3
Y1 - 2025/3
AB - Large language models (LLMs) have undergone significant expansion and have been increasingly integrated across various domains. Notably, in the realm of robot task planning, LLMs harness their advanced reasoning and language comprehension capabilities to formulate precise and efficient action plans based on natural language instructions. However, for embodied tasks, where robots interact with complex environments, text-only LLMs often face challenges due to a lack of compatibility with robotic visual perception. This study provides a comprehensive overview of the emerging integration of LLMs and multimodal LLMs into various robotic tasks. Additionally, we propose a framework that utilizes multimodal GPT-4V to enhance embodied task planning through the combination of natural language instructions and robot visual perceptions. Our results, based on diverse datasets, indicate that GPT-4V effectively enhances robot performance in embodied tasks. This extensive survey and evaluation of LLMs and multimodal LLMs across a variety of robotic tasks enriches the understanding of LLM-centric embodied intelligence and provides forward-looking insights towards bridging the gap in Human-Robot-Environment interaction.
KW - Embodied intelligence
KW - Generative AI
KW - Large language models
KW - Robotics
UR - http://www.scopus.com/inward/record.url?scp=105001084015&partnerID=8YFLogxK
DO - 10.1016/j.jai.2024.12.003
M3 - Review article
AN - SCOPUS:105001084015
SN - 2949-8554
VL - 4
SP - 52
EP - 64
JO - Journal of Automation and Intelligence
JF - Journal of Automation and Intelligence
IS - 1
ER -