Large language models for robotics: Opportunities, challenges, and perspectives

Jiaqi Wang; Enze Shi; Huawen Hu; Chong Ma; Yiheng Liu; Xuhui Wang; Yincheng Yao; Xuan Liu; Bao Ge; Shu Zhang

doi:10.1016/j.jai.2024.12.003


School of Computer Science

Research output: Contribution to journal › Review article › Peer-reviewed

1 citation (Scopus)

Abstract

Large language models (LLMs) have undergone significant expansion and have been increasingly integrated across various domains. Notably, in the realm of robot task planning, LLMs harness their advanced reasoning and language comprehension capabilities to formulate precise and efficient action plans based on natural language instructions. However, for embodied tasks, where robots interact with complex environments, text-only LLMs often face challenges due to a lack of compatibility with robotic visual perception. This study provides a comprehensive overview of the emerging integration of LLMs and multimodal LLMs into various robotic tasks. Additionally, we propose a framework that utilizes multimodal GPT-4V to enhance embodied task planning through the combination of natural language instructions and robot visual perceptions. Our results, based on diverse datasets, indicate that GPT-4V effectively enhances robot performance in embodied tasks. This extensive survey and evaluation of LLMs and multimodal LLMs across a variety of robotic tasks enriches the understanding of LLM-centric embodied intelligence and provides forward-looking insights towards bridging the gap in Human-Robot-Environment interaction.

Original language	English
Pages (from-to)	52-64
Number of pages	13
Journal	Journal of Automation and Intelligence
Volume	4
Issue	1
DOI	https://doi.org/10.1016/j.jai.2024.12.003
Publication status	Published - Mar 2025

Access to document

10.1016/j.jai.2024.12.003

Other files and links

Link to publication in Scopus

Cite this

@article{7ed038c1b6ff447280122b00e9e621b9,

title = "Large language models for robotics: Opportunities, challenges, and perspectives",

abstract = "Large language models (LLMs) have undergone significant expansion and have been increasingly integrated across various domains. Notably, in the realm of robot task planning, LLMs harness their advanced reasoning and language comprehension capabilities to formulate precise and efficient action plans based on natural language instructions. However, for embodied tasks, where robots interact with complex environments, text-only LLMs often face challenges due to a lack of compatibility with robotic visual perception. This study provides a comprehensive overview of the emerging integration of LLMs and multimodal LLMs into various robotic tasks. Additionally, we propose a framework that utilizes multimodal GPT-4V to enhance embodied task planning through the combination of natural language instructions and robot visual perceptions. Our results, based on diverse datasets, indicate that GPT-4V effectively enhances robot performance in embodied tasks. This extensive survey and evaluation of LLMs and multimodal LLMs across a variety of robotic tasks enriches the understanding of LLM-centric embodied intelligence and provides forward-looking insights towards bridging the gap in Human-Robot-Environment interaction.",

keywords = "Embodied intelligence, Generative AI, Large language models, Robotics",

author = "Jiaqi Wang and Enze Shi and Huawen Hu and Chong Ma and Yiheng Liu and Xuhui Wang and Yincheng Yao and Xuan Liu and Bao Ge and Shu Zhang",

note = "Publisher Copyright: {\textcopyright} 2024 The Authors",

year = "2025",

month = mar,

doi = "10.1016/j.jai.2024.12.003",

language = "English",

volume = "4",

pages = "52--64",

journal = "Journal of Automation and Intelligence",

issn = "2949-8554",

publisher = "KeAi Communications Co.",

number = "1",

}

TY - JOUR

T1 - Large language models for robotics

T2 - Opportunities, challenges, and perspectives

AU - Wang, Jiaqi

AU - Shi, Enze

AU - Hu, Huawen

AU - Ma, Chong

AU - Liu, Yiheng

AU - Wang, Xuhui

AU - Yao, Yincheng

AU - Liu, Xuan

AU - Ge, Bao

AU - Zhang, Shu

PY - 2025/3

Y1 - 2025/3

N2 - Large language models (LLMs) have undergone significant expansion and have been increasingly integrated across various domains. Notably, in the realm of robot task planning, LLMs harness their advanced reasoning and language comprehension capabilities to formulate precise and efficient action plans based on natural language instructions. However, for embodied tasks, where robots interact with complex environments, text-only LLMs often face challenges due to a lack of compatibility with robotic visual perception. This study provides a comprehensive overview of the emerging integration of LLMs and multimodal LLMs into various robotic tasks. Additionally, we propose a framework that utilizes multimodal GPT-4V to enhance embodied task planning through the combination of natural language instructions and robot visual perceptions. Our results, based on diverse datasets, indicate that GPT-4V effectively enhances robot performance in embodied tasks. This extensive survey and evaluation of LLMs and multimodal LLMs across a variety of robotic tasks enriches the understanding of LLM-centric embodied intelligence and provides forward-looking insights towards bridging the gap in Human-Robot-Environment interaction.

AB - Large language models (LLMs) have undergone significant expansion and have been increasingly integrated across various domains. Notably, in the realm of robot task planning, LLMs harness their advanced reasoning and language comprehension capabilities to formulate precise and efficient action plans based on natural language instructions. However, for embodied tasks, where robots interact with complex environments, text-only LLMs often face challenges due to a lack of compatibility with robotic visual perception. This study provides a comprehensive overview of the emerging integration of LLMs and multimodal LLMs into various robotic tasks. Additionally, we propose a framework that utilizes multimodal GPT-4V to enhance embodied task planning through the combination of natural language instructions and robot visual perceptions. Our results, based on diverse datasets, indicate that GPT-4V effectively enhances robot performance in embodied tasks. This extensive survey and evaluation of LLMs and multimodal LLMs across a variety of robotic tasks enriches the understanding of LLM-centric embodied intelligence and provides forward-looking insights towards bridging the gap in Human-Robot-Environment interaction.

KW - Embodied intelligence

KW - Generative AI

KW - Large language models

KW - Robotics

UR - http://www.scopus.com/inward/record.url?scp=105001084015&partnerID=8YFLogxK

U2 - 10.1016/j.jai.2024.12.003

DO - 10.1016/j.jai.2024.12.003

M3 - Review article

AN - SCOPUS:105001084015

SN - 2949-8554

VL - 4

SP - 52

EP - 64

JO - Journal of Automation and Intelligence

JF - Journal of Automation and Intelligence

IS - 1

ER -
