TY - JOUR
T1 - Large language models for robotics
T2 - Opportunities, challenges, and perspectives
AU - Wang, Jiaqi
AU - Shi, Enze
AU - Hu, Huawen
AU - Ma, Chong
AU - Liu, Yiheng
AU - Wang, Xuhui
AU - Yao, Yincheng
AU - Liu, Xuan
AU - Ge, Bao
AU - Zhang, Shu
N1 - Publisher Copyright:
© 2024 The Authors
PY - 2025/3
Y1 - 2025/3
AB - Large language models (LLMs) have undergone significant expansion and have been increasingly integrated across various domains. Notably, in the realm of robot task planning, LLMs harness their advanced reasoning and language comprehension capabilities to formulate precise and efficient action plans based on natural language instructions. However, for embodied tasks, where robots interact with complex environments, text-only LLMs often face challenges due to a lack of compatibility with robotic visual perception. This study provides a comprehensive overview of the emerging integration of LLMs and multimodal LLMs into various robotic tasks. Additionally, we propose a framework that utilizes multimodal GPT-4V to enhance embodied task planning through the combination of natural language instructions and robot visual perceptions. Our results, based on diverse datasets, indicate that GPT-4V effectively enhances robot performance in embodied tasks. This extensive survey and evaluation of LLMs and multimodal LLMs across a variety of robotic tasks enriches the understanding of LLM-centric embodied intelligence and provides forward-looking insights towards bridging the gap in Human-Robot-Environment interaction.
KW - Embodied intelligence
KW - Generative AI
KW - Large language models
KW - Robotics
UR - http://www.scopus.com/inward/record.url?scp=105001084015&partnerID=8YFLogxK
DO - 10.1016/j.jai.2024.12.003
M3 - Review article
AN - SCOPUS:105001084015
SN - 2949-8554
VL - 4
SP - 52
EP - 64
JO - Journal of Automation and Intelligence
JF - Journal of Automation and Intelligence
IS - 1
ER -