
Integrating With Multimodal Information for Enhancing Robotic Grasping With Vision-Language Models

  • Zhou Zhao
  • Dongyuan Zheng
  • Yizi Chen
  • Jing Luo
  • Yanjun Wang
  • Panfeng Huang
  • Chenguang Yang
  • Central China Normal University
  • Hubei Engineering Research Center for Intelligent Detection and Identification of Complex Parts
  • Swiss Federal Institute of Technology Zurich
  • Wuhan University of Technology
  • Shanghai Jiao Tong University
  • University of Liverpool

Research output: Contribution to journal, Article, peer-reviewed

12 citations (Scopus)

Abstract

As robots grow increasingly intelligent and draw on data from a variety of sensors, relying solely on unimodal data sources is becoming inadequate for their operational needs. Consequently, integrating multimodal data has emerged as a critical area of focus. However, effectively combining different data modalities poses a considerable challenge, especially in complex and dynamic settings where accurate object recognition and manipulation are essential. In this paper, we introduce a novel framework integrating with Multimodal Information for Grasping Synthesis with vision-language models (MIG), designed to improve robotic grasping capabilities. This framework incorporates visual data, textual information, and human-derived prior knowledge. We start by creating target object masks based on this prior knowledge, which are then used to separate the target objects from their surroundings in the image. Subsequently, we employ language cues to refine the visual representations of these objects. Finally, our system executes precise grasping actions by synthesizing visual and textual data, thus facilitating more effective and contextually aware robotic grasping. We carry out experiments on the OCID-VLG dataset and observe that our method surpasses current state-of-the-art (SOTA) techniques, improving grasp accuracy by 9.91% and 5.70% for top-1 and top-5 predictions, respectively. Moreover, when applied to the reconstructed Grasp-MultiObject dataset, our approach demonstrates even more substantial enhancements, achieving gains of 17.63% and 22.76% over SOTA methods for top-1 and top-5 predictions, respectively.
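The abstract reports results as top-1 and top-5 grasp accuracy. As a minimal sketch of how such a metric is commonly computed (not the authors' evaluation code), the function below counts a sample as a hit when at least one of its k highest-scoring grasp candidates is marked correct. The names `top_k_accuracy`, `scores`, and `is_correct` are hypothetical, and the criterion that decides whether a candidate is "correct" is assumed to be applied beforehand, since the abstract does not specify it.

```python
def top_k_accuracy(scores, is_correct, k):
    """Fraction of samples with a correct grasp among the k best candidates.

    scores:     list of per-sample lists of candidate confidences
    is_correct: list of per-sample lists of booleans (assumed precomputed
                by whatever grasp-matching criterion the benchmark uses)
    """
    hits = 0
    for sample_scores, sample_flags in zip(scores, is_correct):
        # Rank candidate indices from highest to lowest confidence.
        ranked = sorted(range(len(sample_scores)),
                        key=lambda i: sample_scores[i],
                        reverse=True)
        # The sample is a hit if any of the top-k candidates is correct.
        if any(sample_flags[i] for i in ranked[:k]):
            hits += 1
    return hits / len(scores)


# Toy data (illustrative only, not from the paper): 3 samples, 3 candidates each.
scores = [[0.9, 0.5, 0.1],
          [0.2, 0.8, 0.6],
          [0.7, 0.3, 0.4]]
is_correct = [[True, False, False],
              [False, False, True],
              [False, False, False]]

top1 = top_k_accuracy(scores, is_correct, k=1)
top2 = top_k_accuracy(scores, is_correct, k=2)
```

Because the top-k candidates always include the top-1 candidate, top-k accuracy can never fall below top-1 accuracy on the same predictions.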

Original language: English
Pages (from-to): 13073-13086
Number of pages: 14
Journal: IEEE Transactions on Automation Science and Engineering
Volume: 22
DOI
Publication status: Published - 2025
