Integrating with Multimodal Information for Enhancing Robotic Grasping with Vision-Language Models

Zhou Zhao, Dongyuan Zheng, Yizi Chen, Jing Luo, Yanjun Wang, Panfeng Huang, Chenguang Yang

Research output: Contribution to journal › Article › peer-review

Abstract

As robots grow increasingly intelligent and draw on data from a variety of sensors, relying solely on unimodal data sources is becoming inadequate for their operational needs. Consequently, integrating multimodal data has emerged as a critical area of focus. However, effectively combining different data modalities remains a considerable challenge, especially in complex and dynamic settings where accurate object recognition and manipulation are essential. In this paper, we introduce a novel framework that integrates Multimodal Information for Grasping synthesis with vision-language models (MIG), designed to improve robotic grasping capabilities. The framework incorporates visual data, textual information, and human-derived prior knowledge. We first create target object masks based on this prior knowledge, which are used to segregate the target objects from their surroundings in the image. We then employ language cues to refine the visual representations of these objects. Finally, the system executes precise grasping actions by synthesizing visual and textual data, enabling more effective and contextually aware robotic grasping. We carry out experiments on the OCID-VLG dataset and observe that our methodology surpasses current state-of-the-art (SOTA) techniques, delivering improvements of 9.91% and 5.70% in grasp accuracy for top-1 and top-5 predictions, respectively. Moreover, when applied to the reconstructed Grasp-MultiObject dataset, our approach demonstrates even more substantial enhancements, achieving gains of 17.63% and 22.76% over SOTA methods for top-1 and top-5 predictions, respectively.
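The abstract outlines a three-stage pipeline: prior-knowledge masks isolate the target object, language cues refine the visual representation, and a grasp is predicted from the fused modalities. The sketch below is only an illustration of that flow under our own assumptions; the convolutional encoder, the cross-attention fusion, the 5-parameter grasp head, and all names such as MIGSketch are hypothetical and are not taken from the paper.

```python
import torch
import torch.nn as nn


class MIGSketch(nn.Module):
    """Illustrative sketch (not the authors' architecture) of the pipeline
    described in the abstract:
      (1) a prior-knowledge mask segregates the target object,
      (2) language cues refine the masked visual features,
      (3) a grasp rectangle (x, y, w, h, theta) is predicted from the fusion.
    """

    def __init__(self, text_dim: int = 512, vis_dim: int = 256):
        super().__init__()
        # Hypothetical visual encoder over the masked image
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, vis_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Hypothetical projection of a sentence embedding (e.g. CLIP-style)
        self.text_proj = nn.Linear(text_dim, vis_dim)
        # Cross-modal fusion via attention (assumed, not from the paper)
        self.fusion = nn.MultiheadAttention(embed_dim=vis_dim, num_heads=4, batch_first=True)
        # Grasp head: oriented grasp rectangle parameters (x, y, width, height, rotation)
        self.grasp_head = nn.Linear(vis_dim, 5)

    def forward(self, image, object_mask, text_embedding):
        # Step 1: mask from prior knowledge isolates the target object
        masked_image = image * object_mask              # (B, 3, H, W) * (B, 1, H, W)
        vis_feat = self.visual_encoder(masked_image)    # (B, vis_dim)
        # Step 2: language cues refine the visual representation
        txt_feat = self.text_proj(text_embedding)       # (B, vis_dim)
        fused, _ = self.fusion(
            query=vis_feat.unsqueeze(1),                # visual query attends to ...
            key=txt_feat.unsqueeze(1),                  # ... the language cue
            value=txt_feat.unsqueeze(1),
        )
        # Step 3: predict a grasp from the fused visual-textual representation
        return self.grasp_head(fused.squeeze(1))        # (B, 5)


if __name__ == "__main__":
    model = MIGSketch()
    image = torch.randn(2, 3, 224, 224)
    mask = torch.ones(2, 1, 224, 224)   # placeholder mask from prior knowledge
    text = torch.randn(2, 512)          # placeholder sentence embedding
    print(model(image, mask, text).shape)  # torch.Size([2, 5])
```

This sketch only mirrors the order of operations stated in the abstract; the paper's actual mask generation, language grounding, and grasp-synthesis modules may differ substantially.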

Original language: English
Journal: IEEE Transactions on Automation Science and Engineering
DOIs
State: Accepted/In press - 2025

Keywords

  • human-computer interaction
  • multimodal fusion
  • robot learning
  • robotic grasping
  • vision-language models
