TY - JOUR
T1 - Integrating with Multimodal Information for Enhancing Robotic Grasping with Vision-Language Models
AU - Zhao, Zhou
AU - Zheng, Dongyuan
AU - Chen, Yizi
AU - Luo, Jing
AU - Wang, Yanjun
AU - Huang, Panfeng
AU - Yang, Chenguang
N1 - Publisher Copyright:
© 2004-2012 IEEE.
PY - 2025
Y1 - 2025
N2 - As robots grow increasingly intelligent and utilize data from various sensors, relying solely on unimodal data sources is becoming inadequate for their operational needs. Consequently, integrating multimodal data has emerged as a critical area of focus. However, the effective combination of different data modalities poses a considerable challenge, especially in complex and dynamic settings where accurate object recognition and manipulation are essential. In this paper, we introduce a novel framework that integrates Multimodal Information for Grasping synthesis with vision-language models (MIG), designed to improve robotic grasping capabilities. This framework incorporates visual data, textual information, and human-derived prior knowledge. We start by creating target object masks based on this prior knowledge, which are then used to segregate the target objects from their surroundings in the image. Subsequently, we employ language cues to refine the visual representations of these objects. Finally, our system executes precise grasping actions using visual and textual data synthesis, thus facilitating more effective and contextually aware robotic grasping. We carry out experiments using the OCID-VLG dataset and observe that our methodology surpasses current state-of-the-art (SOTA) techniques, delivering improvements of 9.91% and 5.70% in top-1 and top-5 grasp prediction accuracy, respectively. Moreover, when applied to the reconstructed Grasp-MultiObject dataset, our approach demonstrates even more substantial enhancements, achieving gains of 17.63% and 22.76% over SOTA methods for top-1 and top-5 predictions, respectively.
AB - As robots grow increasingly intelligent and utilize data from various sensors, relying solely on unimodal data sources is becoming inadequate for their operational needs. Consequently, integrating multimodal data has emerged as a critical area of focus. However, the effective combination of different data modalities poses a considerable challenge, especially in complex and dynamic settings where accurate object recognition and manipulation are essential. In this paper, we introduce a novel framework that integrates Multimodal Information for Grasping synthesis with vision-language models (MIG), designed to improve robotic grasping capabilities. This framework incorporates visual data, textual information, and human-derived prior knowledge. We start by creating target object masks based on this prior knowledge, which are then used to segregate the target objects from their surroundings in the image. Subsequently, we employ language cues to refine the visual representations of these objects. Finally, our system executes precise grasping actions using visual and textual data synthesis, thus facilitating more effective and contextually aware robotic grasping. We carry out experiments using the OCID-VLG dataset and observe that our methodology surpasses current state-of-the-art (SOTA) techniques, delivering improvements of 9.91% and 5.70% in top-1 and top-5 grasp prediction accuracy, respectively. Moreover, when applied to the reconstructed Grasp-MultiObject dataset, our approach demonstrates even more substantial enhancements, achieving gains of 17.63% and 22.76% over SOTA methods for top-1 and top-5 predictions, respectively.
KW - human-computer interaction
KW - multimodal fusion
KW - Robot learning
KW - robotic grasping
KW - vision-language models
UR - http://www.scopus.com/inward/record.url?scp=105000030345&partnerID=8YFLogxK
U2 - 10.1109/TASE.2025.3550360
DO - 10.1109/TASE.2025.3550360
M3 - Article
AN - SCOPUS:105000030345
SN - 1545-5955
JO - IEEE Transactions on Automation Science and Engineering
JF - IEEE Transactions on Automation Science and Engineering
ER -