TY - GEN
T1 - Giving Text More Imagination Space for Image-text Matching
AU - Dong, Xinfeng
AU - Han, Longfei
AU - Zhang, Dingwen
AU - Liu, Li
AU - Han, Junwei
AU - Zhang, Huaxiang
N1 - Publisher Copyright:
© 2023 ACM.
PY - 2023/10/26
Y1 - 2023/10/26
N2 - Image-text matching is a hot topic in multi-modal analysis. Existing image-text matching algorithms focus on bridging the heterogeneity gap and mapping features into a common space under a strong alignment assumption. However, these methods perform unsatisfactorily under the weak alignment scenario, which assumes that the text contains more abstract information and that the number of entities in the text is always smaller than the number of objects in the image. To the best of our knowledge, this is the first work to address the image-text matching problem from the perspective of the information difference under weak alignment. To both narrow the cross-modal heterogeneity gap and balance the information discrepancy, we propose an imagination network that enriches the text modality based on a pre-trained framework, which is helpful for image-text matching. The imagination network utilizes reinforcement learning to enhance the semantic information of the text modality, and an action refinement strategy is designed to constrain the freedom and divergence of the imagination. Experimental results show the superiority and generality of the proposed framework based on two pre-trained models, CLIP and BLIP, on the two most frequently used datasets, MSCOCO and Flickr30K.
AB - Image-text matching is a hot topic in multi-modal analysis. Existing image-text matching algorithms focus on bridging the heterogeneity gap and mapping features into a common space under a strong alignment assumption. However, these methods perform unsatisfactorily under the weak alignment scenario, which assumes that the text contains more abstract information and that the number of entities in the text is always smaller than the number of objects in the image. To the best of our knowledge, this is the first work to address the image-text matching problem from the perspective of the information difference under weak alignment. To both narrow the cross-modal heterogeneity gap and balance the information discrepancy, we propose an imagination network that enriches the text modality based on a pre-trained framework, which is helpful for image-text matching. The imagination network utilizes reinforcement learning to enhance the semantic information of the text modality, and an action refinement strategy is designed to constrain the freedom and divergence of the imagination. Experimental results show the superiority and generality of the proposed framework based on two pre-trained models, CLIP and BLIP, on the two most frequently used datasets, MSCOCO and Flickr30K.
KW - image-text matching
KW - information enhancement
KW - reinforcement learning
UR - http://www.scopus.com/inward/record.url?scp=85179556160&partnerID=8YFLogxK
U2 - 10.1145/3581783.3612103
DO - 10.1145/3581783.3612103
M3 - Conference contribution
AN - SCOPUS:85179556160
T3 - MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
SP - 6359
EP - 6368
BT - MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
T2 - 31st ACM International Conference on Multimedia, MM 2023
Y2 - 29 October 2023 through 3 November 2023
ER -