Giving Text More Imagination Space for Image-text Matching

Xinfeng Dong, Longfei Han, Dingwen Zhang, Li Liu, Junwei Han, Huaxiang Zhang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

4 Citations (Scopus)

Abstract

Image-text matching is a hot topic in multi-modal analysis. Existing image-text matching algorithms focus on bridging the heterogeneity gap and mapping features into a common space under a strong alignment assumption. However, these methods perform unsatisfactorily in the weak alignment scenario, in which the text carries more abstract information and the number of entities in the text is typically smaller than the number of objects in the image. To the best of our knowledge, this is the first work to address the image-text matching problem from the perspective of the information difference under weak alignment. In order to both narrow the cross-modal heterogeneity gap and balance the information discrepancy, we propose an imagination network that enriches the text modality on top of a pre-trained framework, which is helpful for image-text matching. The imagination network uses reinforcement learning to enhance the semantic information of the text modality, and an action refinement strategy is designed to constrain the freedom and divergence of the imagination. Experimental results show the superiority and generality of the proposed framework with two pre-trained models, CLIP and BLIP, on the two most frequently used datasets, MSCOCO and Flickr30K.
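The abstract describes the imagination network only at a high level. As a purely illustrative sketch of the general idea (enriching frozen text features with "imagined" semantic content and rewarding the choice with image-text similarity through a policy-gradient update), the toy code below is an assumption on our part: ImaginationPolicy, reinforce_step, the concept-bank size, and the cosine-similarity reward are hypothetical and do not reproduce the paper's architecture or its action refinement strategy.

# Hypothetical sketch (not the authors' released code): a toy "imagination"
# policy that enriches a text embedding with one extra concept vector and is
# trained with REINFORCE so the enriched text moves closer to its paired image.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM, NUM_CONCEPTS = 512, 1000  # assumed sizes, not taken from the paper

class ImaginationPolicy(nn.Module):
    """Scores candidate concepts to 'imagine' for a given text embedding."""
    def __init__(self, emb_dim=EMB_DIM, num_concepts=NUM_CONCEPTS):
        super().__init__()
        self.concepts = nn.Embedding(num_concepts, emb_dim)  # learnable concept bank
        self.scorer = nn.Linear(emb_dim, num_concepts)       # action logits per concept

    def forward(self, text_emb):
        return torch.distributions.Categorical(logits=self.scorer(text_emb))

def reinforce_step(policy, optimizer, text_emb, image_emb):
    """One REINFORCE update: reward = cosine similarity of enriched text vs. paired image."""
    dist = policy(text_emb)
    action = dist.sample()                                     # index of the imagined concept
    enriched = F.normalize(text_emb + policy.concepts(action), dim=-1)
    reward = F.cosine_similarity(enriched, F.normalize(image_emb, dim=-1), dim=-1)
    # Policy-gradient term for the scorer plus a direct similarity term for the concept bank.
    loss = -(dist.log_prob(action) * reward.detach()).mean() - reward.mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.mean().item()

# Usage with random stand-in features; in the paper the text and image features
# would come from frozen CLIP or BLIP encoders rather than random tensors.
policy = ImaginationPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
text_feats, image_feats = torch.randn(8, EMB_DIM), torch.randn(8, EMB_DIM)
print(reinforce_step(policy, optimizer, text_feats, image_feats))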

Original language: English
Host publication: MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc
Pages: 6359-6368
Number of pages: 10
ISBN (electronic): 9798400701085
DOI
Publication status: Published - 26 Oct 2023
Event: 31st ACM International Conference on Multimedia, MM 2023 - Ottawa, Canada
Duration: 29 Oct 2023 - 3 Nov 2023

Publication series

Name: MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia

Conference

Conference: 31st ACM International Conference on Multimedia, MM 2023
Country/Territory: Canada
City: Ottawa
Period: 29/10/23 - 3/11/23
