DSACap: Enhancing Visual-Semantic Alignment with Diffusion-based Framework for Image Captioning

  • Liangyu Fu
  • Junbo Wang
  • Yuke Li
  • Qiangguo Jin
  • Hongsong Wang
  • Jing Ya
  • Linjiang Huang
  • Liang Yao
  • Jiangbin Zheng
  • Xuecheng Wu
  • Zhiyong Wang
  • Northwestern Polytechnical University Xian
  • Southeast University, Nanjing
  • Beijing University of Technology
  • Beihang University
  • Sun Yat-Sen University
  • Xi'an Jiaotong University
  • University of Sydney

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-reviewed

Abstract

Diffusion-based image captioning methods have been proposed to address the inherent issues of autoregressive models, such as slow inference speed, significant accumulative errors, and limited generative diversity. However, due to excessive reliance on textual data and a constrained training objective, existing diffusion-based methods suffer from a semantic gap between vision and language, ultimately resulting in poor-quality generated captions. To address this issue, we propose a novel diffusion-based, semantics-aligned image captioning framework, namely DSACap. Specifically, DSACap deviates from existing methods, which treat text as the target of noise-adding and denoising, and instead applies these processes directly to the image, thus reducing the loss of visual-semantic alignment. In addition, we introduce a reinforcement learning-based training strategy to maximize the semantic alignment between image and text. We feed the generated textual descriptions into an image generation model to reconstruct the original image, and use the cosine similarity between the generated image and the original image as the reward to train the image captioning model. Extensive experimental results on the MS COCO dataset demonstrate that DSACap achieves a CIDEr score of 128.8, clearly outperforming existing diffusion-based image captioning methods. Our code will be made publicly available soon.
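
The reward described in the abstract, the cosine similarity between the reconstructed image and the original image, can be illustrated with a minimal Python sketch. The abstract does not say in which feature space the similarity is computed, so the function below assumes both images have already been encoded as equal-length feature vectors by some image encoder. The name `cosine_reward` and the handling of zero vectors are illustrative choices, not the authors' implementation.

```python
import math


def cosine_reward(original_features, reconstructed_features):
    """Cosine similarity between two equal-length feature vectors.

    Assumption: both vectors come from the same (unspecified) image
    encoder; the source abstract does not name one.
    """
    if len(original_features) != len(reconstructed_features):
        raise ValueError("feature vectors must have the same length")

    dot = sum(a * b for a, b in zip(original_features, reconstructed_features))
    norm_original = math.sqrt(sum(a * a for a in original_features))
    norm_reconstructed = math.sqrt(sum(b * b for b in reconstructed_features))

    # Illustrative choice: an all-zero vector carries no direction,
    # so return a neutral reward instead of dividing by zero.
    if norm_original == 0.0 or norm_reconstructed == 0.0:
        return 0.0

    return dot / (norm_original * norm_reconstructed)


if __name__ == "__main__":
    # Hypothetical feature vectors, for illustration only.
    original = [0.2, 0.7, 0.1]
    reconstructed = [0.25, 0.6, 0.2]
    print(cosine_reward(original, reconstructed))
```

Under this sketch, a caption whose reconstruction points in the same direction as the original image's features earns a reward close to 1, while an unrelated reconstruction earns a reward close to 0 (or negative, if the features can take negative values).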

Original language: English
Title of host publication: MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
Publisher: Association for Computing Machinery, Inc
Pages: 3693-3701
Number of pages: 9
ISBN (Electronic): 9798400720352
DOI
Publication status: Published - 27 Oct 2025
Event: 33rd ACM International Conference on Multimedia, MM 2025 - Dublin, Ireland
Duration: 27 Oct 2025 to 31 Oct 2025

Publication series

Name: MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025

Conference

Conference: 33rd ACM International Conference on Multimedia, MM 2025
Country/Territory: Ireland
City: Dublin
Period: 27/10/25 to 31/10/25
