
DSACap: Enhancing Visual-Semantic Alignment with Diffusion-based Framework for Image Captioning

  • Liangyu Fu
  • Junbo Wang
  • Yuke Li
  • Qiangguo Jin
  • Hongsong Wang
  • Jing Ya
  • Linjiang Huang
  • Liang Yao
  • Jiangbin Zheng
  • Xuecheng Wu
  • Zhiyong Wang
  • Northwestern Polytechnical University, Xi'an
  • Southeast University, Nanjing
  • Beijing University of Technology
  • Beihang University
  • Sun Yat-Sen University
  • Xi'an Jiaotong University
  • University of Sydney

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Diffusion-based image captioning methods have been proposed to address the inherent issues of autoregressive models, such as slow inference, error accumulation, and limited generative diversity. However, owing to excessive reliance on textual data and a constrained training objective, existing diffusion-based methods suffer from a semantic gap between vision and language, ultimately degrading the quality of generated captions. To address this issue, we propose DSACap, a novel diffusion-based semantics-aligned image captioning framework. Unlike existing methods, which treat text as the target of noise-adding and denoising, DSACap applies these processes directly to the image, thereby reducing the loss of visual-semantic alignment. In addition, we introduce a reinforcement learning-based training strategy that maximizes the semantic alignment between image and text: the generated captions are fed into an image generation model to reconstruct the original image, and the cosine similarity between the reconstructed and original images serves as the reward for training the captioning model. Extensive experiments on the MS COCO dataset show that DSACap achieves a CIDEr score of 128.8, clearly outperforming existing diffusion-based image captioning methods. Our code will be made publicly available soon.
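The reconstruction reward described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the 512-dimensional feature vectors, and the use of generic encoder features (e.g., CLIP-style embeddings of the original and reconstructed images) are all assumptions for demonstration.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two flattened image feature vectors."""
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def reconstruction_reward(original_feat: np.ndarray,
                          reconstructed_feat: np.ndarray) -> float:
    # Reward in [-1, 1]; a higher value means the caption, once passed
    # through the image generator, preserved more of the original visual
    # semantics. This scalar would be used as the RL reward signal.
    return cosine_similarity(original_feat, reconstructed_feat)

# Toy example: random vectors stand in for image-encoder outputs.
rng = np.random.default_rng(0)
orig = rng.standard_normal(512)                    # features of the original image
recon = orig + 0.1 * rng.standard_normal(512)      # slightly perturbed reconstruction
reward = reconstruction_reward(orig, recon)
```

In the full pipeline sketched by the abstract, `original_feat` and `reconstructed_feat` would come from encoding the input image and the image regenerated from the caption; the reward then drives a policy-gradient update of the captioning model.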

Original language: English
Title of host publication: MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
Publisher: Association for Computing Machinery, Inc
Pages: 3693-3701
Number of pages: 9
ISBN (Electronic): 9798400720352
DOIs
State: Published - 27 Oct 2025
Event: 33rd ACM International Conference on Multimedia, MM 2025 - Dublin, Ireland
Duration: 27 Oct 2025 - 31 Oct 2025

Publication series

Name: MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025

Conference

Conference: 33rd ACM International Conference on Multimedia, MM 2025
Country/Territory: Ireland
City: Dublin
Period: 27/10/25 - 31/10/25

Keywords

  • diffusion model
  • image captioning
  • vision and language
