TY - GEN
T1 - DSACap
T2 - 33rd ACM International Conference on Multimedia, MM 2025
AU - Fu, Liangyu
AU - Wang, Junbo
AU - Li, Yuke
AU - Jin, Qiangguo
AU - Wang, Hongsong
AU - Ya, Jing
AU - Huang, Linjiang
AU - Yao, Liang
AU - Zheng, Jiangbin
AU - Wu, Xuecheng
AU - Wang, Zhiyong
N1 - Publisher Copyright:
© 2025 ACM.
PY - 2025/10/27
Y1 - 2025/10/27
N2 - Diffusion-based image captioning methods have been proposed to address the inherent issues of autoregressive models, such as slow inference speed, significant accumulative errors, and limited generative diversity. However, due to excessive reliance on textual data and constrained training objectives, existing diffusion-based methods suffer from a semantic gap between vision and language, ultimately resulting in poor-quality generated captions. To address this issue, we propose a novel diffusion-based semantics-aligned image captioning framework, namely DSACap. Specifically, DSACap departs from existing methods, which treat text as the target of noise addition and denoising, and instead applies these processes directly to the image, thus reducing the loss of visual-semantic alignment. In addition, we introduce a reinforcement learning-based training strategy to maximize the semantic alignment between image and text. We feed the generated textual descriptions into an image generation model to reconstruct the original image and use the cosine similarity between the generated image and the original image as the reward to train the image captioning model. Extensive experimental results on the MS COCO dataset demonstrate that DSACap achieves a CIDEr score of 128.8, clearly outperforming existing diffusion-based image captioning methods. Our code will be made publicly available soon.
AB - Diffusion-based image captioning methods have been proposed to address the inherent issues of autoregressive models, such as slow inference speed, significant accumulative errors, and limited generative diversity. However, due to excessive reliance on textual data and constrained training objectives, existing diffusion-based methods suffer from a semantic gap between vision and language, ultimately resulting in poor-quality generated captions. To address this issue, we propose a novel diffusion-based semantics-aligned image captioning framework, namely DSACap. Specifically, DSACap departs from existing methods, which treat text as the target of noise addition and denoising, and instead applies these processes directly to the image, thus reducing the loss of visual-semantic alignment. In addition, we introduce a reinforcement learning-based training strategy to maximize the semantic alignment between image and text. We feed the generated textual descriptions into an image generation model to reconstruct the original image and use the cosine similarity between the generated image and the original image as the reward to train the image captioning model. Extensive experimental results on the MS COCO dataset demonstrate that DSACap achieves a CIDEr score of 128.8, clearly outperforming existing diffusion-based image captioning methods. Our code will be made publicly available soon.
KW - diffusion model
KW - image captioning
KW - vision and language
UR - https://www.scopus.com/pages/publications/105024069701
U2 - 10.1145/3746027.3755156
DO - 10.1145/3746027.3755156
M3 - Conference contribution
AN - SCOPUS:105024069701
T3 - MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
SP - 3693
EP - 3701
BT - MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
PB - Association for Computing Machinery, Inc
Y2 - 27 October 2025 through 31 October 2025
ER -