
DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction

  • Junwen Xiong
  • Peng Zhang
  • Tao You
  • Chuanyue Li
  • Wei Huang
  • Yufei Zha
  • Northwestern Polytechnical University Xian
  • Nanchang University

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

19 Citations (Scopus)

Abstract

Audio-visual saliency prediction can draw support from diverse modality complements, but further performance enhancement is still challenged by customized architectures as well as task-specific loss functions. In recent studies, denoising diffusion models have shown more promise in unifying task frameworks owing to their inherent ability to generalize. Following this motivation, a novel Diffusion architecture for generalized audio-visual Saliency prediction (DiffSal) is proposed in this work, which formulates the prediction problem as a conditional generative task of the saliency map by utilizing input audio and video as the conditions. Based on the spatiotemporal audio-visual features, an extra network, Saliency-UNet, is designed to perform multimodal attention modulation for progressive refinement of the ground-truth saliency map from the noisy map. Extensive experiments demonstrate that the proposed DiffSal can achieve excellent performance across six challenging audio-visual benchmarks, with an average relative improvement of 6.3% over the previous state-of-the-art results on six metrics. The project URL is https://junwenxiong.github.io/DiffSal.

Original language: English
Title of host publication: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Publisher: IEEE Computer Society
Pages: 27263-27273
Number of pages: 11
ISBN (electronic): 9798350353006
DOI
Publication status: Published - 2024
Event: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 - Seattle, United States
Duration: 16 Jun 2024 to 22 Jun 2024

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
ISSN (print): 1063-6919

Conference

Conference: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Country/Territory: United States
City: Seattle
Period: 16/06/24 to 22/06/24
