TY - GEN
T1 - DiffSal
T2 - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
AU - Xiong, Junwen
AU - Zhang, Peng
AU - You, Tao
AU - Li, Chuanyue
AU - Huang, Wei
AU - Zha, Yufei
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Audio-visual saliency prediction can draw support from diverse modality complements, but further performance enhancement is still challenged by customized architectures as well as task-specific loss functions. In recent studies, denoising diffusion models have shown more promise in unifying task frameworks owing to their inherent ability of generalization. Following this motivation, a novel Diffusion architecture for generalized audio-visual Saliency prediction (DiffSal) is proposed in this work, which formulates the prediction problem as a conditional generative task of the saliency map by utilizing input audio and video as the conditions. Based on the spatiotemporal audio-visual features, an extra network Saliency-UNet is designed to perform multimodal attention modulation for progressive refinement of the ground-truth saliency map from the noisy map. Extensive experiments demonstrate that the proposed DiffSal can achieve excellent performance across six challenging audio-visual benchmarks, with an average relative improvement of 6.3% over the previous state-of-the-art results by six metrics. The project URL is https://junwenxiong.github.io/DiffSal.
AB - Audio-visual saliency prediction can draw support from diverse modality complements, but further performance enhancement is still challenged by customized architectures as well as task-specific loss functions. In recent studies, denoising diffusion models have shown more promise in unifying task frameworks owing to their inherent ability of generalization. Following this motivation, a novel Diffusion architecture for generalized audio-visual Saliency prediction (DiffSal) is proposed in this work, which formulates the prediction problem as a conditional generative task of the saliency map by utilizing input audio and video as the conditions. Based on the spatiotemporal audio-visual features, an extra network Saliency-UNet is designed to perform multimodal attention modulation for progressive refinement of the ground-truth saliency map from the noisy map. Extensive experiments demonstrate that the proposed DiffSal can achieve excellent performance across six challenging audio-visual benchmarks, with an average relative improvement of 6.3% over the previous state-of-the-art results by six metrics. The project URL is https://junwenxiong.github.io/DiffSal.
UR - https://www.scopus.com/pages/publications/85204159917
U2 - 10.1109/CVPR52733.2024.02575
DO - 10.1109/CVPR52733.2024.02575
M3 - Conference contribution
AN - SCOPUS:85204159917
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 27263
EP - 27273
BT - Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
PB - IEEE Computer Society
Y2 - 16 June 2024 through 22 June 2024
ER -