TY - JOUR
T1 - CASP-Net
T2 - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2023
AU - Xiong, Junwen
AU - Wang, Ganglai
AU - Zhang, Peng
AU - Huang, Wei
AU - Zha, Yufei
AU - Zhai, Guangtao
N1 - Publisher Copyright:
© 2023 Association for Computing Machinery. All rights reserved.
PY - 2023
Y1 - 2023
N2 - Incorporating the audio stream enables Video Saliency Prediction (VSP) to imitate the selective attention mechanism of the human brain. By focusing on the benefits of joint auditory and visual information, most VSP methods are capable of exploiting semantic correlation between the vision and audio modalities, but ignore the negative effects caused by the temporal inconsistency of audio-visual intrinsics. Inspired by the biological inconsistency-correction within multi-sensory information, in this study, a consistency-aware audio-visual saliency prediction network (CASP-Net) is proposed, which takes comprehensive consideration of the audio-visual semantic interaction and consistent perception. In addition to a two-stream encoder for elegant association between video frames and the corresponding sound source, a novel consistency-aware predictive coding is also designed to iteratively improve the consistency within audio and visual representations. To further aggregate the multi-scale audio-visual information, a saliency decoder is introduced for the final saliency map generation. Substantial experiments demonstrate that the proposed CASP-Net outperforms the other state-of-the-art methods on six challenging audio-visual eye-tracking datasets. For a demo of our system, please see our project webpage.
UR - http://www.scopus.com/inward/record.url?scp=85174119822&partnerID=8YFLogxK
U2 - 10.1109/CVPR52729.2023.00623
DO - 10.1109/CVPR52729.2023.00623
M3 - Conference article
AN - SCOPUS:85174119822
SN - 2160-7508
SP - 6441
EP - 6450
JO - IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
JF - IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
Y2 - 18 June 2023 through 22 June 2023
ER -