CASP-Net: Rethinking Video Saliency Prediction from an Audio-Visual Consistency Perceptual Perspective

Junwen Xiong; Ganglai Wang; Peng Zhang; Wei Huang; Yufei Zha; Guangtao Zhai

doi:10.1109/CVPR52729.2023.00623

School of Computer Science

Research output: Contribution to journal › Conference article › peer-review

13 Citations (Scopus)

Abstract

Incorporating the audio stream enables Video Saliency Prediction (VSP) to imitate the selective attention mechanism of the human brain. By focusing on the benefits of joint auditory and visual information, most VSP methods are capable of exploiting the semantic correlation between the vision and audio modalities, but they ignore the negative effects due to the temporal inconsistency of audio-visual intrinsics. Inspired by the biological inconsistency-correction within multi-sensory information, in this study, a consistency-aware audio-visual saliency prediction network (CASP-Net) is proposed, which takes a comprehensive consideration of the audio-visual semantic interaction and consistent perception. In addition to a two-stream encoder for elegant association between video frames and the corresponding sound source, a novel consistency-aware predictive coding is also designed to improve the consistency within audio and visual representations iteratively. To further aggregate the multi-scale audio-visual information, a saliency decoder is introduced for the final saliency map generation. Substantial experiments demonstrate that the proposed CASP-Net outperforms the other state-of-the-art methods on six challenging audio-visual eye-tracking datasets. For a demo of our system please see our project webpage.

Original language	English
Pages (from-to)	6441-6450
Number of pages	10
Journal	IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
DOI	https://doi.org/10.1109/CVPR52729.2023.00623
Publication status	Published - 2023
Event	2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2023 - Vancouver, Canada; Duration: 18 Jun 2023 → 22 Jun 2023

Cite this

@article{39515f84146248afa88a72fe7a9cbf5b,

title = "CASP-Net: Rethinking Video Saliency Prediction from an Audio-Visual Consistency Perceptual Perspective",

abstract = "Incorporating the audio stream enables Video Saliency Prediction (VSP) to imitate the selective attention mechanism of the human brain. By focusing on the benefits of joint auditory and visual information, most VSP methods are capable of exploiting the semantic correlation between the vision and audio modalities, but they ignore the negative effects due to the temporal inconsistency of audio-visual intrinsics. Inspired by the biological inconsistency-correction within multi-sensory information, in this study, a consistency-aware audio-visual saliency prediction network (CASP-Net) is proposed, which takes a comprehensive consideration of the audio-visual semantic interaction and consistent perception. In addition to a two-stream encoder for elegant association between video frames and the corresponding sound source, a novel consistency-aware predictive coding is also designed to improve the consistency within audio and visual representations iteratively. To further aggregate the multi-scale audio-visual information, a saliency decoder is introduced for the final saliency map generation. Substantial experiments demonstrate that the proposed CASP-Net outperforms the other state-of-the-art methods on six challenging audio-visual eye-tracking datasets. For a demo of our system please see our project webpage.",

author = "Junwen Xiong and Ganglai Wang and Peng Zhang and Wei Huang and Yufei Zha and Guangtao Zhai",

note = "Publisher Copyright: {\textcopyright} 2023 Association for Computing Machinery. All rights reserved.; 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2023 ; Conference date: 18-06-2023 Through 22-06-2023",

year = "2023",

doi = "10.1109/CVPR52729.2023.00623",

language = "English",

pages = "6441--6450",

journal = "IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops",

issn = "2160-7508",

publisher = "IEEE Computer Society",

}

TY - JOUR

T1 - CASP-Net: Rethinking Video Saliency Prediction from an Audio-Visual Consistency Perceptual Perspective

T2 - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2023

AU - Xiong, Junwen

AU - Wang, Ganglai

AU - Zhang, Peng

AU - Huang, Wei

AU - Zha, Yufei

AU - Zhai, Guangtao

PY - 2023

Y1 - 2023

AB - Incorporating the audio stream enables Video Saliency Prediction (VSP) to imitate the selective attention mechanism of the human brain. By focusing on the benefits of joint auditory and visual information, most VSP methods are capable of exploiting the semantic correlation between the vision and audio modalities, but they ignore the negative effects due to the temporal inconsistency of audio-visual intrinsics. Inspired by the biological inconsistency-correction within multi-sensory information, in this study, a consistency-aware audio-visual saliency prediction network (CASP-Net) is proposed, which takes a comprehensive consideration of the audio-visual semantic interaction and consistent perception. In addition to a two-stream encoder for elegant association between video frames and the corresponding sound source, a novel consistency-aware predictive coding is also designed to improve the consistency within audio and visual representations iteratively. To further aggregate the multi-scale audio-visual information, a saliency decoder is introduced for the final saliency map generation. Substantial experiments demonstrate that the proposed CASP-Net outperforms the other state-of-the-art methods on six challenging audio-visual eye-tracking datasets. For a demo of our system please see our project webpage.

UR - http://www.scopus.com/inward/record.url?scp=85174119822&partnerID=8YFLogxK

U2 - 10.1109/CVPR52729.2023.00623

DO - 10.1109/CVPR52729.2023.00623

M3 - Conference article

AN - SCOPUS:85174119822

SN - 2160-7508

SP - 6441

EP - 6450

JO - IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops

JF - IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops

Y2 - 18 June 2023 through 22 June 2023

ER -
