TY - JOUR
T1 - Audio–visual collaborative representation learning for Dynamic Saliency Prediction
AU - Ning, Hailong
AU - Zhao, Bin
AU - Hu, Zhanxuan
AU - He, Lang
AU - Pei, Ercheng
N1 - Publisher Copyright:
© 2022 Elsevier B.V.
PY - 2022/11/28
Y1 - 2022/11/28
N2 - The Dynamic Saliency Prediction (DSP) task simulates the human selective attention mechanism to perceive a dynamic scene, which is significant and imperative in many vision tasks. Most existing methods consider only visual cues, neglecting the accompanying audio information, which can provide complementary information for scene understanding. Indeed, there exists a strong relation between auditory and visual cues, and humans generally perceive the surrounding scene by collaboratively sensing these cues. Motivated by this, an audio–visual collaborative representation learning method is proposed for the DSP task, which explores the implicit knowledge in the audio modality to better predict the dynamic saliency map by assisting the visual modality. The proposed method consists of three parts: (1) audio–visual encoding, (2) audio–visual localization, and (3) collaborative integration. First, a refined SoundNet architecture is adopted to encode the audio modality and obtain the corresponding features, and a modified 3D ResNet-50 architecture is employed to learn visual features containing both spatial location and temporal motion information. Second, an audio–visual localization part is devised to locate the sounding salient object in the visual scene by learning the correspondence between audio and visual information. Third, a collaborative integration part is devised to adaptively aggregate the audio–visual information and a center-bias prior to generate the final saliency map. Extensive experiments are conducted on six challenging audio–visual eye-tracking datasets, namely DIEM, AVAD, Coutrot1, Coutrot2, SumMe, and ETMD, and the results show the significant superiority of the proposed method over state-of-the-art DSP models.
AB - The Dynamic Saliency Prediction (DSP) task simulates the human selective attention mechanism to perceive a dynamic scene, which is significant and imperative in many vision tasks. Most existing methods consider only visual cues, neglecting the accompanying audio information, which can provide complementary information for scene understanding. Indeed, there exists a strong relation between auditory and visual cues, and humans generally perceive the surrounding scene by collaboratively sensing these cues. Motivated by this, an audio–visual collaborative representation learning method is proposed for the DSP task, which explores the implicit knowledge in the audio modality to better predict the dynamic saliency map by assisting the visual modality. The proposed method consists of three parts: (1) audio–visual encoding, (2) audio–visual localization, and (3) collaborative integration. First, a refined SoundNet architecture is adopted to encode the audio modality and obtain the corresponding features, and a modified 3D ResNet-50 architecture is employed to learn visual features containing both spatial location and temporal motion information. Second, an audio–visual localization part is devised to locate the sounding salient object in the visual scene by learning the correspondence between audio and visual information. Third, a collaborative integration part is devised to adaptively aggregate the audio–visual information and a center-bias prior to generate the final saliency map. Extensive experiments are conducted on six challenging audio–visual eye-tracking datasets, namely DIEM, AVAD, Coutrot1, Coutrot2, SumMe, and ETMD, and the results show the significant superiority of the proposed method over state-of-the-art DSP models.
KW - Audio–visual
KW - Collaborative representation learning
KW - Dynamic Saliency Prediction
KW - Knowledge representation
KW - Multi-modal
UR - http://www.scopus.com/inward/record.url?scp=85138454538&partnerID=8YFLogxK
U2 - 10.1016/j.knosys.2022.109675
DO - 10.1016/j.knosys.2022.109675
M3 - Article
AN - SCOPUS:85138454538
SN - 0950-7051
VL - 256
JO - Knowledge-Based Systems
JF - Knowledge-Based Systems
M1 - 109675
ER -