TY - JOUR
T1 - Audio–visual collaborative representation learning for Dynamic Saliency Prediction
AU - Ning, Hailong
AU - Zhao, Bin
AU - Hu, Zhanxuan
AU - He, Lang
AU - Pei, Ercheng
N1 - Publisher Copyright:
© 2022 Elsevier B.V.
PY - 2022/11/28
Y1 - 2022/11/28
N2 - The Dynamic Saliency Prediction (DSP) task simulates the human selective attention mechanism to perceive a dynamic scene, which is significant and imperative in many vision tasks. Most existing methods consider only visual cues, neglecting the accompanying audio information, which can provide complementary information for scene understanding. Indeed, there exists a strong relation between auditory and visual cues, and humans generally perceive the surrounding scene by collaboratively sensing these cues. Motivated by this, an audio–visual collaborative representation learning method is proposed for the DSP task, which explores the implicit knowledge in the audio modality to better predict the dynamic saliency map by assisting the visual modality. The proposed method consists of three parts: (1) audio–visual encoding, (2) audio–visual localization, and (3) collaborative integration. First, a refined SoundNet architecture is adopted to encode the audio modality and obtain the corresponding features, and a modified 3D ResNet-50 architecture is employed to learn visual features containing both spatial location and temporal motion information. Second, an audio–visual localization part is devised to locate the sounding salient object in the visual scene by learning the correspondence between audio and visual information. Third, a collaborative integration part is devised to adaptively aggregate the audio–visual information and a center-bias prior to generate the final saliency map. Extensive experiments are conducted on six challenging audio–visual eye-tracking datasets, namely DIEM, AVAD, Coutrot1, Coutrot2, SumMe, and ETMD, and the results show the significant superiority of the proposed method over state-of-the-art DSP models.
AB - The Dynamic Saliency Prediction (DSP) task simulates the human selective attention mechanism to perceive a dynamic scene, which is significant and imperative in many vision tasks. Most existing methods consider only visual cues, neglecting the accompanying audio information, which can provide complementary information for scene understanding. Indeed, there exists a strong relation between auditory and visual cues, and humans generally perceive the surrounding scene by collaboratively sensing these cues. Motivated by this, an audio–visual collaborative representation learning method is proposed for the DSP task, which explores the implicit knowledge in the audio modality to better predict the dynamic saliency map by assisting the visual modality. The proposed method consists of three parts: (1) audio–visual encoding, (2) audio–visual localization, and (3) collaborative integration. First, a refined SoundNet architecture is adopted to encode the audio modality and obtain the corresponding features, and a modified 3D ResNet-50 architecture is employed to learn visual features containing both spatial location and temporal motion information. Second, an audio–visual localization part is devised to locate the sounding salient object in the visual scene by learning the correspondence between audio and visual information. Third, a collaborative integration part is devised to adaptively aggregate the audio–visual information and a center-bias prior to generate the final saliency map. Extensive experiments are conducted on six challenging audio–visual eye-tracking datasets, namely DIEM, AVAD, Coutrot1, Coutrot2, SumMe, and ETMD, and the results show the significant superiority of the proposed method over state-of-the-art DSP models.
KW - Audio–visual
KW - Collaborative representation learning
KW - Dynamic Saliency Prediction
KW - Knowledge representation
KW - Multi-modal
UR - http://www.scopus.com/inward/record.url?scp=85138454538&partnerID=8YFLogxK
U2 - 10.1016/j.knosys.2022.109675
DO - 10.1016/j.knosys.2022.109675
M3 - Article
AN - SCOPUS:85138454538
SN - 0950-7051
VL - 256
JO - Knowledge-Based Systems
JF - Knowledge-Based Systems
M1 - 109675
ER -