Audio–visual collaborative representation learning for Dynamic Saliency Prediction

Hailong Ning, Bin Zhao, Zhanxuan Hu, Lang He, Ercheng Pei

Research output: Contribution to journal › Article › peer-review


Abstract

The Dynamic Saliency Prediction (DSP) task simulates the human selective attention mechanism to perceive a dynamic scene, and it is significant in many vision tasks. Most existing methods consider only visual cues, neglecting the accompanying audio information, which can provide complementary cues for scene understanding. Indeed, there is a strong relation between auditory and visual cues, and humans generally perceive their surroundings by sensing these cues collaboratively. Motivated by this, an audio–visual collaborative representation learning method is proposed for the DSP task, which exploits the implicit knowledge in the audio modality to assist the visual modality in predicting the dynamic saliency map. The proposed method consists of three parts: (1) audio–visual encoding, (2) audio–visual localization, and (3) collaborative integration. First, a refined SoundNet architecture is adopted to encode the audio modality, and a modified 3D ResNet-50 architecture is employed to learn visual features containing both spatial location and temporal motion information. Second, an audio–visual localization part is devised to locate the sounding salient object in the visual scene by learning the correspondence between audio and visual information. Third, a collaborative integration part is devised to adaptively aggregate the audio–visual information and a center-bias prior to generate the final saliency map. Extensive experiments on six challenging audio–visual eye-tracking datasets, namely DIEM, AVAD, Coutrot1, Coutrot2, SumMe, and ETMD, demonstrate significant superiority over state-of-the-art DSP models.
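To make the three-part pipeline concrete, the following is a minimal PyTorch sketch of the idea described in the abstract. The toy encoders (standing in for the refined SoundNet and the modified 3D ResNet-50), the feature dimensions, the cosine-similarity localization map, the Gaussian center-bias prior, and the 1x1 fusion layer are all illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AudioVisualSaliencySketch(nn.Module):
        """Hypothetical sketch of the encoding/localization/integration pipeline."""

        def __init__(self, feat_dim=512):
            super().__init__()
            # (1) Encoding: toy stand-ins for the refined SoundNet (audio)
            # and the modified 3D ResNet-50 (visual) named in the abstract.
            self.audio_enc = nn.Sequential(nn.Linear(128, feat_dim), nn.ReLU())
            self.visual_enc = nn.Conv3d(3, feat_dim, kernel_size=3, padding=1)
            # (3) Integration: a 1x1 convolution that adaptively weights the
            # audio-visual localization map against the center-bias prior.
            self.fuse = nn.Conv2d(2, 1, kernel_size=1)

        def forward(self, audio, video):
            # audio: (B, 128) pooled audio features; video: (B, 3, T, H, W) clip.
            a = self.audio_enc(audio)                        # (B, C)
            v = self.visual_enc(video).mean(dim=2)           # (B, C, H, W) after temporal pooling
            # (2) Localization: cosine similarity between the audio embedding
            # and every visual location highlights the sounding salient object.
            a = F.normalize(a, dim=1)[:, :, None, None]
            v = F.normalize(v, dim=1)
            loc = (a * v).sum(dim=1, keepdim=True)           # (B, 1, H, W)
            # Center-bias prior: a fixed isotropic Gaussian over the frame.
            B, _, H, W = loc.shape
            ys = torch.linspace(-1.0, 1.0, H, device=loc.device)
            xs = torch.linspace(-1.0, 1.0, W, device=loc.device)
            yy, xx = torch.meshgrid(ys, xs, indexing="ij")
            center = torch.exp(-(xx ** 2 + yy ** 2) / 0.5).expand(B, 1, H, W)
            # (3) Adaptive aggregation of the localization map and the prior.
            return torch.sigmoid(self.fuse(torch.cat([loc, center], dim=1)))

    model = AudioVisualSaliencySketch()
    saliency = model(torch.randn(2, 128), torch.randn(2, 3, 8, 64, 64))
    print(saliency.shape)  # torch.Size([2, 1, 64, 64])

In the paper the aggregation weights are learned from data; here the 1x1 convolution plays that role in the simplest possible form.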

Original language: English
Article number: 109675
Journal: Knowledge-Based Systems
Volume: 256
State: Published - 28 Nov 2022

Keywords

  • Audio–visual
  • Collaborative representation learning
  • Dynamic Saliency Prediction
  • Knowledge representation
  • Multi-modal
