TY - JOUR
T1 - Audiovisual Video Summarization
AU - Zhao, Bin
AU - Gong, Maoguo
AU - Li, Xuelong
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2023/8/1
Y1 - 2023/8/1
N2 - Audio and vision are the two main modalities in video data. Multimodal learning, especially audiovisual learning, has drawn considerable attention recently and can boost the performance of various computer vision tasks. However, in video summarization, most existing approaches exploit only the visual information and neglect the audio information. In this brief, we argue that the audio modality can assist the vision modality in better understanding the video content and structure and can further benefit the summarization process. Motivated by this, we propose to jointly exploit the audio and visual information for the video summarization task and develop an audiovisual recurrent network (AVRN) to achieve this. Specifically, the proposed AVRN can be separated into three parts: 1) a two-stream long short-term memory (LSTM) network encodes the audio and visual features sequentially, capturing their temporal dependency; 2) an audiovisual fusion LSTM fuses the two modalities by exploring the latent consistency between them; and 3) a self-attention video encoder captures the global dependency in the video. Finally, the fused audiovisual information and the integrated temporal and global dependencies are jointly used to predict the video summary. Experimental results on two benchmarks, i.e., SumMe and TVSum, demonstrate the effectiveness of each part and the superiority of AVRN over approaches that exploit only visual information for video summarization.
AB - Audio and vision are the two main modalities in video data. Multimodal learning, especially audiovisual learning, has drawn considerable attention recently and can boost the performance of various computer vision tasks. However, in video summarization, most existing approaches exploit only the visual information and neglect the audio information. In this brief, we argue that the audio modality can assist the vision modality in better understanding the video content and structure and can further benefit the summarization process. Motivated by this, we propose to jointly exploit the audio and visual information for the video summarization task and develop an audiovisual recurrent network (AVRN) to achieve this. Specifically, the proposed AVRN can be separated into three parts: 1) a two-stream long short-term memory (LSTM) network encodes the audio and visual features sequentially, capturing their temporal dependency; 2) an audiovisual fusion LSTM fuses the two modalities by exploring the latent consistency between them; and 3) a self-attention video encoder captures the global dependency in the video. Finally, the fused audiovisual information and the integrated temporal and global dependencies are jointly used to predict the video summary. Experimental results on two benchmarks, i.e., SumMe and TVSum, demonstrate the effectiveness of each part and the superiority of AVRN over approaches that exploit only visual information for video summarization.
KW - Audiovisual learning
KW - multimodal learning
KW - recurrent network
KW - video summarization
UR - http://www.scopus.com/inward/record.url?scp=85118555822&partnerID=8YFLogxK
U2 - 10.1109/TNNLS.2021.3119969
DO - 10.1109/TNNLS.2021.3119969
M3 - Article
C2 - 34695009
AN - SCOPUS:85118555822
SN - 2162-237X
VL - 34
SP - 5181
EP - 5188
JO - IEEE Transactions on Neural Networks and Learning Systems
JF - IEEE Transactions on Neural Networks and Learning Systems
IS - 8
ER -