TY - JOUR
T1 - Dilated temporal relational adversarial network for generic video summarization
AU - Zhang, Yujia
AU - Kampffmeyer, Michael
AU - Liang, Xiaodan
AU - Zhang, Dingwen
AU - Tan, Min
AU - Xing, Eric P.
N1 - Publisher Copyright:
© 2019, Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2019/12/1
Y1 - 2019/12/1
AB - The large number of videos appearing every day makes it increasingly critical that key information within videos can be extracted and understood in a very short time. Video summarization, the task of finding the smallest subset of frames that still conveys the whole story of a given video, is thus of great significance for improving the efficiency of video understanding. We propose a novel Dilated Temporal Relational Generative Adversarial Network (DTR-GAN) to achieve frame-level video summarization. Given a video, it selects the set of key frames that contain the most meaningful and compact information. Specifically, DTR-GAN learns a dilated temporal relational generator and a discriminator with a three-player loss in an adversarial manner. A new dilated temporal relation (DTR) unit is introduced to enhance the capture of temporal representations. The generator uses this unit to effectively exploit global multi-scale temporal context to select key frames and to complement the commonly used Bi-LSTM. To ensure that summaries capture enough key video representation from a global perspective, rather than being a trivial, randomly shortened sequence, we present a discriminator that learns to enforce both the information completeness and the compactness of summaries via a three-player loss. The loss comprises the generated summary loss, the random summary loss, and the real summary (ground-truth) loss, which play important roles in regularizing the learned model to produce useful summaries. Comprehensive experiments on three public datasets show the effectiveness of the proposed approach.
KW - Dilated temporal relation
KW - Generative adversarial network
KW - Three-player loss
KW - Video summarization
UR - http://www.scopus.com/inward/record.url?scp=85075739820&partnerID=8YFLogxK
DO - 10.1007/s11042-019-08175-y
M3 - Article
AN - SCOPUS:85075739820
SN - 1380-7501
VL - 78
SP - 35237
EP - 35261
JO - Multimedia Tools and Applications
JF - Multimedia Tools and Applications
IS - 24
ER -