Video Captioning with Semantic Guiding

Jin Yuan, Chunna Tian, Xiangnan Zhang, Yuxuan Ding, Wei Wei

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review


Abstract

Video captioning is the task of generating natural-language descriptions of videos. Most existing approaches adopt the encoder-decoder architecture and exploit different kinds of visual features, such as temporal features and motion features, but they neglect the abundant semantic information in the video. To address this issue, we propose a framework, named Semantic Guiding Long Short-Term Memory (SG-LSTM), that jointly explores visual features and semantic attributes. The proposed SG-LSTM has two semantic guiding layers, both of which use three types of semantic attributes (global, object, and verb semantics) to guide the language model toward the most relevant representation when generating sentences. We evaluate our method on the publicly available and challenging Youtube2Text dataset. Experimental results show that our framework outperforms state-of-the-art methods.
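The abstract describes the architecture only at a high level, and the paper's code is not reproduced in this record. The sketch below is an illustrative reading of the idea, not the authors' implementation: it assumes a PyTorch-style decoder in which the three attribute vectors (global, object, verb) produce a sigmoid gate over the visual representation fed to an LSTM cell. All names and sizes (SemanticGuidingCell, word_dim, sem_dim, the vocabulary size) are hypothetical, and the paper's two guiding layers are collapsed into a single gating step for brevity.

import torch
import torch.nn as nn

class SemanticGuidingCell(nn.Module):
    """Illustrative single-step decoder cell; not the authors' SG-LSTM code."""
    def __init__(self, word_dim=300, feat_dim=512, sem_dim=300,
                 hidden=512, vocab=10000):
        super().__init__()
        # Fuse the three assumed attribute vectors into one gating signal.
        self.sem_proj = nn.Linear(3 * sem_dim, hidden)
        self.vis_proj = nn.Linear(feat_dim, hidden)
        self.cell = nn.LSTMCell(word_dim + hidden, hidden)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, word_emb, vis_feat, sem_global, sem_obj, sem_verb, state):
        # Gate in [0, 1] decides how much of the visual representation
        # reaches the language model at this decoding step.
        sem = torch.cat([sem_global, sem_obj, sem_verb], dim=-1)
        gate = torch.sigmoid(self.sem_proj(sem))
        guided = gate * torch.tanh(self.vis_proj(vis_feat))
        h, c = self.cell(torch.cat([word_emb, guided], dim=-1), state)
        return self.out(h), (h, c)

# One decoding step on random tensors (batch size 2), just to show the shapes.
cell = SemanticGuidingCell()
state = (torch.zeros(2, 512), torch.zeros(2, 512))
logits, state = cell(torch.randn(2, 300), torch.randn(2, 512),
                     torch.randn(2, 300), torch.randn(2, 300),
                     torch.randn(2, 300), state)
print(logits.shape)  # torch.Size([2, 10000])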

Original language: English
Title of host publication: 2018 IEEE 4th International Conference on Multimedia Big Data, BigMM 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781538653210
DOIs
State: Published - 18 Oct 2018
Event: 4th IEEE International Conference on Multimedia Big Data, BigMM 2018 - Xi'an, China
Duration: 13 Sep 2018 - 16 Sep 2018

Publication series

Name: 2018 IEEE 4th International Conference on Multimedia Big Data, BigMM 2018

Conference

Conference: 4th IEEE International Conference on Multimedia Big Data, BigMM 2018
Country/Territory: China
City: Xi'an
Period: 13/09/18 - 16/09/18

Keywords

  • neural network
  • semantic attributes
  • sequence learning
  • video captioning
