Video Captioning with Semantic Guiding

Jin Yuan, Chunna Tian, Xiangnan Zhang, Yuxuan Ding, Wei Wei

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review


Abstract

Video captioning is the task of generating natural-language descriptions of videos. Most existing approaches adopt the encoder-decoder architecture and exploit different kinds of visual features, such as temporal features and motion features, but they neglect the abundant semantic information in the video. To address this issue, we propose a framework, named Semantic Guiding Long Short-Term Memory (SG-LSTM), that jointly explores visual features and semantic attributes. The proposed SG-LSTM has two semantic guiding layers, both of which use three types of semantic attributes (global, object, and verb semantics) to guide the language model toward the most relevant representation when generating sentences. We evaluate our method on the publicly available and challenging Youtube2Text dataset. Experimental results show that our framework outperforms state-of-the-art methods.
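The abstract describes the architecture only at a high level, and the paper's code is not reproduced in this record. The sketch below is an illustrative reading of the idea, not the authors' implementation: it assumes a PyTorch-style decoder in which the three attribute vectors (global, object, verb) produce a sigmoid gate over the visual representation fed to an LSTM cell. All names and sizes (SemanticGuidingCell, word_dim, sem_dim, the vocabulary size) are hypothetical, and the paper's two guiding layers are collapsed into a single gating step for brevity.

import torch
import torch.nn as nn

class SemanticGuidingCell(nn.Module):
    """Illustrative single-step decoder cell; not the authors' SG-LSTM code."""
    def __init__(self, word_dim=300, feat_dim=512, sem_dim=300,
                 hidden=512, vocab=10000):
        super().__init__()
        # Fuse the three assumed attribute vectors into one gating signal.
        self.sem_proj = nn.Linear(3 * sem_dim, hidden)
        self.vis_proj = nn.Linear(feat_dim, hidden)
        self.cell = nn.LSTMCell(word_dim + hidden, hidden)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, word_emb, vis_feat, sem_global, sem_obj, sem_verb, state):
        # Gate in [0, 1] decides how much of the visual representation
        # reaches the language model at this decoding step.
        sem = torch.cat([sem_global, sem_obj, sem_verb], dim=-1)
        gate = torch.sigmoid(self.sem_proj(sem))
        guided = gate * torch.tanh(self.vis_proj(vis_feat))
        h, c = self.cell(torch.cat([word_emb, guided], dim=-1), state)
        return self.out(h), (h, c)

# One decoding step on random tensors (batch size 2), just to show the shapes.
cell = SemanticGuidingCell()
state = (torch.zeros(2, 512), torch.zeros(2, 512))
logits, state = cell(torch.randn(2, 300), torch.randn(2, 512),
                     torch.randn(2, 300), torch.randn(2, 300),
                     torch.randn(2, 300), state)
print(logits.shape)  # torch.Size([2, 10000])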

Original language: English
Title of host publication: 2018 IEEE 4th International Conference on Multimedia Big Data, BigMM 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781538653210
DOIs
State: Published - 18 Oct 2018
Event: 4th IEEE International Conference on Multimedia Big Data, BigMM 2018 - Xi'an, China
Duration: 13 Sep 2018 - 16 Sep 2018

Publication series

Name: 2018 IEEE 4th International Conference on Multimedia Big Data, BigMM 2018

Conference

Conference: 4th IEEE International Conference on Multimedia Big Data, BigMM 2018
Country/Territory: China
City: Xi'an
Period: 13/09/18 - 16/09/18

Keywords

  • neural network
  • semantic attributes
  • sequence learning
  • video captioning
