Video captioning with tube features

Bin Zhao, Xuelong Li, Xiaoqiang Lu

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

31 Scopus citations

Abstract

Visual features play an important role in the video captioning task. Because video content is mainly composed of the activities of salient objects, current approaches, which focus on global frame features and pay less attention to the salient objects, are limited in caption quality. To tackle this problem, we design an object-aware feature for video captioning, denoted the tube feature. First, Faster-RCNN is employed to extract object regions in frames, and a tube generation method is developed to connect the regions that come from different frames but belong to the same object. Then, an encoder-decoder architecture is constructed for video caption generation. Specifically, the encoder is a bi-directional LSTM that captures the dynamic information of each tube. The decoder is a single LSTM extended with an attention model, which enables our approach to adaptively attend to the most correlated tubes when generating the caption. We evaluate our approach on two benchmark datasets: MSVD and Charades. The experimental results demonstrate the effectiveness of the tube feature in the video captioning task.
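
The abstract does not spell out how tubes are generated or how the attention weights are computed, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes boxes are (x1, y1, x2, y2) tuples already produced by a detector, links them across consecutive frames greedily by intersection over union, and normalizes per-tube relevance scores with a softmax. The names iou, link_tubes, attention_weights and the min_iou threshold are hypothetical.

```python
from math import exp


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_tubes(frames, min_iou=0.5):
    """Greedily link per-frame boxes into tubes (assumed linking rule).

    frames: list of lists of boxes, one inner list per frame.
    Returns a list of tubes; each tube is a list of (frame_index, box).
    """
    tubes = []
    for t, boxes in enumerate(frames):
        unmatched = list(boxes)
        for tube in tubes:
            last_t, last_box = tube[-1]
            if last_t != t - 1 or not unmatched:
                continue  # tube ended earlier, or no boxes left to match
            best = max(unmatched, key=lambda b: iou(last_box, b))
            if iou(last_box, best) >= min_iou:
                tube.append((t, best))
                unmatched.remove(best)
        # every box that extended no existing tube starts a new tube
        tubes.extend([[(t, b)] for b in unmatched])
    return tubes


def attention_weights(scores):
    """Softmax over per-tube relevance scores (one score per tube)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

In a full model, each tube's region features would be encoded by the bi-directional LSTM, and the scores passed to attention_weights would come from comparing the decoder state with each encoded tube at every decoding step.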

Original language: English
Title of host publication: Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI 2018
Editors: Jerome Lang
Publisher: International Joint Conferences on Artificial Intelligence
Pages: 1177-1183
Number of pages: 7
ISBN (Electronic): 9780999241127
DOIs
State: Published - 2018
Event: 27th International Joint Conference on Artificial Intelligence, IJCAI 2018 - Stockholm, Sweden
Duration: 13 Jul 2018 to 19 Jul 2018

Publication series

Name: IJCAI International Joint Conference on Artificial Intelligence
Volume: 2018-July
ISSN (Print): 1045-0823

Conference

Conference: 27th International Joint Conference on Artificial Intelligence, IJCAI 2018
Country/Territory: Sweden
City: Stockholm
Period: 13/07/18 to 19/07/18
