Video captioning with tube features

Bin Zhao, Xuelong Li, Xiaoqiang Lu

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

31 citations (Scopus)

Abstract

Visual features play an important role in video captioning. Because video content is mainly composed of the activities of salient objects, the caption quality of current approaches is limited: they focus on global frame features and pay less attention to the salient objects. To tackle this problem, we design an object-aware feature for video captioning, denoted the tube feature. First, Faster-RCNN is employed to extract object regions in frames, and a tube generation method is developed to connect regions from different frames that belong to the same object. An encoder-decoder architecture is then constructed for caption generation. Specifically, the encoder is a bi-directional LSTM that captures the dynamic information of each tube. The decoder is a single LSTM extended with an attention model, which enables our approach to adaptively attend to the most correlated tubes when generating the caption. We evaluate our approach on two benchmark datasets, MSVD and Charades. The experimental results demonstrate the effectiveness of the tube feature in the video captioning task.
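The abstract does not say how regions are matched across frames, so the following is only a minimal sketch of one plausible tube generation rule, not the authors' method. It assumes greedy linking by box overlap (IoU) between consecutive frames; the names `iou`, `link_tubes`, and `min_iou` are hypothetical, and the hand-written boxes stand in for Faster-RCNN detections.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed detector output
Tube = List[Tuple[int, Box]]             # list of (frame_index, box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_tubes(frames: List[List[Box]], min_iou: float = 0.5) -> List[Tube]:
    """Greedily link per-frame boxes into tubes (assumed rule, for illustration).

    A tube is extended only if its last box comes from the previous frame and
    the best remaining box in the current frame overlaps it by at least min_iou.
    Boxes left unmatched start new tubes.
    """
    tubes: List[Tube] = []
    for t, boxes in enumerate(frames):
        unmatched = list(boxes)
        for tube in tubes:
            last_t, last_box = tube[-1]
            if last_t != t - 1 or not unmatched:
                continue
            best = max(unmatched, key=lambda b: iou(last_box, b))
            if iou(last_box, best) >= min_iou:
                tube.append((t, best))
                unmatched.remove(best)
        # Every box that did not extend an existing tube opens a new one.
        tubes.extend([[(t, b)] for b in unmatched])
    return tubes


# Toy input: one object drifting across three frames, one static object
# that disappears after the second frame.
frames = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],
    [(1, 1, 11, 11), (50, 50, 60, 60)],
    [(2, 2, 12, 12)],
]
tubes = link_tubes(frames)
print([len(tube) for tube in tubes])  # prints [3, 2]
```

In a pipeline like the one the abstract describes, each resulting tube would supply a sequence of region features to the bi-directional LSTM encoder, and the decoder's attention would then weight the encoded tubes at every word.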

Original language: English
Title of host publication: Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI 2018
Editor: Jerome Lang
Publisher: International Joint Conferences on Artificial Intelligence
Pages: 1177-1183
Number of pages: 7
ISBN (electronic): 9780999241127
DOI
Publication status: Published - 2018
Event: 27th International Joint Conference on Artificial Intelligence, IJCAI 2018 - Stockholm, Sweden
Duration: 13 Jul 2018 to 19 Jul 2018

Publication series

Name: IJCAI International Joint Conference on Artificial Intelligence
Volume: 2018-July
ISSN (print): 1045-0823

Conference

Conference: 27th International Joint Conference on Artificial Intelligence, IJCAI 2018
Country/Territory: Sweden
City: Stockholm
Period: 13/07/18 to 19/07/18
