Video captioning with tube features

Bin Zhao; Xuelong Li; Xiaoqiang Lu

doi:10.24963/ijcai.2018/164

Institute of Optoelectronics and Intelligence (光电与智能研究院)

CAS - Xi'an Institute of Optics and Precision Mechanics

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

31 Citations (Scopus)

Abstract

Visual features play an important role in the video captioning task. Video content is mainly composed of the activities of salient objects, yet current approaches focus on global frame features and pay less attention to those objects, which restricts caption quality. To tackle this problem, we design an object-aware feature for video captioning, denoted as the tube feature. First, Faster-RCNN is employed to extract object regions in frames, and a tube generation method is developed to connect regions from different frames that belong to the same object. An encoder-decoder architecture is then constructed for video caption generation. Specifically, the encoder is a bi-directional LSTM, which captures the dynamic information of each tube. The decoder is a single LSTM extended with an attention model, which enables our approach to adaptively attend to the most correlated tubes when generating the caption. We evaluate our approach on two benchmark datasets: MSVD and Charades. The experimental results demonstrate the effectiveness of the tube feature in the video captioning task.
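The abstract does not say how regions are matched across frames, so the following is only a minimal sketch of one plausible tube-generation step, assuming greedy linking by intersection-over-union (IoU) between boxes in consecutive frames. The names `iou`, `link_tubes`, and the `min_iou` threshold are hypothetical and not taken from the paper; the per-frame boxes stand in for Faster-RCNN detections.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Tube = List[Tuple[int, Box]]             # list of (frame_index, box)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_tubes(frames: List[List[Box]], min_iou: float = 0.5) -> List[Tube]:
    """Greedily link per-frame boxes into tubes (assumed rule, not the paper's).

    A box extends the tube whose last box, taken from the previous frame,
    overlaps it most, provided that overlap reaches min_iou. Otherwise the
    box starts a new tube.
    """
    tubes: List[Tube] = []
    for t, boxes in enumerate(frames):
        # Only tubes that ended in the previous frame may be extended.
        open_tubes = [tube for tube in tubes if tube[-1][0] == t - 1]
        for box in boxes:
            best, best_iou = None, min_iou
            for tube in open_tubes:
                score = iou(tube[-1][1], box)
                if score >= best_iou:
                    best, best_iou = tube, score
            if best is None:
                tubes.append([(t, box)])
            else:
                best.append((t, box))
                # Each tube takes at most one box per frame.
                open_tubes = [tb for tb in open_tubes if tb is not best]
    return tubes


# Toy detections: two objects in frames 0-1, one of them still visible in frame 2.
frames = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],
    [(1, 1, 11, 11), (51, 50, 61, 60)],
    [(2, 2, 12, 12)],
]
tubes = link_tubes(frames)
```

In an encoder-decoder pipeline like the one described, the region features along each resulting tube would then be fed to the bi-directional LSTM encoder, one sequence per tube.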

Original language	English
Title of host publication	Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI 2018
Editors	Jerome Lang
Publisher	International Joint Conferences on Artificial Intelligence
Pages	1177-1183
Number of pages	7
ISBN (Electronic)	9780999241127
DOI	https://doi.org/10.24963/ijcai.2018/164
Publication status	Published - 2018
Event	27th International Joint Conference on Artificial Intelligence, IJCAI 2018 - Stockholm, Sweden. Duration: 13 Jul 2018 → 19 Jul 2018

Publication series

Name	IJCAI International Joint Conference on Artificial Intelligence
Volume	2018-July
ISSN (Print)	1045-0823

Conference

Conference	27th International Joint Conference on Artificial Intelligence, IJCAI 2018
Country/Territory	Sweden
City	Stockholm
Period	13/07/18 → 19/07/18

Cite this

Zhao, B., Li, X., & Lu, X. (2018). Video captioning with tube features. In J. Lang (Ed.), Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI 2018 (pp. 1177-1183). (IJCAI International Joint Conference on Artificial Intelligence; Vol. 2018-July). International Joint Conferences on Artificial Intelligence. https://doi.org/10.24963/ijcai.2018/164

@inproceedings{11209797d8fa4d8fa0d66491ddb1449b,

title = "Video captioning with tube features",

abstract = "Visual feature plays an important role in the video captioning task. Considering that the video content is mainly composed of the activities of salient objects, it has restricted the caption quality of current approaches which just focus on global frame features while paying less attention to the salient objects. To tackle this problem, in this paper, we design an object-aware feature for video captioning, denoted as tube feature. Firstly, Faster-RCNN is employed to extract object regions in frames, and a tube generation method is developed to connect the regions from different frames but belonging to the same object. After that, an encoder-decoder architecture is constructed for video caption generation. Specifically, the encoder is a bi-directional LSTM, which is utilized to capture the dynamic information of each tube. The decoder is a single LSTM extended with an attention model, which enables our approach to adaptively attend to the most correlated tubes when generating the caption. We evaluate our approach on two benchmark datasets: MSVD and Charades. The experimental results have demonstrated the effectiveness of tube feature in the video captioning task.",

author = "Bin Zhao and Xuelong Li and Xiaoqiang Lu",

note = "Publisher Copyright: {\textcopyright} 2018 International Joint Conferences on Artificial Intelligence. All rights reserved.; 27th International Joint Conference on Artificial Intelligence, IJCAI 2018 ; Conference date: 13-07-2018 Through 19-07-2018",

year = "2018",

doi = "10.24963/ijcai.2018/164",

language = "English",

series = "IJCAI International Joint Conference on Artificial Intelligence",

publisher = "International Joint Conferences on Artificial Intelligence",

pages = "1177--1183",

editor = "Jerome Lang",

booktitle = "Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI 2018",

}

Zhao, B, Li, X & Lu, X 2018, Video captioning with tube features. in J Lang (ed.), Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI 2018. IJCAI International Joint Conference on Artificial Intelligence, vol. 2018-July, International Joint Conferences on Artificial Intelligence, pp. 1177-1183, 27th International Joint Conference on Artificial Intelligence, IJCAI 2018, Stockholm, Sweden, 13/07/18. https://doi.org/10.24963/ijcai.2018/164

Video captioning with tube features. / Zhao, Bin; Li, Xuelong; Lu, Xiaoqiang.
Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI 2018. ed. / Jerome Lang. International Joint Conferences on Artificial Intelligence, 2018. p. 1177-1183 (IJCAI International Joint Conference on Artificial Intelligence; Vol. 2018-July).

TY - GEN

T1 - Video captioning with tube features

AU - Zhao, Bin

AU - Li, Xuelong

AU - Lu, Xiaoqiang

PY - 2018

Y1 - 2018

N2 - Visual feature plays an important role in the video captioning task. Considering that the video content is mainly composed of the activities of salient objects, it has restricted the caption quality of current approaches which just focus on global frame features while paying less attention to the salient objects. To tackle this problem, in this paper, we design an object-aware feature for video captioning, denoted as tube feature. Firstly, Faster-RCNN is employed to extract object regions in frames, and a tube generation method is developed to connect the regions from different frames but belonging to the same object. After that, an encoder-decoder architecture is constructed for video caption generation. Specifically, the encoder is a bi-directional LSTM, which is utilized to capture the dynamic information of each tube. The decoder is a single LSTM extended with an attention model, which enables our approach to adaptively attend to the most correlated tubes when generating the caption. We evaluate our approach on two benchmark datasets: MSVD and Charades. The experimental results have demonstrated the effectiveness of tube feature in the video captioning task.

AB - Visual feature plays an important role in the video captioning task. Considering that the video content is mainly composed of the activities of salient objects, it has restricted the caption quality of current approaches which just focus on global frame features while paying less attention to the salient objects. To tackle this problem, in this paper, we design an object-aware feature for video captioning, denoted as tube feature. Firstly, Faster-RCNN is employed to extract object regions in frames, and a tube generation method is developed to connect the regions from different frames but belonging to the same object. After that, an encoder-decoder architecture is constructed for video caption generation. Specifically, the encoder is a bi-directional LSTM, which is utilized to capture the dynamic information of each tube. The decoder is a single LSTM extended with an attention model, which enables our approach to adaptively attend to the most correlated tubes when generating the caption. We evaluate our approach on two benchmark datasets: MSVD and Charades. The experimental results have demonstrated the effectiveness of tube feature in the video captioning task.

UR - http://www.scopus.com/inward/record.url?scp=85055698564&partnerID=8YFLogxK

U2 - 10.24963/ijcai.2018/164

DO - 10.24963/ijcai.2018/164

M3 - Conference contribution

AN - SCOPUS:85055698564

T3 - IJCAI International Joint Conference on Artificial Intelligence

SP - 1177

EP - 1183

BT - Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI 2018

A2 - Lang, Jerome

PB - International Joint Conferences on Artificial Intelligence

T2 - 27th International Joint Conference on Artificial Intelligence, IJCAI 2018

Y2 - 13 July 2018 through 19 July 2018

ER -
