MAM-RNN: Multi-level attention model based RNN for video captioning

Xuelong Li, Bin Zhao, Xiaoqiang Lu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review


Abstract

Visual information is crucial for the task of video captioning. However, a video typically contains a great deal of uncorrelated content, which can interfere with generating a correct caption. Motivated by this, we attempt to exploit the visual features that are most correlated with the caption. In this paper, a Multi-level Attention Model based Recurrent Neural Network (MAM-RNN) is proposed, where the MAM encodes the visual features and an RNN serves as the decoder that generates the video caption. During generation, the proposed approach adaptively attends to the salient regions within each frame and to the frames most correlated with the caption. Experimental results on two benchmark datasets, MSVD and Charades, demonstrate the excellent performance of the proposed approach.
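To make the two attention levels concrete, the sketch below implements a region-then-frame soft-attention encoder feeding an LSTM decoder in PyTorch. This is a minimal illustration, not the authors' implementation: the module names, feature dimensions, and scoring functions are hypothetical assumptions, and the paper's exact formulation differs in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelAttention(nn.Module):
    """Illustrative two-level soft attention: weight regions within each
    frame, then weight frames across the video (not the paper's exact MAM)."""
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.region_score = nn.Linear(feat_dim + hidden_dim, 1)
        self.frame_score = nn.Linear(feat_dim + hidden_dim, 1)

    def forward(self, feats, h):
        # feats: (num_frames, num_regions, feat_dim) CNN feature maps
        # h: (hidden_dim,) current decoder hidden state
        T, R, _ = feats.shape
        # Region-level attention: score each region against the decoder state.
        h_r = h.expand(T, R, -1)
        e_r = self.region_score(torch.cat([feats, h_r], dim=-1)).squeeze(-1)  # (T, R)
        a_r = F.softmax(e_r, dim=1)
        frame_feats = (a_r.unsqueeze(-1) * feats).sum(dim=1)                  # (T, feat_dim)
        # Frame-level attention: score each attended frame for the current word.
        h_f = h.expand(T, -1)
        e_f = self.frame_score(torch.cat([frame_feats, h_f], dim=-1)).squeeze(-1)  # (T,)
        a_f = F.softmax(e_f, dim=0)
        return (a_f.unsqueeze(-1) * frame_feats).sum(dim=0)                   # (feat_dim,)

class CaptionDecoder(nn.Module):
    """LSTM decoder that consumes the attended visual context at every step."""
    def __init__(self, vocab_size, feat_dim=512, hidden_dim=256, embed_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = MultiLevelAttention(feat_dim, hidden_dim)
        self.cell = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):
        # feats: (T, R, feat_dim); captions: (L,) token ids (teacher forcing)
        h = feats.new_zeros(1, self.cell.hidden_size)
        c = feats.new_zeros(1, self.cell.hidden_size)
        logits = []
        for tok in captions:
            # Recompute the visual context for each generated word.
            ctx = self.attn(feats, h.squeeze(0))
            x = torch.cat([self.embed(tok.view(1)), ctx.unsqueeze(0)], dim=-1)
            h, c = self.cell(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=0).squeeze(1)  # (L, vocab_size)

# Toy usage: 8 frames, 49 spatial regions (7x7 feature map), 512-d features.
feats = torch.randn(8, 49, 512)
caption = torch.tensor([1, 5, 9, 2])
model = CaptionDecoder(vocab_size=100)
print(model(feats, caption).shape)  # torch.Size([4, 100])
```

The loop recomputes the attended context at every decoding step, which mirrors the abstract's claim that the model adaptively attends to salient regions and correlated frames during generation; the paper's actual attention scoring and state updates should be taken from the publication itself.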

Original language: English
Title of host publication: 26th International Joint Conference on Artificial Intelligence, IJCAI 2017
Editors: Carles Sierra
Publisher: International Joint Conferences on Artificial Intelligence
Pages: 2208-2214
Number of pages: 7
ISBN (Electronic): 9780999241103
State: Published - 2017
Event: 26th International Joint Conference on Artificial Intelligence, IJCAI 2017 - Melbourne, Australia
Duration: 19 Aug 2017 - 25 Aug 2017

Publication series

Name: IJCAI International Joint Conference on Artificial Intelligence
Volume: 0
ISSN (Print): 1045-0823

Conference

Conference: 26th International Joint Conference on Artificial Intelligence, IJCAI 2017
Country/Territory: Australia
City: Melbourne
Period: 19/08/17 - 25/08/17
