Attentive Linear Transformation for Image Captioning

Senmao Ye, Junwei Han, Nian Liu

Research output: Contribution to journal › Article › peer-review

70 Citations (Scopus)

Abstract

We propose a novel attention framework called attentive linear transformation (ALT) for the automatic generation of image captions. Instead of learning the spatial or channel-wise attention used in existing models, ALT learns to attend to the high-dimensional transformation matrix from the image feature space to the context vector space. ALT can thus learn various relevant feature abstractions, including spatial attention, channel-wise attention, and visual dependence. In addition, we propose a soft threshold regression to predict the spatial attention probabilities; it preserves more relevant local regions than the popular softmax regression. Extensive experiments on the MS COCO and Flickr30k datasets demonstrate the superiority of our model over other state-of-the-art models.
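The contrast between softmax and a soft threshold for spatial attention can be illustrated with a minimal sketch (an assumption for illustration only, not the formulation from the paper): region scores below a threshold are suppressed to exactly zero and the rest renormalized, so several strongly relevant regions retain noticeable weight while weak ones drop out, whereas softmax always leaks some probability mass to every region.

```python
import numpy as np

def softmax(scores):
    # Standard softmax: every region receives a non-zero weight.
    e = np.exp(scores - scores.max())
    return e / e.sum()

def soft_threshold_attention(scores, tau=0.5):
    # Illustrative soft-threshold weighting (an assumption, not the paper's
    # exact formula): scores below tau are zeroed, the remainder renormalized,
    # so multiple relevant regions are preserved and irrelevant ones removed.
    shifted = np.maximum(scores - tau, 0.0)
    total = shifted.sum()
    if total == 0.0:
        return np.full_like(scores, 1.0 / len(scores))  # fall back to uniform
    return shifted / total

# Hypothetical relevance scores for six image regions.
scores = np.array([2.0, 1.8, 0.3, 0.1, 1.9, 0.2])
print(softmax(scores))                   # smooth weights, mass leaks to weak regions
print(soft_threshold_attention(scores))  # weight concentrated on the strong regions
```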

Original language: English
Article number: 8410621
Pages (from-to): 5514-5524
Number of pages: 11
Journal: IEEE Transactions on Image Processing
Volume: 27
Issue number: 11
DOI
Publication status: Published - Nov 2018
