Visual Saliency Transformer

Nian Liu; Ni Zhang; Kaiyuan Wan; Ling Shao; Junwei Han

doi:10.1109/ICCV48922.2021.00468

School of Automation

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

361 citations (Scopus)

Abstract

Existing state-of-the-art saliency detection methods heavily rely on CNN-based architectures. Alternatively, we rethink this task from a convolution-free sequence-to-sequence perspective and predict saliency by modeling long-range dependencies, which can not be achieved by convolution. Specifically, we develop a novel unified model based on a pure transformer, namely, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD). It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches. Unlike conventional architectures used in Vision Transformer (ViT), we leverage multi-level token fusion and propose a new token upsampling method under the transformer framework to get high-resolution detection results. We also develop a token-based multi-task decoder to simultaneously perform saliency and boundary detection by introducing task-related tokens and a novel patch-task-attention mechanism. Experimental results show that our model outperforms existing methods on both RGB and RGB-D SOD benchmark datasets. Most importantly, our whole framework not only provides a new perspective for the SOD field but also shows a new paradigm for transformer-based dense prediction models. Code is available at https://github.com/nnizhang/VST.

Original language	English
Title of host publication	Proceedings - 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	4702-4712
Number of pages	11
ISBN (electronic)	9781665428125
DOI	https://doi.org/10.1109/ICCV48922.2021.00468
Publication status	Published - 2021
Event	18th IEEE/CVF International Conference on Computer Vision, ICCV 2021 - Virtual, Online, Canada; Duration: 11 Oct 2021 → 17 Oct 2021

Publication series

Name	Proceedings of the IEEE International Conference on Computer Vision
ISSN (print)	1550-5499

Conference

Conference	18th IEEE/CVF International Conference on Computer Vision, ICCV 2021
Country/Territory	Canada
City	Virtual, Online
Period	11 Oct 2021 → 17 Oct 2021

Cite this

@inproceedings{d8ec720d695945d38e0d0cbee24dc380,

title = "Visual Saliency Transformer",

abstract = "Existing state-of-the-art saliency detection methods heavily rely on CNN-based architectures. Alternatively, we rethink this task from a convolution-free sequence-to-sequence perspective and predict saliency by modeling long-range dependencies, which can not be achieved by convolution. Specifically, we develop a novel unified model based on a pure transformer, namely, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD). It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches. Unlike conventional architectures used in Vision Transformer (ViT), we leverage multi-level token fusion and propose a new token upsampling method under the transformer framework to get high-resolution detection results. We also develop a token-based multi-task decoder to simultaneously perform saliency and boundary detection by introducing task-related tokens and a novel patch-task-attention mechanism. Experimental results show that our model outperforms existing methods on both RGB and RGB-D SOD benchmark datasets. Most importantly, our whole framework not only provides a new perspective for the SOD field but also shows a new paradigm for transformer-based dense prediction models. Code is available at https://github.com/nnizhang/VST.",

author = "Nian Liu and Ni Zhang and Kaiyuan Wan and Ling Shao and Junwei Han",

note = "Publisher Copyright: {\textcopyright} 2021 IEEE; 18th IEEE/CVF International Conference on Computer Vision, ICCV 2021 ; Conference date: 11-10-2021 Through 17-10-2021",

year = "2021",

doi = "10.1109/ICCV48922.2021.00468",

language = "English",

series = "Proceedings of the IEEE International Conference on Computer Vision",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "4702--4712",

booktitle = "Proceedings - 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021",

}

Liu, N, Zhang, N, Wan, K, Shao, L & Han, J 2021, Visual Saliency Transformer. in Proceedings - 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021. Proceedings of the IEEE International Conference on Computer Vision, Institute of Electrical and Electronics Engineers Inc., pp. 4702-4712, 18th IEEE/CVF International Conference on Computer Vision, ICCV 2021, Virtual, Online, Canada, 11/10/21. https://doi.org/10.1109/ICCV48922.2021.00468

Visual Saliency Transformer. / Liu, Nian; Zhang, Ni; Wan, Kaiyuan et al.
Proceedings - 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021. Institute of Electrical and Electronics Engineers Inc., 2021. pp. 4702-4712 (Proceedings of the IEEE International Conference on Computer Vision).

TY - GEN

T1 - Visual Saliency Transformer

AU - Liu, Nian

AU - Zhang, Ni

AU - Wan, Kaiyuan

AU - Shao, Ling

AU - Han, Junwei

PY - 2021

Y1 - 2021

AB - Existing state-of-the-art saliency detection methods heavily rely on CNN-based architectures. Alternatively, we rethink this task from a convolution-free sequence-to-sequence perspective and predict saliency by modeling long-range dependencies, which can not be achieved by convolution. Specifically, we develop a novel unified model based on a pure transformer, namely, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD). It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches. Unlike conventional architectures used in Vision Transformer (ViT), we leverage multi-level token fusion and propose a new token upsampling method under the transformer framework to get high-resolution detection results. We also develop a token-based multi-task decoder to simultaneously perform saliency and boundary detection by introducing task-related tokens and a novel patch-task-attention mechanism. Experimental results show that our model outperforms existing methods on both RGB and RGB-D SOD benchmark datasets. Most importantly, our whole framework not only provides a new perspective for the SOD field but also shows a new paradigm for transformer-based dense prediction models. Code is available at https://github.com/nnizhang/VST.

UR - http://www.scopus.com/inward/record.url?scp=85127809713&partnerID=8YFLogxK

U2 - 10.1109/ICCV48922.2021.00468

DO - 10.1109/ICCV48922.2021.00468

M3 - Conference contribution

AN - SCOPUS:85127809713

T3 - Proceedings of the IEEE International Conference on Computer Vision

SP - 4702

EP - 4712

BT - Proceedings - 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 18th IEEE/CVF International Conference on Computer Vision, ICCV 2021

Y2 - 11 October 2021 through 17 October 2021

ER -
