VST++: Efficient and Stronger Visual Saliency Transformer

Nian Liu; Ziyang Luo; Ni Zhang; Junwei Han

doi:10.1109/TPAMI.2024.3388153

VST++: Efficient and Stronger Visual Saliency Transformer

Nian Liu, Ziyang Luo, Ni Zhang, Junwei Han

School of Automation

Research output: Contribution to journal › Article › peer-review

17 Scopus citations

Abstract

While previous CNN-based models have exhibited promising results for salient object detection (SOD), their ability to explore global long-range dependencies is restricted. Our previous work, the Visual Saliency Transformer (VST), addressed this constraint from a transformer-based sequence-to-sequence perspective, to unify RGB and RGB-D SOD. In VST, we developed a multi-task transformer decoder that concurrently predicts saliency and boundary outcomes in a pure transformer architecture. Moreover, we introduced a novel token upsampling method called reverse T2T for predicting a high-resolution saliency map effortlessly within transformer-based structures. Building upon the VST model, we further propose an efficient and stronger VST version in this work, i.e.VST++. To mitigate the computational costs of the VST model, we propose a Select-Integrate Attention (SIA) module, partitioning foreground into fine-grained segments and aggregating background information into a single coarse-grained token. To incorporate 3D depth information with low cost, we design a novel depth position encoding method tailored for depth maps. Furthermore, we introduce a token-supervised prediction loss to provide straightforward guidance for the task-related tokens. We evaluate our VST++ model across various transformer-based backbones on RGB, RGB-D, and RGB-T SOD benchmark datasets. Experimental results show that our model outperforms existing methods while achieving a 25% reduction in computational costs without significant performance compromise. The demonstrated strong ability for generalization, enhanced performance, and heightened efficiency of our VST++ model highlight its potential.

Original language	English
Pages (from-to)	7300-7316
Number of pages	17
Journal	IEEE Transactions on Pattern Analysis and Machine Intelligence
Volume	46
Issue number	11
DOIs	https://doi.org/10.1109/TPAMI.2024.3388153
State	Published - 2024

Keywords

Multi-task learning
RGB-D saliency detection
RGB-T saliency detection
saliency detection
transformer

Access to Document

10.1109/TPAMI.2024.3388153

Cite this

@article{351ac5c7fb884be0ba0e327dfc07abb8,

title = "VST++: Efficient and Stronger Visual Saliency Transformer",

abstract = "While previous CNN-based models have exhibited promising results for salient object detection (SOD), their ability to explore global long-range dependencies is restricted. Our previous work, the Visual Saliency Transformer (VST), addressed this constraint from a transformer-based sequence-to-sequence perspective, to unify RGB and RGB-D SOD. In VST, we developed a multi-task transformer decoder that concurrently predicts saliency and boundary outcomes in a pure transformer architecture. Moreover, we introduced a novel token upsampling method called reverse T2T for predicting a high-resolution saliency map effortlessly within transformer-based structures. Building upon the VST model, we further propose an efficient and stronger VST version in this work, i.e.VST++. To mitigate the computational costs of the VST model, we propose a Select-Integrate Attention (SIA) module, partitioning foreground into fine-grained segments and aggregating background information into a single coarse-grained token. To incorporate 3D depth information with low cost, we design a novel depth position encoding method tailored for depth maps. Furthermore, we introduce a token-supervised prediction loss to provide straightforward guidance for the task-related tokens. We evaluate our VST++ model across various transformer-based backbones on RGB, RGB-D, and RGB-T SOD benchmark datasets. Experimental results show that our model outperforms existing methods while achieving a 25% reduction in computational costs without significant performance compromise. The demonstrated strong ability for generalization, enhanced performance, and heightened efficiency of our VST++ model highlight its potential.",

keywords = "Multi-task learning, RGB-D saliency detection, RGB-T saliency detection, saliency detection, transformer",

author = "Nian Liu and Ziyang Luo and Ni Zhang and Junwei Han",

note = "Publisher Copyright: {\textcopyright} 1979-2012 IEEE.",

year = "2024",

doi = "10.1109/TPAMI.2024.3388153",

language = "英语",

volume = "46",

pages = "7300--7316",

journal = "IEEE Transactions on Pattern Analysis and Machine Intelligence",

issn = "0162-8828",

publisher = "IEEE Computer Society",

number = "11",

}

TY - JOUR

T1 - VST++

T2 - Efficient and Stronger Visual Saliency Transformer

AU - Liu, Nian

AU - Luo, Ziyang

AU - Zhang, Ni

AU - Han, Junwei

PY - 2024

Y1 - 2024

N2 - While previous CNN-based models have exhibited promising results for salient object detection (SOD), their ability to explore global long-range dependencies is restricted. Our previous work, the Visual Saliency Transformer (VST), addressed this constraint from a transformer-based sequence-to-sequence perspective, to unify RGB and RGB-D SOD. In VST, we developed a multi-task transformer decoder that concurrently predicts saliency and boundary outcomes in a pure transformer architecture. Moreover, we introduced a novel token upsampling method called reverse T2T for predicting a high-resolution saliency map effortlessly within transformer-based structures. Building upon the VST model, we further propose an efficient and stronger VST version in this work, i.e.VST++. To mitigate the computational costs of the VST model, we propose a Select-Integrate Attention (SIA) module, partitioning foreground into fine-grained segments and aggregating background information into a single coarse-grained token. To incorporate 3D depth information with low cost, we design a novel depth position encoding method tailored for depth maps. Furthermore, we introduce a token-supervised prediction loss to provide straightforward guidance for the task-related tokens. We evaluate our VST++ model across various transformer-based backbones on RGB, RGB-D, and RGB-T SOD benchmark datasets. Experimental results show that our model outperforms existing methods while achieving a 25% reduction in computational costs without significant performance compromise. The demonstrated strong ability for generalization, enhanced performance, and heightened efficiency of our VST++ model highlight its potential.

AB - While previous CNN-based models have exhibited promising results for salient object detection (SOD), their ability to explore global long-range dependencies is restricted. Our previous work, the Visual Saliency Transformer (VST), addressed this constraint from a transformer-based sequence-to-sequence perspective, to unify RGB and RGB-D SOD. In VST, we developed a multi-task transformer decoder that concurrently predicts saliency and boundary outcomes in a pure transformer architecture. Moreover, we introduced a novel token upsampling method called reverse T2T for predicting a high-resolution saliency map effortlessly within transformer-based structures. Building upon the VST model, we further propose an efficient and stronger VST version in this work, i.e.VST++. To mitigate the computational costs of the VST model, we propose a Select-Integrate Attention (SIA) module, partitioning foreground into fine-grained segments and aggregating background information into a single coarse-grained token. To incorporate 3D depth information with low cost, we design a novel depth position encoding method tailored for depth maps. Furthermore, we introduce a token-supervised prediction loss to provide straightforward guidance for the task-related tokens. We evaluate our VST++ model across various transformer-based backbones on RGB, RGB-D, and RGB-T SOD benchmark datasets. Experimental results show that our model outperforms existing methods while achieving a 25% reduction in computational costs without significant performance compromise. The demonstrated strong ability for generalization, enhanced performance, and heightened efficiency of our VST++ model highlight its potential.

KW - Multi-task learning

KW - RGB-D saliency detection

KW - RGB-T saliency detection

KW - saliency detection

KW - transformer

UR - http://www.scopus.com/inward/record.url?scp=85190324833&partnerID=8YFLogxK

U2 - 10.1109/TPAMI.2024.3388153

DO - 10.1109/TPAMI.2024.3388153

M3 - 文章

AN - SCOPUS:85190324833

SN - 0162-8828

VL - 46

SP - 7300

EP - 7316

JO - IEEE Transactions on Pattern Analysis and Machine Intelligence

JF - IEEE Transactions on Pattern Analysis and Machine Intelligence

IS - 11

ER -

VST++: Efficient and Stronger Visual Saliency Transformer

Abstract

Keywords

Access to Document

Other files and links

Fingerprint

Cite this