TY - JOUR
T1 - Rectify ViT Shortcut Learning by Visual Saliency
AU - Ma, Chong
AU - Zhao, Lin
AU - Chen, Yuzhong
AU - Guo, Lei
AU - Zhang, Tuo
AU - Hu, Xintao
AU - Shen, Dinggang
AU - Jiang, Xi
AU - Liu, Tianming
N1 - Publisher Copyright:
© 2012 IEEE.
PY - 2024
Y1 - 2024
N2 - Shortcut learning in deep learning models occurs when unintended features are prioritized, resulting in degenerated feature representations and reduced generalizability and interpretability. However, shortcut learning in the widely used vision transformer (ViT) framework remains largely unexplored. Meanwhile, introducing domain-specific knowledge is a major approach to rectifying shortcuts that are dominated by background-related factors. For example, eye-gaze data from radiologists are an effective form of human visual prior knowledge with great potential to guide deep learning models to focus on meaningful foreground regions. However, obtaining eye-gaze data can be time-consuming, labor-intensive, and sometimes impractical. In this work, we propose a novel and effective saliency-guided ViT (SGT) model to rectify shortcut learning in ViT in the absence of eye-gaze data. Specifically, a computational visual saliency model (either pretrained or fine-tuned) is adopted to predict saliency maps for input image samples. The saliency maps are then used to filter the most informative image patches. Because this filtering operation may lead to global information loss, we further introduce a residual connection that computes self-attention across all image patches. Experimental results on natural and medical image datasets show that our SGT framework can effectively learn and leverage human prior knowledge without eye-gaze data and achieves much better performance than the baselines. Meanwhile, it successfully rectifies harmful shortcut learning and significantly improves the interpretability of the ViT model, demonstrating the promise of transferring visual saliency derived from human prior knowledge to rectify shortcut learning.
AB - Shortcut learning in deep learning models occurs when unintended features are prioritized, resulting in degenerated feature representations and reduced generalizability and interpretability. However, shortcut learning in the widely used vision transformer (ViT) framework remains largely unexplored. Meanwhile, introducing domain-specific knowledge is a major approach to rectifying shortcuts that are dominated by background-related factors. For example, eye-gaze data from radiologists are an effective form of human visual prior knowledge with great potential to guide deep learning models to focus on meaningful foreground regions. However, obtaining eye-gaze data can be time-consuming, labor-intensive, and sometimes impractical. In this work, we propose a novel and effective saliency-guided ViT (SGT) model to rectify shortcut learning in ViT in the absence of eye-gaze data. Specifically, a computational visual saliency model (either pretrained or fine-tuned) is adopted to predict saliency maps for input image samples. The saliency maps are then used to filter the most informative image patches. Because this filtering operation may lead to global information loss, we further introduce a residual connection that computes self-attention across all image patches. Experimental results on natural and medical image datasets show that our SGT framework can effectively learn and leverage human prior knowledge without eye-gaze data and achieves much better performance than the baselines. Meanwhile, it successfully rectifies harmful shortcut learning and significantly improves the interpretability of the ViT model, demonstrating the promise of transferring visual saliency derived from human prior knowledge to rectify shortcut learning.
KW - Interpretability
KW - saliency
KW - shortcut learning
KW - vision transformer (ViT)
UR - http://www.scopus.com/inward/record.url?scp=85171803934&partnerID=8YFLogxK
U2 - 10.1109/TNNLS.2023.3310531
DO - 10.1109/TNNLS.2023.3310531
M3 - Article
AN - SCOPUS:85171803934
SN - 2162-237X
VL - 35
SP - 18013
EP - 18025
JO - IEEE Transactions on Neural Networks and Learning Systems
JF - IEEE Transactions on Neural Networks and Learning Systems
IS - 12
ER -