TY - JOUR
T1 - Rectify ViT Shortcut Learning by Visual Saliency
AU - Ma, Chong
AU - Zhao, Lin
AU - Chen, Yuzhong
AU - Guo, Lei
AU - Zhang, Tuo
AU - Hu, Xintao
AU - Shen, Dinggang
AU - Jiang, Xi
AU - Liu, Tianming
N1 - Publisher Copyright:
© 2012 IEEE.
PY - 2024
Y1 - 2024
N2 - Shortcut learning in deep learning models occurs when unintended features are prioritized, resulting in degenerated feature representations and reduced generalizability and interpretability. However, shortcut learning in the widely used vision transformer (ViT) framework remains largely unexplored. Meanwhile, introducing domain-specific knowledge is a major approach to rectifying shortcuts that are dominated by background-related factors. For example, eye-gaze data from radiologists are an effective form of human visual prior knowledge with great potential to guide deep learning models to focus on meaningful foreground regions. However, obtaining eye-gaze data can be time-consuming, labor-intensive, and sometimes impractical. In this work, we propose a novel and effective saliency-guided ViT (SGT) model to rectify shortcut learning in ViT in the absence of eye-gaze data. Specifically, a computational visual saliency model (either pretrained or fine-tuned) is adopted to predict saliency maps for input image samples. The saliency maps are then used to filter the most informative image patches. Because this filtering operation may lead to global information loss, we further introduce a residual connection that computes self-attention across all image patches. Experimental results on natural and medical image datasets show that our SGT framework can effectively learn and leverage human prior knowledge without eye-gaze data and achieves much better performance than the baselines. Meanwhile, it successfully rectifies harmful shortcut learning and significantly improves the interpretability of the ViT model, demonstrating the promise of transferring visual saliency derived from human prior knowledge to rectify shortcut learning.
AB - Shortcut learning in deep learning models occurs when unintended features are prioritized, resulting in degenerated feature representations and reduced generalizability and interpretability. However, shortcut learning in the widely used vision transformer (ViT) framework remains largely unexplored. Meanwhile, introducing domain-specific knowledge is a major approach to rectifying shortcuts that are dominated by background-related factors. For example, eye-gaze data from radiologists are an effective form of human visual prior knowledge with great potential to guide deep learning models to focus on meaningful foreground regions. However, obtaining eye-gaze data can be time-consuming, labor-intensive, and sometimes impractical. In this work, we propose a novel and effective saliency-guided ViT (SGT) model to rectify shortcut learning in ViT in the absence of eye-gaze data. Specifically, a computational visual saliency model (either pretrained or fine-tuned) is adopted to predict saliency maps for input image samples. The saliency maps are then used to filter the most informative image patches. Because this filtering operation may lead to global information loss, we further introduce a residual connection that computes self-attention across all image patches. Experimental results on natural and medical image datasets show that our SGT framework can effectively learn and leverage human prior knowledge without eye-gaze data and achieves much better performance than the baselines. Meanwhile, it successfully rectifies harmful shortcut learning and significantly improves the interpretability of the ViT model, demonstrating the promise of transferring visual saliency derived from human prior knowledge to rectify shortcut learning.
KW - Interpretability
KW - saliency
KW - shortcut learning
KW - vision transformer (ViT)
UR - http://www.scopus.com/inward/record.url?scp=85171803934&partnerID=8YFLogxK
U2 - 10.1109/TNNLS.2023.3310531
DO - 10.1109/TNNLS.2023.3310531
M3 - Article
AN - SCOPUS:85171803934
SN - 2162-237X
VL - 35
SP - 18013
EP - 18025
JO - IEEE Transactions on Neural Networks and Learning Systems
JF - IEEE Transactions on Neural Networks and Learning Systems
IS - 12
ER -