TY - JOUR
T1 - Mask-Guided Vision Transformer for Few-Shot Learning
AU - Chen, Yuzhong
AU - Xiao, Zhenxiang
AU - Pan, Yi
AU - Zhao, Lin
AU - Dai, Haixing
AU - Wu, Zihao
AU - Li, Changhe
AU - Zhang, Tuo
AU - Li, Changying
AU - Zhu, Dajiang
AU - Liu, Tianming
AU - Jiang, Xi
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2025
Y1 - 2025
AB - Learning with little data is challenging but often inevitable in various application scenarios where labeled data are limited and costly. Recently, few-shot learning (FSL) has gained increasing attention because it generalizes prior knowledge to new tasks that contain only a few samples. However, for data-intensive models such as the vision transformer (ViT), current fine-tuning-based FSL approaches are inefficient in knowledge generalization and thus degrade downstream task performance. In this article, we propose a novel mask-guided ViT (MG-ViT) to achieve effective and efficient FSL with the ViT model. The key idea is to apply a mask to image patches to screen out task-irrelevant ones and to guide the ViT to focus on task-relevant and discriminative patches during FSL. In particular, MG-ViT introduces only an additional mask operation and a residual connection, enabling it to inherit parameters from a pretrained ViT without any other cost. To optimally select representative few-shot samples, we also include an active learning-based sample selection method that further improves the generalizability of MG-ViT-based FSL. We evaluate the proposed MG-ViT on classification, object detection, and segmentation tasks, using gradient-weighted class activation mapping (Grad-CAM) to generate masks. The experimental results show that MG-ViT significantly improves performance and efficiency compared with general fine-tuning-based ViT and ResNet models, providing novel insights and a concrete approach toward generalizing data-intensive and large-scale deep learning models for FSL.
KW - Domain adaptation
KW - few-shot learning (FSL)
KW - mask
KW - vision transformer (ViT)
UR - http://www.scopus.com/inward/record.url?scp=105004263679&partnerID=8YFLogxK
U2 - 10.1109/TNNLS.2024.3418527
DO - 10.1109/TNNLS.2024.3418527
M3 - Article
AN - SCOPUS:105004263679
SN - 2162-237X
VL - 36
SP - 9636
EP - 9647
JO - IEEE Transactions on Neural Networks and Learning Systems
JF - IEEE Transactions on Neural Networks and Learning Systems
IS - 5
ER -