TY - GEN
T1 - Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement
T2 - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
AU - Zhu, Xiangyang
AU - Zhang, Renrui
AU - He, Bowei
AU - Zhou, Aojun
AU - Wang, Dong
AU - Zhao, Bin
AU - Gao, Peng
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - The popularity of Contrastive Language-Image Pretraining (CLIP) has propelled its application to diverse downstream vision tasks. To improve its capacity on downstream tasks, few-shot learning has become a widely-adopted technique. However, existing methods either exhibit limited performance or suffer from excessive learnable parameters. In this paper, we propose APE, an Adaptive Prior rEfinement method for CLIP's pre-trained knowledge, which achieves superior accuracy with high computational efficiency. Via a prior refinement module, we analyze the inter-class disparity in the downstream data and decouple the domain-specific knowledge from the CLIP-extracted cache model. On top of that, we introduce two model variants, a training-free APE and a training-required APE-T. We explore the trilateral affinities between the test image, prior cache model, and textual representations, and only enable a lightweight category-residual module to be trained. For the average accuracy over 11 benchmarks, both APE and APE-T attain state-of-the-art and respectively outperform the second-best by +1.59% and +1.99% under 16 shots with ×30 fewer learnable parameters. Code is available at https://github.com/yangyangyang127/APE.
AB - The popularity of Contrastive Language-Image Pretraining (CLIP) has propelled its application to diverse downstream vision tasks. To improve its capacity on downstream tasks, few-shot learning has become a widely-adopted technique. However, existing methods either exhibit limited performance or suffer from excessive learnable parameters. In this paper, we propose APE, an Adaptive Prior rEfinement method for CLIP's pre-trained knowledge, which achieves superior accuracy with high computational efficiency. Via a prior refinement module, we analyze the inter-class disparity in the downstream data and decouple the domain-specific knowledge from the CLIP-extracted cache model. On top of that, we introduce two model variants, a training-free APE and a training-required APE-T. We explore the trilateral affinities between the test image, prior cache model, and textual representations, and only enable a lightweight category-residual module to be trained. For the average accuracy over 11 benchmarks, both APE and APE-T attain state-of-the-art and respectively outperform the second-best by +1.59% and +1.99% under 16 shots with ×30 fewer learnable parameters. Code is available at https://github.com/yangyangyang127/APE.
UR - http://www.scopus.com/inward/record.url?scp=85179107820&partnerID=8YFLogxK
U2 - 10.1109/ICCV51070.2023.00246
DO - 10.1109/ICCV51070.2023.00246
M3 - Conference contribution
AN - SCOPUS:85179107820
T3 - Proceedings of the IEEE International Conference on Computer Vision
SP - 2605
EP - 2615
BT - Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 2 October 2023 through 6 October 2023
ER -