TY - JOUR
T1 - CAT: A Simple yet Effective Cross-Attention Transformer for One-Shot Object Detection
AU - Lin, Wei Dong
AU - Deng, Yu Yan
AU - Gao, Yang
AU - Wang, Ning
AU - Liu, Ling Qiao
AU - Zhang, Lei
AU - Wang, Peng
N1 - Publisher Copyright:
© Institute of Computing Technology, Chinese Academy of Sciences 2024.
PY - 2024/3
Y1 - 2024/3
N2 - Given a query patch from a novel class, one-shot object detection aims to detect all instances of this class in a target image through semantic similarity comparison. However, due to the extremely limited guidance available for the novel class and the unseen appearance differences between the query and target instances, it is difficult to appropriately exploit their semantic similarity and generalize well. To mitigate this problem, we present a universal Cross-Attention Transformer (CAT) module for accurate and efficient semantic similarity comparison in one-shot object detection. The proposed CAT utilizes the transformer mechanism to comprehensively capture the bi-directional correspondence between any paired pixels from the query and the target image, which allows us to fully exploit their semantic characteristics for accurate similarity comparison. In addition, the proposed CAT enables feature dimensionality compression for inference speedup without performance loss. Extensive experiments on three object detection datasets, MS-COCO, PASCAL VOC, and FSOD, under the one-shot setting demonstrate the effectiveness and efficiency of our model, e.g., it surpasses CoAE, a major baseline in this task, by 1.0% in average precision (AP) on MS-COCO and runs nearly 2.5 times faster.
AB - Given a query patch from a novel class, one-shot object detection aims to detect all instances of this class in a target image through semantic similarity comparison. However, due to the extremely limited guidance available for the novel class and the unseen appearance differences between the query and target instances, it is difficult to appropriately exploit their semantic similarity and generalize well. To mitigate this problem, we present a universal Cross-Attention Transformer (CAT) module for accurate and efficient semantic similarity comparison in one-shot object detection. The proposed CAT utilizes the transformer mechanism to comprehensively capture the bi-directional correspondence between any paired pixels from the query and the target image, which allows us to fully exploit their semantic characteristics for accurate similarity comparison. In addition, the proposed CAT enables feature dimensionality compression for inference speedup without performance loss. Extensive experiments on three object detection datasets, MS-COCO, PASCAL VOC, and FSOD, under the one-shot setting demonstrate the effectiveness and efficiency of our model, e.g., it surpasses CoAE, a major baseline in this task, by 1.0% in average precision (AP) on MS-COCO and runs nearly 2.5 times faster.
KW - attention mechanism
KW - one-shot object detection
KW - Transformer
UR - http://www.scopus.com/inward/record.url?scp=85195504162&partnerID=8YFLogxK
U2 - 10.1007/s11390-024-1743-6
DO - 10.1007/s11390-024-1743-6
M3 - Article
AN - SCOPUS:85195504162
SN - 1000-9000
VL - 39
SP - 460
EP - 471
JO - Journal of Computer Science and Technology
JF - Journal of Computer Science and Technology
IS - 2
ER -