How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection

Yiyang Yao; Peng Liu; Tiancheng Zhao; Qianqian Zhang; Jiajia Liao; Chunxin Fang; Kyusong Lee; Qing Wang

doi:10.1609/aaai.v38i7.28485

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

2 Citations (Scopus)

Abstract

Object detection (OD) in computer vision has made significant progress in recent years, transitioning from closed-set labels to open-vocabulary detection (OVD) based on large-scale vision-language pre-training (VLP). However, current evaluation methods and datasets are limited to testing generalization over object types and referral expressions, which do not provide a systematic, fine-grained, and accurate benchmark of OVD models' abilities. In this paper, we propose a new benchmark named OVDEval, which includes 9 subtasks and introduces evaluations on commonsense knowledge, attribute understanding, position understanding, object relation comprehension, and more. The dataset is meticulously created to provide hard negatives that challenge models' true understanding of visual and linguistic input. Additionally, we identify a problem with the popular Average Precision (AP) metric when benchmarking models on these fine-grained label datasets and propose a new metric called Non-Maximum Suppression Average Precision (NMS-AP) to address this issue. Extensive experimental results show that existing top OVD models all fail on the new tasks except for simple object types, demonstrating the value of the proposed dataset in pinpointing the weakness of current OVD models and guiding future research. Furthermore, the proposed NMS-AP metric is verified by experiments to provide a much more truthful evaluation of OVD models, whereas traditional AP metrics yield deceptive results. Data is available at https://github.com/om-ai-lab/OVDEval.
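The abstract names NMS-AP but does not define it. As a minimal sketch, assuming the metric amounts to running class-agnostic non-maximum suppression over all predictions before the usual AP computation, the Python below shows how plain AP can be inflated on fine-grained labels. All function names, the 0.5 IoU threshold, the single-image AP simplification, and the toy "red car" / "blue car" data are illustrative assumptions, not the authors' implementation; the official code is in the repository linked above.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Pred = Tuple[Box, str, float]            # (box, label, score)
GT = Tuple[Box, str]                     # (box, label)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def class_agnostic_nms(preds: List[Pred], iou_thr: float = 0.5) -> List[Pred]:
    """Greedy NMS that ignores labels, so overlapping boxes compete across classes."""
    kept: List[Pred] = []
    for p in sorted(preds, key=lambda p: p[2], reverse=True):
        if all(iou(p[0], k[0]) <= iou_thr for k in kept):
            kept.append(p)
    return kept


def average_precision(preds: List[Pred], gts: List[GT], label: str,
                      iou_thr: float = 0.5) -> float:
    """Non-interpolated AP for one label on one image (a simplification)."""
    gt_boxes = [box for box, gt_label in gts if gt_label == label]
    if not gt_boxes:
        return 0.0
    matched = [False] * len(gt_boxes)
    ranked = sorted((p for p in preds if p[1] == label),
                    key=lambda p: p[2], reverse=True)
    true_positives, precision_sum = 0, 0.0
    for rank, (box, _, _) in enumerate(ranked, start=1):
        # Match the prediction to the best still-unmatched ground-truth box.
        best, best_iou = -1, iou_thr
        for i, gt_box in enumerate(gt_boxes):
            overlap = iou(box, gt_box)
            if not matched[i] and overlap >= best_iou:
                best, best_iou = i, overlap
        if best >= 0:
            matched[best] = True
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / len(gt_boxes)


# Toy case: the image contains one red car. The hypothetical model outputs the
# same box for both candidate labels and is more confident in the wrong one.
car = (0.0, 0.0, 10.0, 10.0)
gts = [(car, "red car")]
preds = [(car, "red car", 0.6), (car, "blue car", 0.9)]

plain_ap = average_precision(preds, gts, "red car")
nms_ap = average_precision(class_agnostic_nms(preds), gts, "red car")
print(plain_ap, nms_ap)
```

In this toy case the per-label AP for "red car" is perfect even though the model ranks "blue car" higher for the same box. Once labels have to compete for each box, only the "blue car" prediction survives and the score for "red car" drops to zero.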

Original language	English
Title of host publication	Technical Tracks 14
Editors	Michael Wooldridge, Jennifer Dy, Sriraam Natarajan
Publisher	Association for the Advancement of Artificial Intelligence
Pages	6630-6638
Number of pages	9
Edition	7
ISBN (Electronic)	1577358872, 9781577358879
DOI	https://doi.org/10.1609/aaai.v38i7.28485
Publication status	Published - 25 Mar 2024
Event	38th AAAI Conference on Artificial Intelligence, AAAI 2024 - Vancouver, Canada. Duration: 20 Feb 2024 → 27 Feb 2024

Publication series

Name	Proceedings of the AAAI Conference on Artificial Intelligence
Number	7
Volume	38
ISSN (Print)	2159-5399
ISSN (Electronic)	2374-3468

Conference

Conference	38th AAAI Conference on Artificial Intelligence, AAAI 2024
Country/Territory	Canada
City	Vancouver
Period	20/02/24 → 27/02/24

Cite this

Yao, Y., Liu, P., Zhao, T., Zhang, Q., Liao, J., Fang, C., Lee, K., & Wang, Q. (2024). How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection. In M. Wooldridge, J. Dy, & S. Natarajan (Eds.), Technical Tracks 14 (7 ed., pp. 6630-6638). (Proceedings of the AAAI Conference on Artificial Intelligence; Vol. 38, No. 7). Association for the Advancement of Artificial Intelligence. https://doi.org/10.1609/aaai.v38i7.28485

Yao, Yiyang ; Liu, Peng ; Zhao, Tiancheng et al. / How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection. Technical Tracks 14. editor / Michael Wooldridge ; Jennifer Dy ; Sriraam Natarajan. 7. ed. Association for the Advancement of Artificial Intelligence, 2024. pp. 6630-6638 (Proceedings of the AAAI Conference on Artificial Intelligence; 7).

@inproceedings{a0ca1bb83aee4a738434a77336f9eccb,

title = "How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection",

author = "Yiyang Yao and Peng Liu and Tiancheng Zhao and Qianqian Zhang and Jiajia Liao and Chunxin Fang and Kyusong Lee and Qing Wang",

note = "Publisher Copyright: Copyright {\textcopyright} 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.; 38th AAAI Conference on Artificial Intelligence, AAAI 2024 ; Conference date: 20-02-2024 Through 27-02-2024",

year = "2024",

month = mar,

day = "25",

doi = "10.1609/aaai.v38i7.28485",

language = "English",

series = "Proceedings of the AAAI Conference on Artificial Intelligence",

publisher = "Association for the Advancement of Artificial Intelligence",

number = "7",

pages = "6630--6638",

editor = "Michael Wooldridge and Jennifer Dy and Sriraam Natarajan",

booktitle = "Technical Tracks 14",

edition = "7",

}

Yao, Y, Liu, P, Zhao, T, Zhang, Q, Liao, J, Fang, C, Lee, K & Wang, Q 2024, How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection. in M Wooldridge, J Dy & S Natarajan (eds), Technical Tracks 14. 7 edn, Proceedings of the AAAI Conference on Artificial Intelligence, no. 7, vol. 38, Association for the Advancement of Artificial Intelligence, pp. 6630-6638, 38th AAAI Conference on Artificial Intelligence, AAAI 2024, Vancouver, Canada, 20/02/24. https://doi.org/10.1609/aaai.v38i7.28485

How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection. / Yao, Yiyang; Liu, Peng; Zhao, Tiancheng et al.
Technical Tracks 14. ed. / Michael Wooldridge; Jennifer Dy; Sriraam Natarajan. 7. ed. Association for the Advancement of Artificial Intelligence, 2024. pp. 6630-6638 (Proceedings of the AAAI Conference on Artificial Intelligence; Vol. 38, No. 7).

TY - GEN

T1 - How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection

AU - Yao, Yiyang

AU - Liu, Peng

AU - Zhao, Tiancheng

AU - Zhang, Qianqian

AU - Liao, Jiajia

AU - Fang, Chunxin

AU - Lee, Kyusong

AU - Wang, Qing

PY - 2024/3/25

Y1 - 2024/3/25

UR - http://www.scopus.com/inward/record.url?scp=85189540132&partnerID=8YFLogxK

U2 - 10.1609/aaai.v38i7.28485

DO - 10.1609/aaai.v38i7.28485

M3 - Conference contribution

AN - SCOPUS:85189540132

T3 - Proceedings of the AAAI Conference on Artificial Intelligence

SP - 6630

EP - 6638

BT - Technical Tracks 14

A2 - Wooldridge, Michael

A2 - Dy, Jennifer

A2 - Natarajan, Sriraam

PB - Association for the Advancement of Artificial Intelligence

T2 - 38th AAAI Conference on Artificial Intelligence, AAAI 2024

Y2 - 20 February 2024 through 27 February 2024

ER -

Yao Y, Liu P, Zhao T, Zhang Q, Liao J, Fang C et al. How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection. In Wooldridge M, Dy J, Natarajan S, editors, Technical Tracks 14. 7 ed. Association for the Advancement of Artificial Intelligence. 2024. p. 6630-6638. (Proceedings of the AAAI Conference on Artificial Intelligence; 7). doi: 10.1609/aaai.v38i7.28485
