TY - JOUR
T1 - Cas-OVD
T2 - Cascaded Open-Vocabulary Detection of Small Objects Using Multi-Refined Region Proposal Network in Autonomous Driving
AU - Fang, Zhenyu
AU - Wu, Yulong
AU - Ren, Jinchang
AU - Zheng, Jiangbin
AU - Yan, Yijun
AU - Zhang, Lixiang
N1 - Publisher Copyright:
© 1999-2012 IEEE.
PY - 2026
Y1 - 2026
AB - Although text information has helped existing models achieve promising results in open-vocabulary object detection (OVD), the lack of semantic information makes small object detection (SOD) difficult. Moreover, this semantic gap also causes failures when matching text and image features, resulting in false negatives. To address these issues, we propose a Cascade Open Vocabulary Detector (Cas-OVD), which builds upon existing multi-stage detection pipelines but specializes in text-vision alignment for small objects. In particular, we adapt a multi-refined region proposal network, guided by a non-sampled anchor strategy, to reduce missed and false detections of small objects. Meanwhile, a feature conversion module based on a deformable convolution network is proposed to enhance the semantic information of small objects, including potential ones with low confidence. Unlike existing methods that rely on coarse-grained image-based features for image-text matching, Cas-OVD refines these features through a cascade alignment process, allowing each stage to build on the results of the previous one. This progressively enhances the feature correlation between image regions and textual descriptions through successive error correction. On the joint BDD100K-SODA-D dataset, Cas-OVD achieves 17.95% APall and 14.6% APs, outperforming RegionCLIP by 3.5% APall and 3.0% APs, respectively. On the OV_COCO dataset, Cas-OVD achieves 32.71% APall and 17.26% APs, surpassing RegionCLIP by 6.6% APall and 6.1% APs, respectively.
KW - Open-vocabulary detection (OVD)
KW - cascaded OVD
KW - refined proposal network (RPN)
KW - small object detection (SOD)
UR - https://www.scopus.com/pages/publications/105021945451
U2 - 10.1109/TMM.2025.3632649
DO - 10.1109/TMM.2025.3632649
M3 - Article
AN - SCOPUS:105021945451
SN - 1520-9210
VL - 28
SP - 757
EP - 771
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -