Abstract
Although text information has helped existing models achieve promising results in open-vocabulary object detection (OVD), small object detection (SOD) remains difficult because small objects carry little semantic information. This semantic gap also causes text and image features to be mismatched, leading to false negatives. To address these issues, we propose a Cascade Open Vocabulary Detector (Cas-OVD), which builds upon existing multi-stage detection pipelines but specializes in text-vision alignment for small objects. In particular, we adapt a multi-refined region proposal network, guided by a non-sampled anchor strategy, to reduce missed and false detections of small objects. In addition, we propose a feature conversion module based on a deformable convolution network to enhance the semantic information of small objects, including potential ones with low confidence. Unlike existing methods that rely on coarse-grained image-based features for image-text matching, Cas-OVD refines these features through a cascade alignment process in which each stage builds on the results of the previous one. Through successive error correction, this progressively strengthens the feature correlation between image regions and textual descriptions. On the joint BDD100K-SODA-D dataset, Cas-OVD achieves 17.95% APall and 14.6% APs, outperforming RegionCLIP by 3.5% APall and 3.0% APs, respectively. On the OV_COCO dataset, Cas-OVD achieves 32.71% APall and 17.26% APs, surpassing RegionCLIP by 6.6% APall and 6.1% APs, respectively.
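The image-text matching step mentioned in the abstract can be illustrated with a minimal sketch. This is not the Cas-OVD implementation: it only shows the generic open-vocabulary idea of scoring one region feature against text embeddings of category names by cosine similarity. The function names, the temperature value, and the three-dimensional toy vectors are all hypothetical; a real system would take the embeddings from a pretrained vision-language model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores, temperature=0.1):
    """Numerically stable softmax over temperature-scaled scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def match_region_to_text(region_feat, text_feats, temperature=0.1):
    """Return the best-matching category name and per-category probabilities.

    region_feat: feature vector of one region proposal (toy values here).
    text_feats:  dict mapping category name -> text embedding (toy values here).
    """
    names = list(text_feats)
    sims = [cosine(region_feat, text_feats[name]) for name in names]
    probs = softmax(sims, temperature)
    best = max(range(len(names)), key=probs.__getitem__)
    return names[best], dict(zip(names, probs))


# Hypothetical embeddings, chosen only to make the example easy to follow.
text_feats = {
    "traffic light": [0.9, 0.1, 0.0],
    "pedestrian": [0.1, 0.9, 0.1],
    "traffic cone": [0.0, 0.2, 0.9],
}
region_feat = [0.8, 0.3, 0.1]

label, probs = match_region_to_text(region_feat, text_feats)
print(label)
```

Because the category list is just a dictionary of text embeddings, new categories can be added at inference time without retraining, which is the property that makes the detector open-vocabulary. The cascade refinement described in the abstract would operate on the region features before this matching step; that part is not sketched here.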
| Original language | English |
|---|---|
| Pages (from-to) | 757-771 |
| Number of pages | 15 |
| Journal | IEEE Transactions on Multimedia |
| Volume | 28 |
| DOIs | |
| State | Published - 2026 |
Keywords
- Open-vocabulary detection (OVD)
- cascaded OVD
- refined proposal network (RPN)
- small object detection (SOD)
Title
Cas-OVD: Cascaded Open-Vocabulary Detection of Small Objects Using Multi-Refined Region Proposal Network in Autonomous Driving