TY - JOUR
T1 - Compensating for the Incomplete With the Complete
T2 - An Efficient Scene Text Detector
AU - Han, Xu
AU - Wang, Qi
N1 - Publisher Copyright: © 1991-2012 IEEE.
PY - 2025
Y1 - 2025
AB - Scene text reading is an essential component of scene understanding, and text detection, its fundamental prerequisite, has attracted increasing attention. Among the various methods, segmenting the text kernel and expanding it to reconstruct text instances is both efficient and effective. However, the incomplete semantic features of text kernels and the high similarity between kernels and texts make it hard to extract kernels from images accurately. To address this, we propose an efficient text detector, termed CIC, which comprises a bidirectional information transfer module (BITM), a dual knowledge integration module (DKIM), and a cross-verification module (CVM). BITM generates collaborative information between the predicted text and kernel via the proposed differentiable adaptive gap operator. This enforces mutual restraint and collaborative progress between the text and kernel predictions. Unlike BITM, DKIM adopts a knowledge fusion scheme that helps locate kernels accurately under the guidance of the complete semantic features of texts. Intuitively, because the kernel is generated by shrinking the text, kernel pixels appear only inside the text area. Based on this criterion, CVM further uses text predictions to constrain kernel predictions and reduce false positives. Ablation experiments demonstrate the effectiveness of the proposed BITM, DKIM, and CVM. Extensive experiments show that the proposed CIC outperforms existing state-of-the-art (SOTA) methods on five public datasets from different scenes.
KW - Real-time
KW - multi-scene
KW - semantic segmentation
KW - text detection
UR - https://www.scopus.com/pages/publications/105012453190
U2 - 10.1109/TCSVT.2025.3588711
DO - 10.1109/TCSVT.2025.3588711
M3 - Article
AN - SCOPUS:105012453190
SN - 1051-8215
VL - 35
SP - 12096
EP - 12108
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 12
ER -