TY - JOUR
T1 - Balancing Optimization Strategies and Practical Goals
T2 - An Efficient Scene Text Detector
AU - Han, Xu
AU - Yang, Chuang
AU - Gao, Junyu
AU - Wang, Qi
N1 - Publisher Copyright: © 1999-2012 IEEE.
PY - 2026
Y1 - 2026
N2 - Scene text reading is a crucial task for scene understanding. Text detection, as a fundamental task in scene text reading, has recently garnered significant attention. Among various approaches, segmentation-based methods stand out for their flexible pixel-level prediction capabilities. However, two main issues remain. 1) These methods treat all text instances as a pixel set during training, causing the features of large-scale instances to dominate the model optimization process. As a result, the optimization deviates from the instance-level objectives. 2) Segmentation methods filter candidates based on pixel-level class scores, whereas what is needed is an evaluation of whether an instance is text, which also deviates from the original goals. To address these issues, we propose an Instance-Equal Feature Guide Module (IEFGM), a Cross-Level Feature Interaction Module (CLIFM), and a Pixel-Instance Fusion Discriminator (PIFD) to balance optimization strategies with practical goals. The IEFGM introduces instance-level features and positional information, guiding the model to treat instances of different scales equally at the feature level. The CLIFM encourages feature interaction across different levels, enabling the model to recognize text from various perspectives. Unlike existing methods that filter candidates using pixel-level results, the PIFD integrates both instance-level and pixel-level information to identify candidate regions, aligning with the original goals of text detection. A series of ablation studies demonstrates the effectiveness of the proposed modules. Extensive experiments across six datasets from different scenes demonstrate that our method outperforms existing state-of-the-art approaches.
KW - Object detection
KW - multi-scene
KW - semantic segmentation
KW - text detection
UR - https://www.scopus.com/pages/publications/105020274484
U2 - 10.1109/TMM.2025.3623548
DO - 10.1109/TMM.2025.3623548
M3 - Article
AN - SCOPUS:105020274484
SN - 1520-9210
VL - 28
SP - 426
EP - 438
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -