TY - JOUR
T1 - Scene video text tracking based on hybrid deep text detection and layout constraint
AU - Wang, Xihan
AU - Feng, Xiaoyi
AU - Xia, Zhaoqiang
N1 - Publisher Copyright:
© 2019 Elsevier B.V.
PY - 2019/10/21
Y1 - 2019/10/21
N2 - Video text in real-world scenes often carries rich high-level semantic information and plays an increasingly important role in content-based video analysis and retrieval. Scene video text detection and tracking are therefore important prerequisites for numerous multimedia applications. However, the performance of most existing tracking methods is unsatisfactory due to frequent mis-detections, unexpected camera motion, and similar appearances among text regions. To address these problems, we propose a new video text tracking approach based on hybrid deep text detection and a layout constraint. First, a deep text detection network that combines the advantages of object detection and semantic segmentation in a hybrid way is proposed to locate possible text candidates in individual frames. Then, text trajectories are derived from consecutive frames with a novel data association method, which effectively exploits the layout constraint of text regions under large camera motion. By utilizing the layout constraint, ambiguities caused by similar text regions are effectively reduced. We conduct experiments on four benchmark datasets, i.e., ICDAR 2015, MSRA-TD500, USTB-SV1K and Minetto, to evaluate the proposed method. The experimental results demonstrate the effectiveness and superiority of the proposed approach.
KW - Convolutional neural networks
KW - Hybrid architecture
KW - Layout constraint
KW - Scene video text
KW - Text detection and tracking
U2 - 10.1016/j.neucom.2019.05.101
DO - 10.1016/j.neucom.2019.05.101
M3 - Article
AN - SCOPUS:85071329307
SN - 0925-2312
VL - 363
SP - 223
EP - 235
JO - Neurocomputing
JF - Neurocomputing
ER -