Tracking based on scale-estimated deep networks with hierarchical correlation ensembling for cross-media understanding

Hanqiao Huang; Yamin Han; Peng Zhang; Wei Huang

doi:10.1016/j.displa.2021.102055

School of Computer Science

Research output: Contribution to journal › Article › peer-review

5 Scopus citations

Abstract

In vision-based cross-media applications, objects of interest within the visual regions usually need to be accurately localized and tracked for more effective understanding and generating image descriptions (UGID), as in audio-visual lip recognition. Unfortunately, robust tracking in realistic scenarios is usually challenged by dynamic appearance variations while the object is in motion. Recent studies on deep neural networks for classification and recognition tasks have inspired great progress in visual tracking, but the intrinsic assumption of scale invariance during target modeling still limits further improvement in tracking performance. Motivated by learning the object appearance together with a scale estimate, this study proposes a scale-estimated deep network (SEN) to predict object size more accurately during tracking. By incorporating the proposed SEN into a hierarchical correlation ensembling framework, a joint translation-scale tracking scheme is obtained that estimates the position and scale of the target object simultaneously. Extensive experiments on challenging benchmark datasets demonstrate that the proposed tracker achieves competitive results. Additionally, a performance evaluation on lip tracking shows that the proposed work can also support audio-visual recognition tasks in different types of cross-media applications.

Original language	English
Article number	102055
Journal	Displays
Volume	69
DOIs	https://doi.org/10.1016/j.displa.2021.102055
State	Published - Sep 2021

Keywords

Cross-media understanding
Hierarchical correlation ensembling
Scale-estimated deep networks
Tracking

Cite this

@article{014d599b9ab64949bbb214a6981454bc,

title = "Tracking based on scale-estimated deep networks with hierarchical correlation ensembling for cross-media understanding",

abstract = "In vision-based cross-media applications, objects of interest within the visual regions usually need to be accurately localized and tracked for more effective understanding and generating image descriptions (UGID), as in audio-visual lip recognition. Unfortunately, robust tracking in realistic scenarios is usually challenged by dynamic appearance variations while the object is in motion. Recent studies on deep neural networks for classification and recognition tasks have inspired great progress in visual tracking, but the intrinsic assumption of scale invariance during target modeling still limits further improvement in tracking performance. Motivated by learning the object appearance together with a scale estimate, this study proposes a scale-estimated deep network (SEN) to predict object size more accurately during tracking. By incorporating the proposed SEN into a hierarchical correlation ensembling framework, a joint translation-scale tracking scheme is obtained that estimates the position and scale of the target object simultaneously. Extensive experiments on challenging benchmark datasets demonstrate that the proposed tracker achieves competitive results. Additionally, a performance evaluation on lip tracking shows that the proposed work can also support audio-visual recognition tasks in different types of cross-media applications.",

keywords = "Cross-media understanding, Hierarchical correlation ensembling, Scale-estimated deep networks, Tracking",

author = "Hanqiao Huang and Yamin Han and Peng Zhang and Wei Huang",

note = "Publisher Copyright: {\textcopyright} 2021",

year = "2021",

month = sep,

doi = "10.1016/j.displa.2021.102055",

language = "English",

volume = "69",

journal = "Displays",

issn = "0141-9382",

publisher = "Elsevier B.V.",

}

TY - JOUR

T1 - Tracking based on scale-estimated deep networks with hierarchical correlation ensembling for cross-media understanding

AU - Huang, Hanqiao

AU - Han, Yamin

AU - Zhang, Peng

AU - Huang, Wei

PY - 2021/9

Y1 - 2021/9

N2 - In vision-based cross-media applications, objects of interest within the visual regions usually need to be accurately localized and tracked for more effective understanding and generating image descriptions (UGID), as in audio-visual lip recognition. Unfortunately, robust tracking in realistic scenarios is usually challenged by dynamic appearance variations while the object is in motion. Recent studies on deep neural networks for classification and recognition tasks have inspired great progress in visual tracking, but the intrinsic assumption of scale invariance during target modeling still limits further improvement in tracking performance. Motivated by learning the object appearance together with a scale estimate, this study proposes a scale-estimated deep network (SEN) to predict object size more accurately during tracking. By incorporating the proposed SEN into a hierarchical correlation ensembling framework, a joint translation-scale tracking scheme is obtained that estimates the position and scale of the target object simultaneously. Extensive experiments on challenging benchmark datasets demonstrate that the proposed tracker achieves competitive results. Additionally, a performance evaluation on lip tracking shows that the proposed work can also support audio-visual recognition tasks in different types of cross-media applications.

AB - In vision-based cross-media applications, objects of interest within the visual regions usually need to be accurately localized and tracked for more effective understanding and generating image descriptions (UGID), as in audio-visual lip recognition. Unfortunately, robust tracking in realistic scenarios is usually challenged by dynamic appearance variations while the object is in motion. Recent studies on deep neural networks for classification and recognition tasks have inspired great progress in visual tracking, but the intrinsic assumption of scale invariance during target modeling still limits further improvement in tracking performance. Motivated by learning the object appearance together with a scale estimate, this study proposes a scale-estimated deep network (SEN) to predict object size more accurately during tracking. By incorporating the proposed SEN into a hierarchical correlation ensembling framework, a joint translation-scale tracking scheme is obtained that estimates the position and scale of the target object simultaneously. Extensive experiments on challenging benchmark datasets demonstrate that the proposed tracker achieves competitive results. Additionally, a performance evaluation on lip tracking shows that the proposed work can also support audio-visual recognition tasks in different types of cross-media applications.

KW - Cross-media understanding

KW - Hierarchical correlation ensembling

KW - Scale-estimated deep networks

KW - Tracking

UR - http://www.scopus.com/inward/record.url?scp=85113635550&partnerID=8YFLogxK

U2 - 10.1016/j.displa.2021.102055

DO - 10.1016/j.displa.2021.102055

M3 - Article

AN - SCOPUS:85113635550

SN - 0141-9382

VL - 69

JO - Displays

JF - Displays

M1 - 102055

ER -
