TY - JOUR
T1 - Tracking based on scale-estimated deep networks with hierarchical correlation ensembling for cross-media understanding
AU - Huang, Hanqiao
AU - Han, Yamin
AU - Zhang, Peng
AU - Huang, Wei
N1 - Publisher Copyright: © 2021
PY - 2021/9
Y1 - 2021/9
N2 - In many vision-based cross-media applications, such as audio-visual lip recognition, objects of interest within the visual regions must be accurately localized and tracked to support more effective understanding and generating image descriptions (UGID). Unfortunately, robust tracking in realistic scenarios is challenged by dynamic appearance variations as the object moves. Recent studies on deep neural networks for classification and recognition tasks have inspired great progress in visual tracking, but the intrinsic assumption of scale invariance during target modeling still limits further improvement in tracking performance. Motivated by learning object appearance together with scale estimation, this study proposes a scale-estimated deep network (SEN) to predict object size more accurately during tracking. By incorporating the proposed SEN into a hierarchical correlation ensembling framework, a joint translation-scale tracking scheme is obtained that estimates the position and scale of the target object simultaneously. Extensive experiments on challenging benchmark datasets demonstrate that the proposed tracker achieves competitive results. Additionally, a performance evaluation on lip tracking shows that the proposed method can also support audio-visual recognition tasks in different types of cross-media applications.
AB - In many vision-based cross-media applications, such as audio-visual lip recognition, objects of interest within the visual regions must be accurately localized and tracked to support more effective understanding and generating image descriptions (UGID). Unfortunately, robust tracking in realistic scenarios is challenged by dynamic appearance variations as the object moves. Recent studies on deep neural networks for classification and recognition tasks have inspired great progress in visual tracking, but the intrinsic assumption of scale invariance during target modeling still limits further improvement in tracking performance. Motivated by learning object appearance together with scale estimation, this study proposes a scale-estimated deep network (SEN) to predict object size more accurately during tracking. By incorporating the proposed SEN into a hierarchical correlation ensembling framework, a joint translation-scale tracking scheme is obtained that estimates the position and scale of the target object simultaneously. Extensive experiments on challenging benchmark datasets demonstrate that the proposed tracker achieves competitive results. Additionally, a performance evaluation on lip tracking shows that the proposed method can also support audio-visual recognition tasks in different types of cross-media applications.
KW - Cross-media understanding
KW - Hierarchical correlation ensembling
KW - Scale-estimated deep networks
KW - Tracking
UR - http://www.scopus.com/inward/record.url?scp=85113635550&partnerID=8YFLogxK
U2 - 10.1016/j.displa.2021.102055
DO - 10.1016/j.displa.2021.102055
M3 - Article
AN - SCOPUS:85113635550
SN - 0141-9382
VL - 69
JO - Displays
JF - Displays
M1 - 102055
ER -