TY - GEN
T1 - Local-enhanced interaction for temporal moment localization
AU - Liang, Guoqiang
AU - Ji, Shiyu
AU - Zhang, Yanning
N1 - Publisher Copyright:
© 2021 ACM.
PY - 2021/8/24
Y1 - 2021/8/24
N2 - Temporal moment localization via language aims to localize the span in an untrimmed video that best matches a given natural language query. Most previous works match the whole query feature with multiple moment proposals, or match a global video embedding with phrase- or word-level query features. However, these coarse interaction models become insufficient when the query-video relationship is more complex. To address this issue, we propose a multi-branch interaction model for temporal moment localization. Specifically, the query sentence and video are encoded into multiple feature embeddings over several semantic sub-spaces. Each phrase embedding then filters the video features to generate an attention sequence, which is used to re-weight the video features. Moreover, a dynamic pointer decoder is developed to iteratively regress the temporal boundary, which prevents our model from falling into a local optimum. To validate the proposed method, we have conducted extensive experiments on two popular benchmark datasets, Charades-STA and TACoS. The experimental results surpass other state-of-the-art methods, demonstrating the effectiveness of our proposed model.
AB - Temporal moment localization via language aims to localize the span in an untrimmed video that best matches a given natural language query. Most previous works match the whole query feature with multiple moment proposals, or match a global video embedding with phrase- or word-level query features. However, these coarse interaction models become insufficient when the query-video relationship is more complex. To address this issue, we propose a multi-branch interaction model for temporal moment localization. Specifically, the query sentence and video are encoded into multiple feature embeddings over several semantic sub-spaces. Each phrase embedding then filters the video features to generate an attention sequence, which is used to re-weight the video features. Moreover, a dynamic pointer decoder is developed to iteratively regress the temporal boundary, which prevents our model from falling into a local optimum. To validate the proposed method, we have conducted extensive experiments on two popular benchmark datasets, Charades-STA and TACoS. The experimental results surpass other state-of-the-art methods, demonstrating the effectiveness of our proposed model.
KW - Dynamic pointer decoder
KW - Multi-branches video-language interaction
KW - Temporal moment localization
UR - http://www.scopus.com/inward/record.url?scp=85114887311&partnerID=8YFLogxK
U2 - 10.1145/3460426.3463616
DO - 10.1145/3460426.3463616
M3 - Conference contribution
AN - SCOPUS:85114887311
T3 - ICMR 2021 - Proceedings of the 2021 International Conference on Multimedia Retrieval
SP - 201
EP - 209
BT - ICMR 2021 - Proceedings of the 2021 International Conference on Multimedia Retrieval
PB - Association for Computing Machinery, Inc
T2 - 11th ACM International Conference on Multimedia Retrieval, ICMR 2021
Y2 - 16 November 2021 through 19 November 2021
ER -