TY - GEN
T1 - Local-enhanced interaction for temporal moment localization
AU - Liang, Guoqiang
AU - Ji, Shiyu
AU - Zhang, Yanning
N1 - Publisher Copyright:
© 2021 ACM.
PY - 2021/8/24
Y1 - 2021/8/24
N2 - Temporal moment localization via language aims to localize the span in an untrimmed video that best matches a given natural language query. Most previous works match the whole query feature with multiple moment proposals, or match a global video embedding with phrase- or word-level query features. However, these coarse interaction models become insufficient when the query-video relationship is more complex. To address this issue, we propose a multi-branch interaction model for temporal moment localization. Specifically, the query sentence and video are encoded into multiple feature embeddings over several semantic sub-spaces. Each phrase embedding then filters the video features to generate an attention sequence, which is used to re-weight the video features. Moreover, a dynamic pointer decoder is developed to iteratively regress the temporal boundary, which prevents our model from falling into a local optimum. To validate the proposed method, we have conducted extensive experiments on two popular benchmark datasets, Charades-STA and TACoS. The experimental results surpass other state-of-the-art methods, demonstrating the effectiveness of our proposed model.
AB - Temporal moment localization via language aims to localize the span in an untrimmed video that best matches a given natural language query. Most previous works match the whole query feature with multiple moment proposals, or match a global video embedding with phrase- or word-level query features. However, these coarse interaction models become insufficient when the query-video relationship is more complex. To address this issue, we propose a multi-branch interaction model for temporal moment localization. Specifically, the query sentence and video are encoded into multiple feature embeddings over several semantic sub-spaces. Each phrase embedding then filters the video features to generate an attention sequence, which is used to re-weight the video features. Moreover, a dynamic pointer decoder is developed to iteratively regress the temporal boundary, which prevents our model from falling into a local optimum. To validate the proposed method, we have conducted extensive experiments on two popular benchmark datasets, Charades-STA and TACoS. The experimental results surpass other state-of-the-art methods, demonstrating the effectiveness of our proposed model.
KW - Dynamic pointer decoder
KW - Multi-branches video-language interaction
KW - Temporal moment localization
UR - http://www.scopus.com/inward/record.url?scp=85114887311&partnerID=8YFLogxK
U2 - 10.1145/3460426.3463616
DO - 10.1145/3460426.3463616
M3 - Conference contribution
AN - SCOPUS:85114887311
T3 - ICMR 2021 - Proceedings of the 2021 International Conference on Multimedia Retrieval
SP - 201
EP - 209
BT - ICMR 2021 - Proceedings of the 2021 International Conference on Multimedia Retrieval
PB - Association for Computing Machinery, Inc
T2 - 11th ACM International Conference on Multimedia Retrieval, ICMR 2021
Y2 - 16 November 2021 through 19 November 2021
ER -