Local-enhanced interaction for temporal moment localization

Guoqiang Liang, Shiyu Ji, Yanning Zhang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Citations (Scopus)

Abstract

Temporal moment localization via language aims to localize the video span in an untrimmed video that best matches a given natural language query. Most previous works either match the whole query feature against multiple moment proposals, or match a global video embedding against phrase- or word-level query features. However, these coarse interaction models become insufficient when the query-video pair contains more complex relationships. To address this issue, we propose a multi-branch interaction model for temporal moment localization. Specifically, the query sentence and video are encoded into multiple feature embeddings over several semantic sub-spaces. Then, each phrase embedding filters the video features to generate an attention sequence, which is used to re-weight the video features. Moreover, a dynamic pointer decoder is developed to iteratively regress the temporal boundary, which prevents our model from falling into a local optimum. To validate the proposed method, we have conducted extensive experiments on two popular benchmark datasets, Charades-STA and TACoS. The experimental performance surpasses other state-of-the-art methods, which demonstrates the effectiveness of our proposed model.
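The following is a minimal sketch of the phrase-level interaction step described in the abstract, where each phrase embedding attends over the clip-level video features and the resulting attention sequence re-weights those features. This is an illustrative assumption, not the authors' released code: the class name, tensor shapes, and the scaled dot-product form of the attention are hypothetical choices.

```python
# Illustrative sketch (assumed, not the authors' implementation) of phrase-guided
# re-weighting of video features, one branch per phrase embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhraseVideoInteraction(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)   # projects phrase embeddings
        self.video_proj = nn.Linear(dim, dim)   # projects clip-level video features

    def forward(self, phrase_emb: torch.Tensor, video_feat: torch.Tensor) -> torch.Tensor:
        """
        phrase_emb: (batch, num_phrases, dim)  -- one embedding per semantic sub-space
        video_feat: (batch, num_clips, dim)    -- clip-level video features
        returns:    (batch, num_phrases, num_clips, dim) re-weighted video features
        """
        q = self.query_proj(phrase_emb)                              # (B, P, D)
        v = self.video_proj(video_feat)                              # (B, T, D)
        # similarity of every phrase with every clip, softmax over the time axis
        scores = torch.einsum("bpd,btd->bpt", q, v) / v.size(-1) ** 0.5
        attn = F.softmax(scores, dim=-1)                             # (B, P, T)
        # re-weight the video features separately for each phrase branch
        return attn.unsqueeze(-1) * video_feat.unsqueeze(1)          # (B, P, T, D)
```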

Original language: English
Host publication title: ICMR 2021 - Proceedings of the 2021 International Conference on Multimedia Retrieval
Publisher: Association for Computing Machinery, Inc
Pages: 201-209
Number of pages: 9
ISBN (electronic): 9781450384636
DOI
Publication status: Published - 24 Aug 2021
Event: 11th ACM International Conference on Multimedia Retrieval, ICMR 2021 - Taipei, Taiwan, China
Duration: 16 Nov 2021 - 19 Nov 2021

Publication series

Name: ICMR 2021 - Proceedings of the 2021 International Conference on Multimedia Retrieval

Conference

Conference: 11th ACM International Conference on Multimedia Retrieval, ICMR 2021
Country/Territory: Taiwan, China
City: Taipei
Period: 16/11/21 - 19/11/21
