ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding

Zoey Guo, Yiwen Tang, Ray Zhang, Dong Wang, Zhigang Wang, Bin Zhao, Xuelong Li

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

20 Citations (Scopus)

Abstract

Understanding 3D scenes from multi-view inputs has been proven to alleviate the view discrepancy issue in 3D visual grounding. However, existing methods normally neglect the view cues embedded in the text modality and fail to weigh the relative importance of different views. In this paper, we propose ViewRefer, a multi-view framework for 3D visual grounding that explores how to grasp view knowledge from both the text and 3D modalities. For the text branch, ViewRefer leverages the diverse linguistic knowledge of large-scale language models, e.g., GPT, to expand a single grounding text into multiple geometry-consistent descriptions. Meanwhile, in the 3D modality, a transformer fusion module with inter-view attention is introduced to boost the interaction of objects across views. On top of that, we further present a set of learnable multi-view prototypes, which memorize scene-agnostic knowledge for different views and enhance the framework from two perspectives: a view-guided attention module for more robust text features, and a view-guided scoring strategy during the final prediction. With this paradigm, ViewRefer achieves superior performance on three benchmarks, surpassing the second-best method by +2.8%, +1.5%, and +1.35% on Sr3D, Nr3D, and ScanRefer, respectively.
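
The two 3D-side components named in the abstract, inter-view attention fusion and view-guided scoring with learnable scene-agnostic prototypes, can be pictured with a minimal PyTorch-style sketch. All module names, tensor shapes, and hyperparameters below are illustrative assumptions for this record, not the authors' released implementation.

import torch
import torch.nn as nn


class InterViewFusion(nn.Module):
    # Self-attention over the view axis so object features observed from
    # different views can exchange information (hypothetical layout).
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, obj_feats: torch.Tensor) -> torch.Tensor:
        # obj_feats: (batch * num_objects, num_views, dim)
        fused, _ = self.attn(obj_feats, obj_feats, obj_feats)
        return self.norm(obj_feats + fused)


class ViewGuidedScoring(nn.Module):
    # Weights per-view grounding scores by the similarity between learnable,
    # scene-agnostic view prototypes and the pooled text feature.
    def __init__(self, dim: int, num_views: int):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_views, dim))

    def forward(self, per_view_scores: torch.Tensor,
                text_feat: torch.Tensor) -> torch.Tensor:
        # per_view_scores: (batch, num_views, num_objects)
        # text_feat:       (batch, dim)
        weights = torch.softmax(text_feat @ self.prototypes.t(), dim=-1)
        return (weights.unsqueeze(-1) * per_view_scores).sum(dim=1)


# Example with hypothetical sizes: 4 views, 256-dim features, 16 objects.
fusion = InterViewFusion(dim=256)
scorer = ViewGuidedScoring(dim=256, num_views=4)
fused = fusion(torch.randn(2 * 16, 4, 256))
final_scores = scorer(torch.randn(2, 4, 16), torch.randn(2, 256))

The intent of the scoring step, as described in the abstract, is that views more consistent with the text description contribute more to the final prediction; how the per-view scores themselves are produced is left out of this sketch.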

Original language: English
Title of host publication: Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 15326-15337
Number of pages: 12
ISBN (electronic): 9798350307184
DOI
Publication status: Published - 2023
Event: 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023 - Paris, France
Duration: 2 Oct 2023 – 6 Oct 2023

Publication series

Name: Proceedings of the IEEE International Conference on Computer Vision
ISSN (print): 1550-5499

Conference

Conference: 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
Country/Territory: France
City: Paris
Period: 2/10/23 – 6/10/23
