ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding

Zoey Guo; Yiwen Tang; Ray Zhang; Dong Wang; Zhigang Wang; Bin Zhao; Xuelong Li

doi:10.1109/ICCV51070.2023.01410

School of Artificial Intelligence, Optics and Electronics

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

20 Scopus citations

Abstract

Understanding 3D scenes from multi-view inputs has been proven to alleviate the view discrepancy issue in 3D visual grounding. However, existing methods normally neglect the view cues embedded in the text modality and fail to weigh the relative importance of different views. In this paper, we propose ViewRefer, a multi-view framework for 3D visual grounding exploring how to grasp the view knowledge from both text and 3D modalities. For the text branch, ViewRefer leverages the diverse linguistic knowledge of large-scale language models, e.g., GPT, to expand a single grounding text to multiple geometry-consistent descriptions. Meanwhile, in the 3D modality, a transformer fusion module with inter-view attention is introduced to boost the interaction of objects across views. On top of that, we further present a set of learnable multi-view prototypes, which memorize scene-agnostic knowledge for different views, and enhance the framework from two perspectives: a view-guided attention module for more robust text features, and a view-guided scoring strategy during the final prediction. With our designed paradigm, ViewRefer achieves superior performance on three benchmarks and surpasses the second-best by +2.8%, +1.5%, and +1.35% on Sr3D, Nr3D, and ScanRefer.
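The view-guided scoring idea mentioned in the abstract can be pictured with a small sketch. This is only one plausible reading, not the paper's actual formulation: it assumes each view yields a score per candidate object and that a scalar relevance per view (here the hypothetical `view_relevance`, standing in for whatever the learnable prototypes produce) is turned into weights by a softmax. All function and variable names are illustrative assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def view_guided_score(view_logits, view_relevance):
    """Fuse per-view object scores into one score per object.

    view_logits[v][o]: score of candidate object o under view v.
    view_relevance[v]: hypothetical scalar for how well view v matches
    the text (an assumption; the abstract does not specify this).
    """
    weights = softmax(view_relevance)
    n_objects = len(view_logits[0])
    return [
        sum(w * logits[o] for w, logits in zip(weights, view_logits))
        for o in range(n_objects)
    ]


def predict(view_logits, view_relevance):
    """Return the index of the highest-scoring object after fusion."""
    fused = view_guided_score(view_logits, view_relevance)
    return max(range(len(fused)), key=fused.__getitem__)


# Two views, three candidate objects (made-up numbers for illustration).
example_logits = [[2.0, 0.5, 0.1], [0.4, 1.0, 0.2]]
equal_pick = predict(example_logits, [0.0, 0.0])    # views weighted equally
skewed_pick = predict(example_logits, [0.0, 10.0])  # second view dominates
```

With equal weights the first view's strong preference for object 0 carries the fused score; when the second view is weighted far more heavily, its preferred object 1 is selected instead. This illustrates why weighing views, rather than averaging them, can change the final prediction.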

Original language	English
Title of host publication	Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	15326-15337
Number of pages	12
ISBN (Electronic)	9798350307184
DOIs	https://doi.org/10.1109/ICCV51070.2023.01410
State	Published - 2023
Event	2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023 - Paris, France; Duration: 2 Oct 2023 → 6 Oct 2023

Publication series

Name	Proceedings of the IEEE International Conference on Computer Vision
ISSN (Print)	1550-5499

Conference

Conference	2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
Country/Territory	France
City	Paris
Period	2 Oct 2023 → 6 Oct 2023

Access to Document

10.1109/ICCV51070.2023.01410

Cite this

Guo, Z., Tang, Y., Zhang, R., Wang, D., Wang, Z., Zhao, B., & Li, X. (2023). ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding. In Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023 (pp. 15326-15337). (Proceedings of the IEEE International Conference on Computer Vision). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICCV51070.2023.01410

@inproceedings{0e76002fd90943ab900d282006ba5393,

title = "ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding",

abstract = "Understanding 3D scenes from multi-view inputs has been proven to alleviate the view discrepancy issue in 3D visual grounding. However, existing methods normally neglect the view cues embedded in the text modality and fail to weigh the relative importance of different views. In this paper, we propose ViewRefer, a multi-view framework for 3D visual grounding exploring how to grasp the view knowledge from both text and 3D modalities. For the text branch, ViewRefer leverages the diverse linguistic knowledge of large-scale language models, e.g., GPT, to expand a single grounding text to multiple geometry-consistent descriptions. Meanwhile, in the 3D modality, a transformer fusion module with inter-view attention is introduced to boost the interaction of objects across views. On top of that, we further present a set of learnable multi-view prototypes, which memorize scene-agnostic knowledge for different views, and enhance the framework from two perspectives: a view-guided attention module for more robust text features, and a view-guided scoring strategy during the final prediction. With our designed paradigm, ViewRefer achieves superior performance on three benchmarks and surpasses the second-best by +2.8%, +1.5%, and +1.35% on Sr3D, Nr3D, and ScanRefer.",

author = "Zoey Guo and Yiwen Tang and Ray Zhang and Dong Wang and Zhigang Wang and Bin Zhao and Xuelong Li",

note = "Publisher Copyright: {\textcopyright} 2023 IEEE.; 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023 ; Conference date: 02-10-2023 Through 06-10-2023",

year = "2023",

doi = "10.1109/ICCV51070.2023.01410",

language = "English",

series = "Proceedings of the IEEE International Conference on Computer Vision",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "15326--15337",

booktitle = "Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023",

}

Guo, Z, Tang, Y, Zhang, R, Wang, D, Wang, Z, Zhao, B & Li, X 2023, ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding. in Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023. Proceedings of the IEEE International Conference on Computer Vision, Institute of Electrical and Electronics Engineers Inc., pp. 15326-15337, 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, 2/10/23. https://doi.org/10.1109/ICCV51070.2023.01410

ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding. / Guo, Zoey; Tang, Yiwen; Zhang, Ray et al.
Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023. Institute of Electrical and Electronics Engineers Inc., 2023. p. 15326-15337 (Proceedings of the IEEE International Conference on Computer Vision).

TY - GEN

T1 - ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding

T2 - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023

AU - Guo, Zoey

AU - Tang, Yiwen

AU - Zhang, Ray

AU - Wang, Dong

AU - Wang, Zhigang

AU - Zhao, Bin

AU - Li, Xuelong

PY - 2023

Y1 - 2023

AB - Understanding 3D scenes from multi-view inputs has been proven to alleviate the view discrepancy issue in 3D visual grounding. However, existing methods normally neglect the view cues embedded in the text modality and fail to weigh the relative importance of different views. In this paper, we propose ViewRefer, a multi-view framework for 3D visual grounding exploring how to grasp the view knowledge from both text and 3D modalities. For the text branch, ViewRefer leverages the diverse linguistic knowledge of large-scale language models, e.g., GPT, to expand a single grounding text to multiple geometry-consistent descriptions. Meanwhile, in the 3D modality, a transformer fusion module with inter-view attention is introduced to boost the interaction of objects across views. On top of that, we further present a set of learnable multi-view prototypes, which memorize scene-agnostic knowledge for different views, and enhance the framework from two perspectives: a view-guided attention module for more robust text features, and a view-guided scoring strategy during the final prediction. With our designed paradigm, ViewRefer achieves superior performance on three benchmarks and surpasses the second-best by +2.8%, +1.5%, and +1.35% on Sr3D, Nr3D, and ScanRefer.

UR - http://www.scopus.com/inward/record.url?scp=85185870719&partnerID=8YFLogxK

U2 - 10.1109/ICCV51070.2023.01410

DO - 10.1109/ICCV51070.2023.01410

M3 - Conference contribution

AN - SCOPUS:85185870719

T3 - Proceedings of the IEEE International Conference on Computer Vision

SP - 15326

EP - 15337

BT - Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023

PB - Institute of Electrical and Electronics Engineers Inc.

Y2 - 2 October 2023 through 6 October 2023

ER -

Guo Z, Tang Y, Zhang R, Wang D, Wang Z, Zhao B et al. ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding. In Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023. Institute of Electrical and Electronics Engineers Inc. 2023. p. 15326-15337. (Proceedings of the IEEE International Conference on Computer Vision). doi: 10.1109/ICCV51070.2023.01410
