@inproceedings{8ac83bd548ea472d9ac9b64a86783183,
  title     = {{Mono3DVG}: {3D} Visual Grounding in Monocular Images},
  abstract  = {We introduce a novel task of 3D visual grounding in monocular RGB images using language descriptions with both appearance and geometry information. Specifically, we build a large-scale dataset, Mono3DRefer, which contains 3D object targets with their corresponding geometric text descriptions, generated by ChatGPT and refined manually. To foster this task, we propose Mono3DVG-TR, an end-to-end transformer-based network, which takes advantage of both the appearance and geometry information in text embeddings for multi-modal learning and 3D object localization. Depth predictor is designed to explicitly learn geometry features. The dual text-guided adapter is proposed to refine multiscale visual and geometry features of the referred object. Based on depth-text-visual stacking attention, the decoder fuses object-level geometric cues and visual appearance into a learnable query. Comprehensive benchmarks and some insightful analyses are provided for Mono3DVG. Extensive comparisons and ablation studies show that our method significantly outperforms all baselines. The dataset and code will be released.},
  author    = {Zhan, Yang and Yuan, Yuan and Xiong, Zhitong},
  note      = {Publisher Copyright: Copyright {\textcopyright} 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.; 38th AAAI Conference on Artificial Intelligence, AAAI 2024 ; Conference date: 20-02-2024 Through 27-02-2024},
  year      = {2024},
  month     = mar,
  day       = {25},
  doi       = {10.1609/aaai.v38i7.28525},
  language  = {English},
  series    = {Proceedings of the AAAI Conference on Artificial Intelligence},
  publisher = {Association for the Advancement of Artificial Intelligence},
  volume    = {38},
  number    = {7},
  pages     = {6988--6996},
  editor    = {Wooldridge, Michael and Dy, Jennifer and Natarajan, Sriraam},
  booktitle = {Proceedings of the 38th {AAAI} Conference on Artificial Intelligence, {AAAI} 2024},
}