Mono3DVG: 3D Visual Grounding in Monocular Images

Yang Zhan; Yuan Yuan; Zhitong Xiong

doi:10.1609/aaai.v38i7.28525

School of Artificial Intelligence, OPtics and Electronics

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

5 Scopus citations

Abstract

We introduce a novel task of 3D visual grounding in monocular RGB images using language descriptions with both appearance and geometry information. Specifically, we build a large-scale dataset, Mono3DRefer, which contains 3D object targets with their corresponding geometric text descriptions, generated by ChatGPT and refined manually. To foster this task, we propose Mono3DVG-TR, an end-to-end transformer-based network, which takes advantage of both the appearance and geometry information in text embeddings for multi-modal learning and 3D object localization. Depth predictor is designed to explicitly learn geometry features. The dual text-guided adapter is proposed to refine multiscale visual and geometry features of the referred object. Based on depth-text-visual stacking attention, the decoder fuses object-level geometric cues and visual appearance into a learnable query. Comprehensive benchmarks and some insightful analyses are provided for Mono3DVG. Extensive comparisons and ablation studies show that our method significantly outperforms all baselines. The dataset and code will be released.

Original language	English
Title of host publication	Technical Tracks 14
Editors	Michael Wooldridge, Jennifer Dy, Sriraam Natarajan
Publisher	Association for the Advancement of Artificial Intelligence
Pages	6988-6996
Number of pages	9
Edition	7
ISBN (Electronic)	1577358872, 9781577358879
DOIs	https://doi.org/10.1609/aaai.v38i7.28525
State	Published - 25 Mar 2024
Event	38th AAAI Conference on Artificial Intelligence, AAAI 2024 - Vancouver, Canada Duration: 20 Feb 2024 → 27 Feb 2024

Publication series

Name	Proceedings of the AAAI Conference on Artificial Intelligence
Number	7
Volume	38
ISSN (Print)	2159-5399
ISSN (Electronic)	2374-3468

Conference

Conference	38th AAAI Conference on Artificial Intelligence, AAAI 2024
Country/Territory	Canada
City	Vancouver
Period	20/02/24 → 27/02/24

Access to Document

10.1609/aaai.v38i7.28525

Cite this

@inproceedings{8ac83bd548ea472d9ac9b64a86783183,

title = "Mono3DVG: 3D Visual Grounding in Monocular Images",

abstract = "We introduce a novel task of 3D visual grounding in monocular RGB images using language descriptions with both appearance and geometry information. Specifically, we build a large-scale dataset, Mono3DRefer, which contains 3D object targets with their corresponding geometric text descriptions, generated by ChatGPT and refined manually. To foster this task, we propose Mono3DVG-TR, an end-to-end transformer-based network, which takes advantage of both the appearance and geometry information in text embeddings for multi-modal learning and 3D object localization. Depth predictor is designed to explicitly learn geometry features. The dual text-guided adapter is proposed to refine multiscale visual and geometry features of the referred object. Based on depth-text-visual stacking attention, the decoder fuses object-level geometric cues and visual appearance into a learnable query. Comprehensive benchmarks and some insightful analyses are provided for Mono3DVG. Extensive comparisons and ablation studies show that our method significantly outperforms all baselines. The dataset and code will be released.",

author = "Yang Zhan and Yuan Yuan and Zhitong Xiong",

note = "Publisher Copyright: Copyright {\textcopyright} 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.; 38th AAAI Conference on Artificial Intelligence, AAAI 2024 ; Conference date: 20-02-2024 Through 27-02-2024",

year = "2024",

month = mar,

day = "25",

doi = "10.1609/aaai.v38i7.28525",

language = "English",

series = "Proceedings of the AAAI Conference on Artificial Intelligence",

publisher = "Association for the Advancement of Artificial Intelligence",

number = "7",

pages = "6988--6996",

editor = "Michael Wooldridge and Jennifer Dy and Sriraam Natarajan",

booktitle = "Technical Tracks 14",

edition = "7",

}

Zhan, Y, Yuan, Y & Xiong, Z 2024, Mono3DVG: 3D Visual Grounding in Monocular Images. in M Wooldridge, J Dy & S Natarajan (eds), Technical Tracks 14. 7 edn, Proceedings of the AAAI Conference on Artificial Intelligence, no. 7, vol. 38, Association for the Advancement of Artificial Intelligence, pp. 6988-6996, 38th AAAI Conference on Artificial Intelligence, AAAI 2024, Vancouver, Canada, 20/02/24. https://doi.org/10.1609/aaai.v38i7.28525

Mono3DVG: 3D Visual Grounding in Monocular Images. / Zhan, Yang; Yuan, Yuan; Xiong, Zhitong.
Technical Tracks 14. ed. / Michael Wooldridge; Jennifer Dy; Sriraam Natarajan. 7. ed. Association for the Advancement of Artificial Intelligence, 2024. p. 6988-6996 (Proceedings of the AAAI Conference on Artificial Intelligence; Vol. 38, No. 7).

TY - GEN

T1 - Mono3DVG: 3D Visual Grounding in Monocular Images

T2 - 38th AAAI Conference on Artificial Intelligence, AAAI 2024

AU - Zhan, Yang

AU - Yuan, Yuan

AU - Xiong, Zhitong

PY - 2024/3/25

Y1 - 2024/3/25

AB - We introduce a novel task of 3D visual grounding in monocular RGB images using language descriptions with both appearance and geometry information. Specifically, we build a large-scale dataset, Mono3DRefer, which contains 3D object targets with their corresponding geometric text descriptions, generated by ChatGPT and refined manually. To foster this task, we propose Mono3DVG-TR, an end-to-end transformer-based network, which takes advantage of both the appearance and geometry information in text embeddings for multi-modal learning and 3D object localization. Depth predictor is designed to explicitly learn geometry features. The dual text-guided adapter is proposed to refine multiscale visual and geometry features of the referred object. Based on depth-text-visual stacking attention, the decoder fuses object-level geometric cues and visual appearance into a learnable query. Comprehensive benchmarks and some insightful analyses are provided for Mono3DVG. Extensive comparisons and ablation studies show that our method significantly outperforms all baselines. The dataset and code will be released.

UR - http://www.scopus.com/inward/record.url?scp=85189559856&partnerID=8YFLogxK

U2 - 10.1609/aaai.v38i7.28525

DO - 10.1609/aaai.v38i7.28525

M3 - Conference contribution

AN - SCOPUS:85189559856

T3 - Proceedings of the AAAI Conference on Artificial Intelligence

SP - 6988

EP - 6996

BT - Technical Tracks 14

A2 - Wooldridge, Michael

A2 - Dy, Jennifer

A2 - Natarajan, Sriraam

PB - Association for the Advancement of Artificial Intelligence

Y2 - 20 February 2024 through 27 February 2024

ER -
