TY - JOUR
T1 - Cutting-edge research and latest trends in 3D visual-language reasoning
AU - Lei, Yinjie
AU - Xu, Kai
AU - Guo, Yulan
AU - Yang, Xin
AU - Wu, Yuwei
AU - Hu, Wei
AU - Yang, Jiaqi
AU - Wang, Hanyun
N1 - Publisher Copyright:
© 2024 Editorial and Publishing Board of JIG. All rights reserved.
PY - 2024/6
Y1 - 2024/6
N2 - The core of 3D visual reasoning is to understand the relationships among different visual entities in point cloud scenes. Traditional 3D visual reasoning typically requires users to possess professional expertise, so nonprofessional users struggle to convey their intentions to computers, which hinders the popularization and advancement of this technology. Users now expect a more convenient way to communicate their intentions to the computer, exchange information, and obtain personalized results. To address this issue, researchers use natural language as semantic context or query criteria to capture user intentions, and they accomplish various tasks by interacting natural language with 3D point clouds. Through such multimodal interaction, often built on Transformer or graph neural network architectures, current approaches can not only locate the entities mentioned by users (e.g., visual grounding and open-vocabulary recognition) but also generate user-required content (e.g., dense captioning, visual question answering, and scene generation). Specifically, 3D visual grounding locates desired objects or regions in a 3D point cloud scene based on an object-related linguistic query. Open-vocabulary 3D recognition identifies and localizes 3D objects of novel classes defined by an unbounded (open) vocabulary at inference, generalizing beyond the limited set of base classes labeled during training. 3D dense captioning identifies all possible instances within a 3D point cloud scene and generates a natural language description for each instance. The goal of 3D visual question answering is to comprehend an entire 3D scene and provide an appropriate answer. Text-guided scene generation synthesizes a realistic 3D scene, composed of a complex background and multiple objects, from natural language descriptions. This paradigm, known as 3D visual-language understanding, has gained significant traction in recent years in fields such as autonomous driving, robot navigation, and human-computer interaction. Consequently, it has become a highly anticipated research direction within the computer vision domain. Over the past three years, 3D visual-language understanding technology has developed rapidly and shown a flourishing trend, yet comprehensive summaries of the latest research progress remain lacking. It is therefore necessary to systematically summarize recent studies, comprehensively evaluate the performance of different approaches, and point out promising future research directions; this survey aims to fill that gap. To this end, this study focuses on the two most representative lines of 3D visual-language understanding technology, bounding box prediction and content generation, and systematically summarizes their latest research advances. First, the study provides an overview of the problem definition and existing challenges in 3D visual-language understanding and outlines common backbones used in this area. The challenges include 3D-language alignment and complex scene understanding, while common backbones involve a priori rules, multilayer perceptrons, graph neural networks, and Transformer architectures.
Subsequently, the study delves into the downstream scenarios of the two technique types, bounding box prediction and content generation, and thoroughly explores the advantages and disadvantages of each method. Furthermore, the study compares and analyzes the performance of various methods on different benchmark datasets. Finally, the study concludes by looking ahead to the future prospects of 3D visual-language reasoning technology, with the aim of promoting further research and widespread application in this field. The major contributions of this study can be summarized as follows: 1) Systematic survey of 3D visual-language understanding. To the best of our knowledge, this survey is the first to thoroughly discuss recent advances in 3D visual-language understanding. We categorize algorithms into taxonomies from the perspective of downstream scenarios to give readers a clear view of the field. 2) Comprehensive performance evaluation and analysis. We compare existing 3D visual-language understanding approaches on several publicly available datasets. Our in-depth analysis can help researchers select a baseline suitable for their specific applications and offers valuable insights into modifying existing methods. 3) Insightful discussion of future prospects. Based on the systematic survey and comprehensive performance comparison, promising future research directions are discussed, including large-scale 3D foundation models, the computational efficiency of 3D modeling, and the incorporation of additional modalities.
KW - 3D visual-language understanding
KW - computer vision
KW - cross-modal learning
KW - deep learning
KW - dense captioning
KW - scene generation
KW - visual grounding
KW - visual question answering
UR - http://www.scopus.com/inward/record.url?scp=85196873369&partnerID=8YFLogxK
U2 - 10.11834/jig.240029
DO - 10.11834/jig.240029
M3 - Article
AN - SCOPUS:85196873369
SN - 1006-8961
VL - 29
SP - 1747
EP - 1764
JO - Journal of Image and Graphics
JF - Journal of Image and Graphics
IS - 6
ER -