Abstract
The ability to comprehensively understand multimodal textbook content is crucial for developing advanced intelligent tutoring systems and educational tools powered by generative AI. Earlier studies have advanced the understanding of multimodal content in educational settings by examining static cross-modal graphs that illustrate the relationships between visual objects and textual words. Such static graphs, however, fail to capture how the structure of visual-textual relationships changes across different cross-modal tasks. To tackle this issue, we present the Cross-Modal Multi-Relational Graph Reasoning (CMRGR) model. CMRGR analyzes a wide range of interactions between the visual and textual components found in textbooks and dynamically adapts its internal representations by exploiting contextual signals across tasks, a capability that is indispensable for building generative AI systems aimed at educational applications. We evaluate CMRGR on three multimodal textbook datasets, demonstrating its superiority over state-of-the-art baselines in producing accurate classifications and answers.
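The abstract does not disclose implementation details, so the following is only a minimal illustrative sketch of the general idea it describes: relation-specific message passing over a graph of mixed visual and textual nodes, with a context gate that rescales each relation's messages depending on the task. All class names, tensor shapes, and the gating scheme below are assumptions for illustration, not the authors' code.

```python
# Hypothetical sketch: multi-relational message passing between visual and
# textual nodes, modulated by a task/context vector. Not the CMRGR source.
import torch
import torch.nn as nn


class MultiRelationalGraphLayer(nn.Module):
    """One round of relation-specific message passing (R-GCN style),
    with a context gate that rescales messages per relation and task."""

    def __init__(self, dim: int, num_relations: int, ctx_dim: int):
        super().__init__()
        # One projection per relation type (e.g. image-to-text, text-to-text).
        self.rel_proj = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(num_relations)
        )
        self.self_proj = nn.Linear(dim, dim)
        # Context gate: maps the task signal to one scalar in (0, 1) per relation.
        self.gate = nn.Linear(ctx_dim, num_relations)

    def forward(self, h, adj, ctx):
        # h:   (N, dim)   node features (visual + textual nodes stacked)
        # adj: (R, N, N)  one row-normalized adjacency matrix per relation
        # ctx: (ctx_dim,) task/context embedding
        gates = torch.sigmoid(self.gate(ctx))  # (R,) per-relation gates
        out = self.self_proj(h)
        for r, proj in enumerate(self.rel_proj):
            # Aggregate neighbors under relation r, scaled by its task gate.
            out = out + gates[r] * (adj[r] @ proj(h))
        return torch.relu(out)


if __name__ == "__main__":
    N, R, dim, ctx_dim = 6, 3, 16, 8
    layer = MultiRelationalGraphLayer(dim, R, ctx_dim)
    h = torch.randn(N, dim)
    adj = torch.rand(R, N, N)
    adj = adj / adj.sum(-1, keepdim=True)  # row-normalize each relation
    h_next = layer(h, adj, torch.randn(ctx_dim))
    print(h_next.shape)  # torch.Size([6, 16])
```

Gating the per-relation messages on a context vector is one simple way to realize "adapting internal representations via contextual signals across tasks"; the paper may use a more elaborate mechanism.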
| Original language | English |
| --- | --- |
| Article number | 103082 |
| Journal | Information Fusion |
| Volume | 120 |
| DOI | |
| Publication status | Published - Aug 2025 |