Abstract
The ability to comprehensively understand multimodal textbook content is crucial for developing advanced intelligent tutoring systems and educational tools powered by generative AI. Earlier studies have advanced the understanding of multimodal content in educational settings by examining static cross-modal graphs that illustrate the relationships between visual objects and textual words. Such static graphs, however, fail to capture how the structure of visual-textual relationships changes across different cross-modal tasks. To tackle this issue, we present the Cross-Modal Multi-Relational Graph Reasoning (CMRGR) model. CMRGR analyzes a wide range of interactions between the visual and textual components found in textbooks and dynamically adapts its internal representations by exploiting contextual signals across tasks, a capability that is indispensable for building generative AI systems aimed at educational applications. We evaluate CMRGR on three multimodal textbook datasets, demonstrating its superiority over state-of-the-art baselines in producing accurate classifications and answers.
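The abstract does not disclose implementation details, so the following is only a minimal illustrative sketch of the general idea it describes: relation-specific message passing over a graph of mixed visual and textual nodes, with a context gate that rescales each relation's messages depending on the task. All class names, tensor shapes, and the gating scheme below are assumptions for illustration, not the authors' code.

```python
# Hypothetical sketch: multi-relational message passing between visual and
# textual nodes, modulated by a task/context vector. Not the CMRGR source.
import torch
import torch.nn as nn


class MultiRelationalGraphLayer(nn.Module):
    """One round of relation-specific message passing (R-GCN style),
    with a context gate that rescales messages per relation and task."""

    def __init__(self, dim: int, num_relations: int, ctx_dim: int):
        super().__init__()
        # One projection per relation type (e.g. image-to-text, text-to-text).
        self.rel_proj = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(num_relations)
        )
        self.self_proj = nn.Linear(dim, dim)
        # Context gate: maps the task signal to one scalar in (0, 1) per relation.
        self.gate = nn.Linear(ctx_dim, num_relations)

    def forward(self, h, adj, ctx):
        # h:   (N, dim)   node features (visual + textual nodes stacked)
        # adj: (R, N, N)  one row-normalized adjacency matrix per relation
        # ctx: (ctx_dim,) task/context embedding
        gates = torch.sigmoid(self.gate(ctx))  # (R,) per-relation gates
        out = self.self_proj(h)
        for r, proj in enumerate(self.rel_proj):
            # Aggregate neighbors under relation r, scaled by its task gate.
            out = out + gates[r] * (adj[r] @ proj(h))
        return torch.relu(out)


if __name__ == "__main__":
    N, R, dim, ctx_dim = 6, 3, 16, 8
    layer = MultiRelationalGraphLayer(dim, R, ctx_dim)
    h = torch.randn(N, dim)
    adj = torch.rand(R, N, N)
    adj = adj / adj.sum(-1, keepdim=True)  # row-normalize each relation
    h_next = layer(h, adj, torch.randn(ctx_dim))
    print(h_next.shape)  # torch.Size([6, 16])
```

Gating the per-relation messages on a context vector is one simple way to realize "adapting internal representations via contextual signals across tasks"; the paper may use a more elaborate mechanism.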
| Original language | English |
| --- | --- |
| Article number | 103082 |
| Journal | Information Fusion |
| Volume | 120 |
| DOI | |
| Publication status | Published - Aug 2025 |