RGS: A Unified Representation-Generation-Selection Framework for Knowledge-Based Visual Question Answering

  • Junbo Wang
  • Xiangdong Nian
  • Yuke Li
  • Yining Zhu
  • Hongsong Wang
  • Jiangbin Zheng
  • Zhiyong Wang

  • Northwestern Polytechnical University Xian
  • Southeast University, Nanjing
  • The University of Sydney

Research output: Contribution to journal, article, peer-reviewed

Abstract

Knowledge-based visual question answering (VQA) requires answering questions using knowledge beyond what the image itself contains. Existing methods either retrieve from external knowledge bases to obtain explicit knowledge or use large language models (LLMs) as a source of implicit knowledge. However, constructing and querying these knowledge bases requires a complicated pipeline, and retrieval can introduce information unrelated to the question. In addition, LLM-based methods prompt the LLM with question-related image captions as context, but such captions provide insufficient context and lack the key details needed to answer the question. To address these issues, we propose a unified representation-generation-selection framework, named RGS, which works in three stages: it first obtains a multi-source captioning-fused image context representation from multiple question-aware image captioning models, then generates candidate answers via a two-stream in-context learning method combined with the image context representation, and finally selects the best answer from the candidates by means of instruction tuning. To balance complexity and performance, we additionally fine-tune only two models: a question-aware image captioning model named InstructCap and an optimal answer reasoning model named InstructJudge. Whereas most methods rely on LLMs with over 100 billion parameters (e.g., the 175B-parameter GPT-3), our approach uses the 7B-parameter Mistral-7B as its source of implicit knowledge, achieving state-of-the-art performance on multiple knowledge-based VQA benchmarks and significantly outperforming several previous state-of-the-art methods.
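
The three stages named in the abstract (representation, generation, selection) can be pictured as a simple control flow. The Python sketch below is only an illustration under stated assumptions, not the authors' implementation: every function name (`build_context`, `generate_candidates`, `rgs_answer`) is hypothetical, the captioners, generator and selector are placeholder callables standing in for real models, caption fusion is reduced to de-duplicated concatenation, and the "two streams" are modeled simply as two lists of in-context examples, since the abstract does not specify what they contain.

```python
from typing import Callable, List, Sequence

# Placeholder model interfaces (assumptions, not the paper's APIs).
Captioner = Callable[[str, str], str]             # (image_id, question) -> caption
Generator = Callable[[str], str]                  # prompt -> candidate answer
Selector = Callable[[str, str, List[str]], str]   # (question, context, candidates) -> answer


def build_context(image_id: str, question: str,
                  captioners: Sequence[Captioner]) -> str:
    """Representation: fuse captions from several question-aware captioners.

    Fusion here is plain concatenation of unique captions (an assumption).
    """
    captions = [captioner(image_id, question) for captioner in captioners]
    unique = list(dict.fromkeys(captions))  # drop exact duplicates, keep order
    return " ".join(unique)


def generate_candidates(question: str, context: str, generator: Generator,
                        example_streams: Sequence[Sequence[str]]) -> List[str]:
    """Generation: build one prompt per in-context example stream."""
    candidates = []
    for examples in example_streams:
        prompt = "\n".join([*examples,
                            f"Context: {context}",
                            f"Question: {question}",
                            "Answer:"])
        candidates.append(generator(prompt))
    return list(dict.fromkeys(candidates))  # unique candidates, order kept


def rgs_answer(image_id: str, question: str,
               captioners: Sequence[Captioner], generator: Generator,
               example_streams: Sequence[Sequence[str]],
               selector: Selector) -> str:
    """Selection: let a judge model pick one answer from the candidates."""
    context = build_context(image_id, question, captioners)
    candidates = generate_candidates(question, context, generator, example_streams)
    return selector(question, context, candidates)
```

In this sketch a fine-tuned captioner such as InstructCap would be one of the `captioners`, and a fine-tuned judge such as InstructJudge would play the role of `selector`; how those models are actually trained and prompted is not described in the abstract.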

Original language: English
Journal: IEEE Transactions on Multimedia
DOI
Publication status: Accepted/In press - 2026
