RGS: A Unified Representation-Generation-Selection Framework for Knowledge-Based Visual Question Answering

  • Junbo Wang
  • Xiangdong Nian
  • Yuke Li
  • Yining Zhu
  • Hongsong Wang
  • Jiangbin Zheng
  • Zhiyong Wang
  • Northwestern Polytechnical University Xian
  • Southeast University, Nanjing
  • The University of Sydney

Research output: Contribution to journal › Article › peer-review

Abstract

Knowledge-based visual question answering (VQA) is the task of answering questions that require knowledge beyond the image itself. Existing methods either retrieve explicit knowledge from external knowledge bases or use large language models (LLMs) as a source of implicit knowledge. However, constructing and querying these knowledge bases requires a complicated pipeline and can introduce information unrelated to the question. LLM-based methods, in turn, prompt the LLM with question-related image captions as context, which often provide insufficient contextual information and lack the key details needed to answer the question. To address these issues, we propose a unified representation-generation-selection framework, named RGS. It first obtains a multi-source captioning-fused image context representation from multiple question-aware image captioning models, then generates candidate answers via a two-stream in-context learning method combined with the image context representation, and finally selects the best answer from the candidates by means of instruction tuning. To balance complexity and performance, we finetune only two additional models: a question-aware image captioning model named InstructCap and an optimal answer reasoning model named InstructJudge. Whereas most methods rely on LLMs with over 100 billion parameters (e.g., the 175B-parameter GPT-3), our approach uses the 7B-parameter LLM Mistral-7B as its source of implicit knowledge, achieving state-of-the-art performance on multiple knowledge-based VQA benchmarks and significantly outperforming several previous state-of-the-art methods.
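
The three stages named in the abstract (representation, generation, selection) can be pictured as a simple control flow. The sketch below is only an illustration under stated assumptions: every function name, type alias, and toy stand-in is hypothetical and not taken from the paper. In particular, the captioners stand in for question-aware captioning models such as InstructCap, the two generator callables stand in for the two streams of in-context learning, and the majority-vote selector is a placeholder for an instruction-tuned judge such as InstructJudge, not a description of how it works.

```python
from collections import Counter
from typing import Callable, List, Sequence

# All names below are hypothetical placeholders, not the paper's API.
Captioner = Callable[[str, str], str]                 # (image_id, question) -> caption
Generator = Callable[[str, str], List[str]]           # (context, question) -> candidates
Selector = Callable[[str, str, Sequence[str]], str]   # (context, question, candidates) -> answer


def build_context(image_id: str, question: str, captioners: Sequence[Captioner]) -> str:
    """Representation: fuse captions from several question-aware captioners."""
    captions = [caption(image_id, question) for caption in captioners]
    # Drop exact duplicates while preserving the original order.
    return " ".join(dict.fromkeys(captions))


def rgs_answer(
    image_id: str,
    question: str,
    captioners: Sequence[Captioner],
    generators: Sequence[Generator],
    selector: Selector,
) -> str:
    """Run representation, then generation, then selection."""
    context = build_context(image_id, question, captioners)
    candidates: List[str] = []
    for generate in generators:  # Generation: one call per stream.
        candidates.extend(generate(context, question))
    return selector(context, question, candidates)  # Selection.


# Toy stand-ins so the sketch runs without any model (assumptions only).
def caption_scene(image_id: str, question: str) -> str:
    return "a person walking in the rain"


def caption_object(image_id: str, question: str) -> str:
    return "a person holding an umbrella"


def stream_a(context: str, question: str) -> List[str]:
    return ["umbrella", "raincoat"]


def stream_b(context: str, question: str) -> List[str]:
    return ["umbrella"]


def majority_selector(context: str, question: str, candidates: Sequence[str]) -> str:
    # Placeholder: pick the most frequent candidate across streams.
    return Counter(candidates).most_common(1)[0][0]


answer = rgs_answer(
    "img0",
    "What keeps the person dry?",
    [caption_scene, caption_object],
    [stream_a, stream_b],
    majority_selector,
)
print(answer)
```

Passing the three stages as plain callables keeps the sketch independent of any particular model; in a real system each callable would wrap a model call.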

Original language: English
Journal: IEEE Transactions on Multimedia
State: Accepted/In press - 2026

Keywords

  • Image captioning
  • Large language models
  • Visual question answering
  • Visual-language reasoning
