TY - JOUR
T1 - RGS: A Unified Representation-Generation-Selection Framework for Knowledge-Based Visual Question Answering
AU - Wang, Junbo
AU - Nian, Xiangdong
AU - Li, Yuke
AU - Zhu, Yining
AU - Wang, Hongsong
AU - Zheng, Jiangbin
AU - Wang, Zhiyong
N1 - Publisher Copyright: © 1999-2012 IEEE.
PY - 2026
Y1 - 2026
N2 - Knowledge-based visual question answering (VQA) is the task of answering questions that require knowledge beyond the image itself. Existing methods either retrieve explicit knowledge from external knowledge bases or use large language models (LLMs) as a source of implicit knowledge. However, constructing and querying these knowledge bases requires a complicated pipeline and can introduce information unrelated to the question. In addition, LLM-based methods prompt LLMs with question-related image captions as context, which often provides insufficient contextual information and omits key details needed to answer the question. To address these issues, we propose a unified representation-generation-selection framework, named RGS, which first obtains a multi-source captioning-fused image context representation from multiple question-aware image captioning models, then generates candidate answers via a two-stream in-context learning method combined with the image context representation, and finally selects the best answer from the candidates by means of instruction tuning. To balance complexity and performance, we additionally finetune only two models: a question-aware image captioning model named InstructCap and an optimal answer reasoning model named InstructJudge. Whereas most methods rely on LLMs with over 100 billion parameters (e.g., the 175B-parameter GPT-3), our approach uses the 7B-parameter LLM Mistral-7B as its source of implicit knowledge, achieving state-of-the-art performance on multiple knowledge-based VQA benchmarks and significantly outperforming several previous state-of-the-art methods.
KW - Image captioning
KW - Large language models
KW - Visual question answering
KW - Visual-language reasoning
UR - https://www.scopus.com/pages/publications/105037736111
U2 - 10.1109/TMM.2026.3689433
DO - 10.1109/TMM.2026.3689433
M3 - Article
AN - SCOPUS:105037736111
SN - 1520-9210
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -