Bridging the Semantic Gap in Medical Visual Question Answering with Prompt Learning

Zilin Lu, Qingjie Zeng, Mengkang Lu, Geng Chen, Yong Xia

Research output: Contribution to journal › Article › peer-review

Abstract

Medical Visual Question Answering (Med-VQA) aims to answer questions about the content of medical images and is crucial for enhancing diagnostics and education in healthcare. However, progress in this field is hindered by data scarcity, a consequence of the resource-intensive nature of medical data annotation. While existing Med-VQA approaches often rely on pre-training to mitigate this issue, bridging the semantic gap between pre-trained models and specific tasks remains a significant challenge. This paper presents the Dynamic Semantic-Adaptive Prompting (DSAP) framework, which leverages prompt learning to enhance model performance in Med-VQA. To this end, we introduce two prompting strategies: Semantic Alignment Prompting (SAP) and Dynamic Question-Aware Prompting (DQAP). SAP prompts multi-modal inputs during fine-tuning, reducing the semantic gap by aligning model outputs with domain-specific contexts. DQAP, in turn, enhances answer selection by leveraging grammatical relationships between questions and answers, thereby improving accuracy and relevance. The DSAP framework was pre-trained on three datasets (ROCO, MedICaT, and MIMIC-CXR) and comprehensively evaluated against 15 existing Med-VQA models on three public datasets: VQA-RAD, SLAKE, and PathVQA. Our results show that DSAP improves average performance across these benchmarks by 1.9%. These findings underscore DSAP’s effectiveness in addressing critical challenges in Med-VQA and suggest promising avenues for future developments in medical AI.

Original language: English
Journal: IEEE Transactions on Medical Imaging
State: Accepted/In press - 2025

Keywords

  • medical vision-language pre-training
  • medical visual question answering
  • prompt learning
