Bridging the Semantic Gap in Medical Visual Question Answering With Prompt Learning

Research output: Contribution to journal › Article › peer-review

5 Scopus citations

Abstract

Medical Visual Question Answering (Med-VQA) aims to answer questions regarding the content of medical images, crucial for enhancing diagnostics and education in healthcare. However, progress in this field is hindered by data scarcity due to the resource-intensive nature of medical data annotation. While existing Med-VQA approaches often rely on pre-training to mitigate this issue, bridging the semantic gap between pre-trained models and specific tasks remains a significant challenge. This paper presents the Dynamic Semantic-Adaptive Prompting (DSAP) framework, leveraging prompt learning to enhance model performance in Med-VQA. To this end, we introduce two prompting strategies: Semantic Alignment Prompting (SAP) and Dynamic Question-Aware Prompting (DQAP). SAP prompts multi-modal inputs during fine-tuning, reducing the semantic gap by aligning model outputs with domain-specific contexts. Simultaneously, DQAP enhances answer selection by leveraging grammatical relationships between questions and answers, thereby improving accuracy and relevance. The DSAP framework was pre-trained on three datasets—ROCO, MedICaT, and MIMIC-CXR—and comprehensively evaluated against 15 existing Med-VQA models on three public datasets: VQA-RAD, SLAKE, and PathVQA. Our results demonstrate a substantial performance improvement, with DSAP achieving a 1.9% enhancement in average results across benchmarks. These findings underscore DSAP’s effectiveness in addressing critical challenges in Med-VQA and suggest promising avenues for future developments in medical AI.

Original language: English
Pages (from-to): 4605-4616
Number of pages: 12
Journal: IEEE Transactions on Medical Imaging
Volume: 44
Issue number: 11
State: Published - 2025

Keywords

  • Medical visual question answering
  • medical vision-language pre-training
  • prompt learning
