Adaptive Medical Topic Learning for Enhanced Fine-grained Cross-modal Alignment in Medical Report Generation

Xin Mei, Libin Yang, Denghong Gao, Xiaoyan Cai, Junwei Han, Tianming Liu

Research output: Contribution to journal › Article › peer-review

Abstract

Medical report generation is the automatic creation of accurate, coherent diagnostic reports from medical images. The task can reduce radiologists' workload and improve the efficiency of disease diagnosis, which makes it both valuable and challenging. Because image and text features differ across modalities, existing methods mainly support report generation through cross-modal alignment of images and texts. However, medical images closely resemble one another and rarely contain distinct objects that can be tagged, so most methods are limited to coarse-grained, global image-text alignment. In this paper, we propose a medical report generation model based on adaptive topic learning and fine-grained cross-modal alignment, which aligns images and texts from both a medical topic perspective and a token perspective. From the medical topic perspective, a global-local contrastive loss is introduced to adaptively learn efficient medical topic features, and the medical topics are used to map images and texts into the same semantic space for fine-grained alignment. From the token perspective, a token prediction module enables the model to focus on important local information by predicting the key tokens contained in the report. Experimental results on two public datasets, IU-Xray and MIMIC-CXR, demonstrate that the proposed model outperforms state-of-the-art baselines.
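
The abstract does not give the form of the global-local contrastive loss, so the following is only a minimal sketch of a generic symmetric image-text contrastive (InfoNCE-style) objective, the kind of loss commonly used to pull paired image and report embeddings into a shared semantic space. The function names, the temperature value, and the use of one pooled embedding per image and per report are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    """Generic symmetric contrastive loss over a batch of paired embeddings.

    Row i of image_emb and row i of text_emb are assumed to come from the
    same study (a positive pair); every other row in the batch is treated
    as a negative. The temperature of 0.07 is an illustrative default.
    """
    img = l2_normalize(np.asarray(image_emb, dtype=float))
    txt = l2_normalize(np.asarray(text_emb, dtype=float))
    logits = img @ txt.T / temperature  # shape (batch, batch)
    idx = np.arange(logits.shape[0])
    # Image-to-text: each image should rank its own report highest.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each report should rank its own image highest.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(4, 8))
    reports = images + 0.05 * rng.normal(size=(4, 8))  # nearly aligned pairs
    print("aligned loss:   ", symmetric_info_nce(images, reports))
    print("misaligned loss:", symmetric_info_nce(images, np.roll(reports, 1, axis=0)))
```

In this sketch, correctly paired batches yield a lower loss than batches whose pairing has been shifted. A loss of the kind the abstract describes would presumably also involve local (region- or topic-level) terms, which are omitted here.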

Original language: English
Journal: IEEE Transactions on Multimedia
State: Accepted/In press - 2025

Keywords

  • Medical report generation
  • cross-modal alignment
  • deep learning
  • feature extraction
