Adaptive Medical Topic Learning for Enhanced Fine-grained Cross-modal Alignment in Medical Report Generation

Xin Mei; Libin Yang; Denghong Gao; Xiaoyan Cai; Junwei Han; Tianming Liu

doi:10.1109/TMM.2025.3543101

School of Automation

Research output: Contribution to journal › Article › peer-review

Abstract

Medical report generation is the automatic creation of accurate and coherent diagnostic reports for medical images. The task can reduce radiologists' workload and improve the efficiency of disease diagnosis, which makes it both valuable and challenging. Because features differ across modalities, existing methods primarily facilitate medical report generation through cross-modal alignment of images and texts. However, medical images are very similar to each other and it is difficult to tag obvious objects in them, so most methods are limited to coarse-grained, global image-text alignment. In this paper, we propose a medical report generation model based on adaptive topic learning and fine-grained cross-modal alignment, which aligns images and texts from both a medical topic perspective and a token perspective. From the medical topic perspective, a global-local contrastive loss is introduced to adaptively learn efficient medical topic features, and the medical topics are used to map images and texts into the same semantic space for fine-grained alignment. From the token perspective, a token prediction module is designed to enable the model to focus on important local information by predicting the key tokens contained in the report. Experimental results on two public datasets (IU-Xray and MIMIC-CXR) demonstrate that the proposed model outperforms state-of-the-art baselines.

Original language	English
Journal	IEEE Transactions on Multimedia
DOIs	https://doi.org/10.1109/TMM.2025.3543101
State	Accepted/In press - 2025

Keywords

Medical report generation
cross-modal alignment
deep learning
feature extraction

Access to Document

10.1109/TMM.2025.3543101

Cite this

@article{4a2e6c273ecc4ff1be77e5286b7fa984,

title = "Adaptive Medical Topic Learning for Enhanced Fine-grained Cross-modal Alignment in Medical Report Generation",

abstract = "Medical report generation is the automatic creation of accurate and coherent diagnostic reports for medical images. The task can reduce radiologists' workload and improve the efficiency of disease diagnosis, which makes it both valuable and challenging. Because features differ across modalities, existing methods primarily facilitate medical report generation through cross-modal alignment of images and texts. However, medical images are very similar to each other and it is difficult to tag obvious objects in them, so most methods are limited to coarse-grained, global image-text alignment. In this paper, we propose a medical report generation model based on adaptive topic learning and fine-grained cross-modal alignment, which aligns images and texts from both a medical topic perspective and a token perspective. From the medical topic perspective, a global-local contrastive loss is introduced to adaptively learn efficient medical topic features, and the medical topics are used to map images and texts into the same semantic space for fine-grained alignment. From the token perspective, a token prediction module is designed to enable the model to focus on important local information by predicting the key tokens contained in the report. Experimental results on two public datasets (IU-Xray and MIMIC-CXR) demonstrate that the proposed model outperforms state-of-the-art baselines.",

keywords = "Medical report generation, cross-modal alignment, deep learning, feature extraction",

author = "Xin Mei and Libin Yang and Denghong Gao and Xiaoyan Cai and Junwei Han and Tianming Liu",

note = "Publisher Copyright: {\textcopyright} 2025 IEEE.",

year = "2025",

doi = "10.1109/TMM.2025.3543101",

language = "English",

journal = "IEEE Transactions on Multimedia",

issn = "1520-9210",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Adaptive Medical Topic Learning for Enhanced Fine-grained Cross-modal Alignment in Medical Report Generation

AU - Mei, Xin

AU - Yang, Libin

AU - Gao, Denghong

AU - Cai, Xiaoyan

AU - Han, Junwei

AU - Liu, Tianming

PY - 2025

Y1 - 2025

AB - Medical report generation is the automatic creation of accurate and coherent diagnostic reports for medical images. The task can reduce radiologists' workload and improve the efficiency of disease diagnosis, which makes it both valuable and challenging. Because features differ across modalities, existing methods primarily facilitate medical report generation through cross-modal alignment of images and texts. However, medical images are very similar to each other and it is difficult to tag obvious objects in them, so most methods are limited to coarse-grained, global image-text alignment. In this paper, we propose a medical report generation model based on adaptive topic learning and fine-grained cross-modal alignment, which aligns images and texts from both a medical topic perspective and a token perspective. From the medical topic perspective, a global-local contrastive loss is introduced to adaptively learn efficient medical topic features, and the medical topics are used to map images and texts into the same semantic space for fine-grained alignment. From the token perspective, a token prediction module is designed to enable the model to focus on important local information by predicting the key tokens contained in the report. Experimental results on two public datasets (IU-Xray and MIMIC-CXR) demonstrate that the proposed model outperforms state-of-the-art baselines.

KW - Medical report generation

KW - cross-modal alignment

KW - deep learning

KW - feature extraction

UR - http://www.scopus.com/inward/record.url?scp=85218777502&partnerID=8YFLogxK

U2 - 10.1109/TMM.2025.3543101

DO - 10.1109/TMM.2025.3543101

M3 - Article

AN - SCOPUS:85218777502

SN - 1520-9210

JO - IEEE Transactions on Multimedia

JF - IEEE Transactions on Multimedia

ER -
