Prompt-Based Modality Alignment for Effective Multi-Modal Object Re-Identification

Shizhou Zhang, Wenlong Luo, De Cheng, Yinghui Xing, Guoqiang Liang, Peng Wang, Yanning Zhang

Research output: Journal article (peer-reviewed)

1 Citation (Scopus)

Abstract

A critical challenge for multi-modal Object Re-Identification (ReID) is the effective aggregation of complementary information to mitigate illumination issues. State-of-the-art methods typically employ complex and highly coupled architectures, which unavoidably incur heavy computational costs. Moreover, the significant distribution gap among different image spectra hinders the joint representation of multi-modal features. In this paper, we propose a framework named PromptMA to establish effective communication channels between different modality paths, thereby aggregating complementary modal information and bridging the distribution gap. Specifically, we inject a series of learnable multi-modal prompts into the Image Encoder and introduce a prompt exchange mechanism that lets the prompts alternately interact with the token embeddings of different modalities, thus capturing and distributing multi-modal features effectively. Building on top of the multi-modal prompts, we further propose Prompt-based Token Selection (PBTS) and Prompt-based Modality Fusion (PBMF) modules to achieve effective multi-modal feature fusion while minimizing background interference. Additionally, owing to the flexibility of the prompt exchange mechanism, our method is well suited to scenarios with missing modalities. Extensive evaluations on four widely used benchmark datasets demonstrate that our method achieves state-of-the-art performance, surpassing the previous best results by over 15% on the challenging MSVR310 dataset and by 6% on the RGBNT201 dataset.
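The abstract describes a prompt exchange mechanism in which learnable prompts attached to each modality stream are periodically swapped across streams so that they carry cross-modal context. The paper's exact exchange rule is not given here, so the following is only a minimal, hypothetical sketch (plain NumPy, cyclic rotation assumed; the encoder blocks that attend over tokens and prompts are omitted):

```python
import numpy as np

def exchange_prompts(modality_tokens, prompts):
    """Cyclically exchange prompts across modality token streams.

    modality_tokens: list of (num_tokens, dim) arrays, one per modality
                     (e.g. RGB, NIR, TIR).
    prompts:         list of (num_prompts, dim) arrays, one per modality.

    In a full model, an encoder block would first let each modality's
    prompts attend over that modality's tokens; here we only illustrate
    the exchange step: modality i receives the prompts that were
    previously attached to modality i-1, so each prompt set accumulates
    information from every modality over successive layers.
    """
    # Rotate the prompt list by one position (assumed exchange rule).
    rotated = prompts[-1:] + prompts[:-1]
    # Re-attach the exchanged prompts to each modality's token sequence.
    fused = [np.concatenate([tokens, p], axis=0)
             for tokens, p in zip(modality_tokens, rotated)]
    return fused, rotated
```

A toy call with three modalities of 4 tokens and 2 prompts each yields sequences of 6 tokens per modality, with modality 0 now holding the prompts formerly attached to modality 2.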

Original language: English
Pages (from-to): 2450-2462
Number of pages: 13
Journal: IEEE Transactions on Image Processing
Volume: 34
DOI
Publication status: Published - 2025
