Abstract
Image–text matching is an essential area of multimedia research. However, images often contain richer information than text, and representing an image with a single vector may fail to fully capture its semantics, leading to suboptimal performance in cross-modal matching tasks. To address this limitation, we propose a CLIP-based knowledge projector network that encodes an image into a set of embeddings. These embeddings capture different semantics of an image, guided by prior knowledge from the large vision-language pretrained model CLIP (Contrastive Language-Image Pre-Training). To ensure that the generated slot features remain aligned with global semantics, we design an adaptive weighted fusion module that incorporates global features into the slot representations. For the test phase, we present a similarity calculation method that is both effective and more explainable than those of existing fine-grained image–text matching methods. Experimental results demonstrate the effectiveness of the proposed framework, with improvements of at least 7% in R@1 over CLIP on image retrieval on the MSCOCO and Flickr30K datasets.
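As a rough illustration of the design described above, the following PyTorch sketch shows a slot-attention-style projector that turns CLIP patch features into a set of slot embeddings, an adaptive weighted fusion of the global CLIP image embedding into each slot, and one possible slot-to-text similarity. All module names, hyperparameters, and the exact fusion and similarity formulas here are assumptions for illustration, not the paper's actual implementation.

```python
# Hedged sketch only: hypothetical module/parameter names; the paper's actual
# architecture, fusion rule, and similarity measure may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeProjector(nn.Module):
    """Projects CLIP patch features into a set of slot embeddings, then fuses
    the CLIP global image feature into each slot with adaptive weights."""
    def __init__(self, dim=512, num_slots=8, iters=3):
        super().__init__()
        self.num_slots = num_slots
        self.iters = iters
        self.slots_init = nn.Parameter(torch.randn(1, num_slots, dim))
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.gate = nn.Linear(2 * dim, 1)  # adaptive weight for global fusion

    def forward(self, patch_feats, global_feat):
        # patch_feats: (B, N, D) CLIP patch tokens; global_feat: (B, D) CLIP image embedding
        B, N, D = patch_feats.shape
        slots = self.slots_init.expand(B, -1, -1)
        k, v = self.to_k(patch_feats), self.to_v(patch_feats)
        for _ in range(self.iters):  # slot-attention-style iterative refinement
            q = self.to_q(slots)
            attn = F.softmax(q @ k.transpose(1, 2) / D ** 0.5, dim=1)  # slots compete per patch
            attn = attn / attn.sum(dim=-1, keepdim=True)               # weighted mean over patches
            updates = attn @ v                                         # (B, S, D)
            slots = self.gru(updates.reshape(-1, D), slots.reshape(-1, D)).view(B, -1, D)
        # adaptive weighted fusion of the global feature into every slot
        g = global_feat.unsqueeze(1).expand(-1, self.num_slots, -1)
        w = torch.sigmoid(self.gate(torch.cat([slots, g], dim=-1)))    # (B, S, 1)
        return w * slots + (1 - w) * g                                 # fused slot embeddings

def image_text_similarity(slot_feats, text_feat):
    # One explainable option: cosine similarity between the text embedding and
    # each slot, taking the best-matching slot as the image-text score.
    slots = F.normalize(slot_feats, dim=-1)              # (B, S, D)
    text = F.normalize(text_feat, dim=-1).unsqueeze(-1)  # (B, D, 1)
    return (slots @ text).squeeze(-1).max(dim=-1).values # (B,)
```

Because the score is the maximum over per-slot similarities, the matching slot itself indicates which image semantics drove the decision, which is one way the similarity can be made explainable.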
| Original language | English |
|---|---|
| Article number | 104357 |
| Journal | Information Processing and Management |
| Volume | 63 |
| Issue number | 1 |
| DOIs | |
| State | Published - Jan 2026 |
Keywords
- Image–text matching
- Multimedia analysis
- Slot attention