
MiCA: Intra-Modal Integration and Cross-Modal Alignment Adapters for Parameter-Efficient Referring Image Segmentation

  • Yang Li
  • Zitong Feng
  • Tingrui Wang
  • Chenyu Wang
  • Xin Zhou
  • Northwestern Polytechnical University Xian

Research output: Contribution to journal › Article › peer-review

Abstract

Parameter-efficient transfer learning (PETL) has emerged as an effective strategy for fine-tuning large vision–language foundation models because it sharply reduces computational and memory overhead. However, existing PETL techniques underperform on dense prediction tasks that require fine-grained multimodal reasoning, such as referring image segmentation (RIS), because they lack mechanisms that simultaneously strengthen local perception and enforce precise cross-modal alignment. We present MiCA, a PETL framework with two lightweight, complementary adapters. The Global–Local Integrated Adapter (GLiA) enriches intra-modal features by coupling multi-scale depthwise-separable convolutions with a lightweight self-attention layer, capturing local context without sacrificing global dependencies. The Cross-Modal Alignment Adapter (CAA) explicitly aligns textual phrases with their corresponding visual regions, bridging the semantic gap between vision and language and enhancing multimodal reasoning. Experiments on three mainstream RIS benchmarks show that MiCA achieves the best accuracy while updating substantially fewer parameters than full fine-tuning and other PETL methods. Notably, with only 1.93% of the backbone parameters tunable, MiCA improves average accuracy by 0.8% across the three benchmarks compared to the baseline model.
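
As a rough illustration of why multi-scale depthwise-separable convolutions suit a lightweight adapter, the sketch below compares weight counts for standard and depthwise-separable convolution branches. It is not the authors' implementation: the channel width (64) and kernel sizes (3, 5, 7) are hypothetical values chosen for the example, the function names are ours, and biases are ignored.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def dw_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depthwise k x k conv (one filter per input channel)
    followed by a 1 x 1 pointwise conv that mixes channels (bias omitted)."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


def multi_scale_params(channels: int, kernel_sizes) -> int:
    """Total weights for parallel depthwise-separable branches,
    one branch per kernel size, each keeping the channel width."""
    return sum(dw_separable_params(channels, channels, k) for k in kernel_sizes)


# Hypothetical adapter sizes, chosen only for illustration.
channels = 64
kernel_sizes = (3, 5, 7)

standard_total = sum(conv_params(channels, channels, k) for k in kernel_sizes)
separable_total = multi_scale_params(channels, kernel_sizes)

print(f"standard branches:  {standard_total} weights")
print(f"separable branches: {separable_total} weights")
print(f"reduction factor:   {standard_total / separable_total:.1f}x")
```

Under these assumed sizes, the separable branches need roughly a twentieth of the weights of their standard counterparts, and the saving grows with kernel size because the depthwise term scales with the channel count rather than its square.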

Keywords

  • Adapter
  • Parameter-Efficient Transfer Learning
  • Referring Image Segmentation
