DIA: Deriving linguistic information from auxiliary languages for remote sensing image captioning

Research output: Contribution to journal › Article › peer-review

3 Scopus citations

Abstract

Remote sensing image captioning (RSIC) is a cross-modal task aimed at describing scene categories, object classes, and their spatial relationships in remote sensing images using natural language. Existing methods typically train models in a single language, neglecting the linguistically enriching information that arises from differences in syntactic structure and from diverse expressions of the same objects and scenes. This information, present in cross-linguistic annotated data, can significantly enhance language perception and enrich training data. To verify its effectiveness, we propose an auxiliary language-enhanced network, DIA, which leverages linguistic information from auxiliary languages to improve the quality and fluency of target-language generation. DIA consists of a shared visual feature extractor, a target language generator (TLG), and an auxiliary language generator (ALG). The shared visual feature extractor integrates the Linguistic-Irrelevant Feature Enrichment (LiFE) module, while a Linguistic Bridge connects the target and auxiliary language generators. The LiFE module employs linguistic-irrelevant feature extraction and multi-view attention to extract precise visual features, enriching the representations while minimizing language bias; multi-view attention balances deep semantic expressions against linguistic-irrelevant features. The Linguistic Bridge establishes an interactive pathway between the TLG and the ALG, enabling the TLG to learn from the ALG's language modeling capabilities. This interaction allows the TLG to handle complex visual features, improving language generation performance. Extensive experiments demonstrate significant performance improvements on the UCM, Sydney, RSICD, and NWPU datasets. Specifically, on the UCM dataset, BLEU-4 is improved by 5.06% and CIDEr by 16.86%.
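The auxiliary-language setup described in the abstract can be sketched as a multi-task objective: a shared visual feature extractor feeds both a target-language head and an auxiliary-language head, so gradients from both captions shape the shared features. The sketch below is a minimal NumPy illustration of that training signal, not the authors' implementation; the linear projections, vocabulary sizes, and `aux_weight` hyperparameter are all illustrative assumptions, and the LiFE module and Linguistic Bridge are not modeled here.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, target_ids):
    # Mean negative log-likelihood of the reference tokens.
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(target_ids)), target_ids] + 1e-12))

# Shared visual feature extractor (stand-in: one linear projection of image features).
W_shared = rng.normal(size=(64, 32))

# Two language heads over the SAME shared features:
# target language generator (TLG) and auxiliary language generator (ALG).
W_tlg = rng.normal(size=(32, 100))  # target-language vocabulary of 100 tokens (assumed)
W_alg = rng.normal(size=(32, 100))  # auxiliary-language vocabulary of 100 tokens (assumed)

def joint_loss(image_feats, tgt_tokens, aux_tokens, aux_weight=0.5):
    """Joint objective: target-language caption loss plus a weighted
    auxiliary-language caption loss computed from shared visual features."""
    shared = image_feats @ W_shared          # language-shared visual representation
    loss_tgt = cross_entropy(shared @ W_tlg, tgt_tokens)
    loss_aux = cross_entropy(shared @ W_alg, aux_tokens)
    return loss_tgt + aux_weight * loss_aux

# Toy batch: 8 "visual tokens" with paired target- and auxiliary-language references.
feats = rng.normal(size=(8, 64))
tgt = rng.integers(0, 100, size=8)
aux = rng.integers(0, 100, size=8)
print(joint_loss(feats, tgt, aux))
```

Setting `aux_weight=0` recovers single-language training; a positive weight is what lets the auxiliary captions regularize the shared visual features.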

Original language: English
Article number: 112209
Journal: Pattern Recognition
Volume: 171
DOIs
State: Published - Mar 2026

Keywords

  • Auxiliary language-enhanced network
  • Linguistic bridge
  • Linguistic-enhancing information
  • Remote sensing image captioning

