Enhancing Unimodal Features Matters: A Multimodal Framework for Building Extraction

Xiaofeng Shi; Junyu Gao; Yuan Yuan

doi:10.1109/TGRS.2024.3392631

School of Artificial Intelligence, Optics and Electronics (iOPEN)

Northwestern Polytechnical University, Xi'an

Research output: Contribution to journal › Article › Peer-reviewed

5 citations (Scopus)

Abstract

In recent years, deep learning and multimodal data have substantially advanced building extraction models. However, prevailing multimodal methods struggle with two challenges: 1) modal laziness: the training error is minimized before the model has learned extensive unimodal patterns and 2) modal imbalance: the backpropagation process is easily dominated by a single modality. As a result, unimodal feature learning is insufficient, which limits model performance on the intricate foreground and background contexts surrounding buildings. In this article, we address this problem from the perspectives of algorithm and model evaluation. At the algorithmic level, we propose a unimodal feature enhancement (UFE) framework. UFE is model-agnostic and comprises two components: adaptive gradient enhancement (AGE) for modal laziness and consistency constraint loss (CCL) for modal imbalance. AGE dynamically modulates the original gradient by monitoring the representation effects of unimodal features and multimodal fusion features. CCL imposes mutual constraints on the different modal branches at the semantic level to reconcile the optimization process. At the model evaluation level, a new metric, the unimodal utilization ratio (UUR), is presented to assess models by how effectively they learn unimodal features. Experimental results on two building extraction datasets, including results for variants of UUR, demonstrate a substantial performance improvement from UFE. Moreover, UFE adapts well when integrated with various model components and generalizes to other multimodal image-related tasks.

Original language	English
Article number	5622013
Pages (from-to)	1-13
Number of pages	13
Journal	IEEE Transactions on Geoscience and Remote Sensing
Volume	62
DOI	https://doi.org/10.1109/TGRS.2024.3392631
Publication status	Published - 2024

Cite this

@article{9b63f8d423ba4e2eae11adf7cd3e8da2,

title = "Enhancing Unimodal Features Matters: A Multimodal Framework for Building Extraction",

abstract = "In recent years, deep learning and multimodal data have substantially propelled the development of building extraction models. However, prevailing multimodal methods are difficult to cope with two challenges: 1) modal laziness: the training error is minimized before the model has learned extensive unimodal patterns and 2) modal imbalance: the backpropagation process is easily dominated by a certain modality. As a result, the unimodal features learning is insufficient, leading to limited performance of the model when dealing with the intricate foreground and background contexts surrounding the buildings. In this article, we deal with this problem from the perspective of algorithm and model evaluation. At the algorithmic level, we propose a unimodal feature enhancement (UFE) framework. Specifically, UFE is model-agnostic, comprising two distinct components: adaptive gradient enhancement (AGE) for modal laziness and consistency constraint loss (CCL) for modal imbalance. AGE dynamically modulates the original gradient by monitoring the representation effects of unimodal features and multimodal fusion features. CCL imposes mutual constraints on diverse modal branches at the semantic level to reconcile the optimization process. At the model evaluation level, a new metric, named unimodal utilization ratio (UUR), is presented to assess models through the learning efficacy of unimodal features. The experimental results including the variants of UUR on two building extraction datasets demonstrate a substantial performance improvement by UFE. Moreover, UFE also exhibits its adaptability when integrated with various model components and its generalization on other multimodal image-related tasks.",

keywords = "Building extraction, modal imbalance, modal laziness, multimodal fusion, unimodal feature enhancement (UFE)",

author = "Xiaofeng Shi and Junyu Gao and Yuan Yuan",

note = "Publisher Copyright: {\textcopyright} 1980-2012 IEEE.",

year = "2024",

doi = "10.1109/TGRS.2024.3392631",

language = "English",

volume = "62",

pages = "1--13",

journal = "IEEE Transactions on Geoscience and Remote Sensing",

issn = "0196-2892",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Enhancing Unimodal Features Matters

T2 - A Multimodal Framework for Building Extraction

AU - Shi, Xiaofeng

AU - Gao, Junyu

AU - Yuan, Yuan

PY - 2024

Y1 - 2024

AB - In recent years, deep learning and multimodal data have substantially propelled the development of building extraction models. However, prevailing multimodal methods are difficult to cope with two challenges: 1) modal laziness: the training error is minimized before the model has learned extensive unimodal patterns and 2) modal imbalance: the backpropagation process is easily dominated by a certain modality. As a result, the unimodal features learning is insufficient, leading to limited performance of the model when dealing with the intricate foreground and background contexts surrounding the buildings. In this article, we deal with this problem from the perspective of algorithm and model evaluation. At the algorithmic level, we propose a unimodal feature enhancement (UFE) framework. Specifically, UFE is model-agnostic, comprising two distinct components: adaptive gradient enhancement (AGE) for modal laziness and consistency constraint loss (CCL) for modal imbalance. AGE dynamically modulates the original gradient by monitoring the representation effects of unimodal features and multimodal fusion features. CCL imposes mutual constraints on diverse modal branches at the semantic level to reconcile the optimization process. At the model evaluation level, a new metric, named unimodal utilization ratio (UUR), is presented to assess models through the learning efficacy of unimodal features. The experimental results including the variants of UUR on two building extraction datasets demonstrate a substantial performance improvement by UFE. Moreover, UFE also exhibits its adaptability when integrated with various model components and its generalization on other multimodal image-related tasks.

KW - Building extraction

KW - modal imbalance

KW - modal laziness

KW - multimodal fusion

KW - unimodal feature enhancement (UFE)

UR - http://www.scopus.com/inward/record.url?scp=85191347237&partnerID=8YFLogxK

U2 - 10.1109/TGRS.2024.3392631

DO - 10.1109/TGRS.2024.3392631

M3 - Article

AN - SCOPUS:85191347237

SN - 0196-2892

VL - 62

SP - 1

EP - 13

JO - IEEE Transactions on Geoscience and Remote Sensing

JF - IEEE Transactions on Geoscience and Remote Sensing

M1 - 5622013

ER -
