Abstract
The goal of this study is to address the combined problem of noisy labels and imbalanced samples in text classification. Current studies generally adopt data sampling or cleaning during model learning, which leads to a partial loss of information. To this end, this paper introduces a weakly supervised text classification framework, dubbed WeStcoin, which aims to learn a clean hierarchical attention network directly from the given noisy-labeled imbalanced samples. Specifically, WeStcoin first vectorizes the given texts to generate a contextualized corpus, on which a pseudo-label vector is computed by extracting seed words from each class and a predicted label vector is obtained by a hierarchical attention network. Based on the pseudo and predicted label vectors, we learn a cost-sensitive matrix that projects the concatenated label vectors into the given label space. WeStcoin is trained iteratively to reduce the difference between its output labels and the given noisy labels by updating the network parameters, the set of seed words, and the cost-sensitive matrix, respectively. Finally, extensive experiments on short-text classification show that WeStcoin achieves a significant improvement over state-of-the-art models on imbalanced samples with noisy labels. Moreover, WeStcoin is more robust than the compared methods and provides potential explanations for noisy labels.
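The label-combination step described in the abstract can be illustrated with a minimal sketch. The helper names (`pseudo_label_vector`, `combine_labels`), the seed-word lists, the uniform fallback, and the identity-like initialization of the cost-sensitive matrix `W` are all assumptions for illustration, not the paper's actual implementation; the real predicted label vector would come from the hierarchical attention network, which is replaced here by a fixed toy distribution.

```python
import numpy as np

def pseudo_label_vector(text, seed_words):
    """Count seed-word hits per class and normalize to a distribution.

    Falls back to a uniform distribution when no seed word matches
    (an assumed convention, not specified in the abstract).
    """
    tokens = text.lower().split()
    counts = np.array(
        [sum(tokens.count(w) for w in words) for words in seed_words.values()],
        dtype=float,
    )
    total = counts.sum()
    if total == 0:
        return np.full(len(seed_words), 1.0 / len(seed_words))
    return counts / total

def combine_labels(pseudo, predicted, W):
    """Project the concatenated pseudo and predicted label vectors
    into the label space via the cost-sensitive matrix W, then
    normalize with a softmax to obtain a label distribution."""
    z = np.concatenate([pseudo, predicted])   # shape (2C,)
    scores = W @ z                            # shape (C,)
    e = np.exp(scores - scores.max())         # numerically stable softmax
    return e / e.sum()

# Toy usage with C = 2 classes (hypothetical seed words).
seeds = {"sports": ["game", "team"], "tech": ["software", "chip"]}
p = pseudo_label_vector("the team won the game", seeds)
q = np.array([0.4, 0.6])                      # stand-in for the network's output
# Identity-like initialization: W averages the two label views.
W = np.eye(2, 4) + np.eye(2, 4, k=2)
out = combine_labels(p, q, W)                 # final label distribution
```

In training, `W` would be updated jointly with the network parameters and the seed-word sets; here it is fixed only to show the projection step.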
Original language | English |
---|---|
Article number | 128617 |
Journal | Neurocomputing |
Volume | 610 |
DOI | |
Publication status | Published - 28 Dec 2024 |