Leveraging Unlabeled Corpus for Arabic Dialect Identification

Mohammed Abdelmajeed; Jiangbin Zheng; Ahmed Murtadha; Youcef Nafa; Mohammed Abaker; Muhammad Pervez Akhter

doi:10.32604/cmc.2025.059870

School of Software

Research output: Contribution to journal › Article › Peer-reviewed

Abstract

Arabic Dialect Identification (DID) is a task in Natural Language Processing (NLP) that involves determining the dialect of a given piece of Arabic text. State-of-the-art solutions for DID are built on various deep neural networks that commonly learn sentence representations with respect to a given dialect. Despite their effectiveness, the performance of these solutions relies heavily on the number of labeled examples, which are labor-intensive to obtain and may not be readily available in real-world scenarios. To alleviate the burden of labeling data, this paper introduces a novel solution that leverages unlabeled corpora to boost performance on the DID task. Specifically, we design an architecture that enables learning the shared information between labeled and unlabeled texts through a gradient reversal layer. The key idea is to penalize the model for learning source dataset-specific features and thus enable it to capture common knowledge regardless of the label. Finally, we evaluate the proposed solution on benchmark datasets for DID. Our extensive experiments show that it performs significantly better, especially when labeled data is sparse. Compared with existing Pre-trained Language Models (PLMs), our approach achieves new state-of-the-art performance in the DID field. The code will be available on GitHub upon the paper’s acceptance.
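The gradient reversal idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `GradientReversal`, the strength parameter `lam`, the toy feature vector, and the logistic domain classifier are all assumptions made for illustration, and a real system would express this as an autograd operation inside a deep learning framework. The layer acts as the identity in the forward pass and flips the sign of the gradient in the backward pass, so the encoder is pushed away from features that reveal whether a text came from the labeled or the unlabeled set.

```python
import math


class GradientReversal:
    """Identity in the forward pass; multiplies gradients by -lam going backward."""

    def __init__(self, lam=1.0):
        self.lam = lam  # hypothetical reversal strength, not taken from the paper

    def forward(self, features):
        # Features pass through unchanged.
        return list(features)

    def backward(self, grad_output):
        # Reverse (and scale) the gradient flowing back to the encoder.
        return [-self.lam * g for g in grad_output]


def domain_loss_grad(features, weights, is_labeled):
    """Gradient of a binary cross-entropy domain loss with respect to the features.

    The toy domain classifier is logistic regression: p = sigmoid(w . f).
    """
    logit = sum(w * f for w, f in zip(weights, features))
    prob = 1.0 / (1.0 + math.exp(-logit))
    target = 1.0 if is_labeled else 0.0
    # d(BCE)/d(logit) = prob - target, then chain rule through logit = w . f
    return [(prob - target) * w for w in weights]


grl = GradientReversal(lam=0.5)
features = [0.2, -1.0, 0.7]        # toy encoder output (assumed values)
domain_weights = [1.0, 0.5, -2.0]  # toy domain-classifier weights (assumed values)

passed = grl.forward(features)
grad_at_classifier = domain_loss_grad(passed, domain_weights, is_labeled=True)
grad_at_encoder = grl.backward(grad_at_classifier)
```

In this sketch the domain classifier still receives the ordinary gradient and learns to tell the two sources apart, while the encoder receives `grad_at_encoder`, which points in the opposite direction.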

Original language	English
Pages (from-to)	3471-3491
Number of pages	21
Journal	Computers, Materials and Continua
Volume	83
Issue number	2
DOI	https://doi.org/10.32604/cmc.2025.059870
Publication status	Published - 2025

Cite this

@article{1e2effdc33814e7488e1dfe8d50b3a16,

title = "Leveraging Unlabeled Corpus for Arabic Dialect Identification",

abstract = "Arabic Dialect Identification (DID) is a task in Natural Language Processing (NLP) that involves determining the dialect of a given piece of Arabic text. State-of-the-art solutions for DID are built on various deep neural networks that commonly learn sentence representations with respect to a given dialect. Despite their effectiveness, the performance of these solutions relies heavily on the number of labeled examples, which are labor-intensive to obtain and may not be readily available in real-world scenarios. To alleviate the burden of labeling data, this paper introduces a novel solution that leverages unlabeled corpora to boost performance on the DID task. Specifically, we design an architecture that enables learning the shared information between labeled and unlabeled texts through a gradient reversal layer. The key idea is to penalize the model for learning source dataset-specific features and thus enable it to capture common knowledge regardless of the label. Finally, we evaluate the proposed solution on benchmark datasets for DID. Our extensive experiments show that it performs significantly better, especially when labeled data is sparse. Compared with existing Pre-trained Language Models (PLMs), our approach achieves new state-of-the-art performance in the DID field. The code will be available on GitHub upon the paper{\textquoteright}s acceptance.",

keywords = "Arabic dialect identification, bidirectional encoder representations from transformers, gradient reversal layer, natural language processing, pre-trained language models",

author = "Mohammed Abdelmajeed and Jiangbin Zheng and Ahmed Murtadha and Youcef Nafa and Mohammed Abaker and Akhter, {Muhammad Pervez}",

note = "Publisher Copyright: Copyright {\textcopyright} 2025 The Authors.",

year = "2025",

doi = "10.32604/cmc.2025.059870",

language = "English",

volume = "83",

pages = "3471--3491",

journal = "Computers, Materials and Continua",

issn = "1546-2218",

publisher = "Tech Science Press",

number = "2",

}

TY - JOUR

T1 - Leveraging Unlabeled Corpus for Arabic Dialect Identification

AU - Abdelmajeed, Mohammed

AU - Zheng, Jiangbin

AU - Murtadha, Ahmed

AU - Nafa, Youcef

AU - Abaker, Mohammed

AU - Akhter, Muhammad Pervez

PY - 2025

Y1 - 2025

N2 - Arabic Dialect Identification (DID) is a task in Natural Language Processing (NLP) that involves determining the dialect of a given piece of Arabic text. State-of-the-art solutions for DID are built on various deep neural networks that commonly learn sentence representations with respect to a given dialect. Despite their effectiveness, the performance of these solutions relies heavily on the number of labeled examples, which are labor-intensive to obtain and may not be readily available in real-world scenarios. To alleviate the burden of labeling data, this paper introduces a novel solution that leverages unlabeled corpora to boost performance on the DID task. Specifically, we design an architecture that enables learning the shared information between labeled and unlabeled texts through a gradient reversal layer. The key idea is to penalize the model for learning source dataset-specific features and thus enable it to capture common knowledge regardless of the label. Finally, we evaluate the proposed solution on benchmark datasets for DID. Our extensive experiments show that it performs significantly better, especially when labeled data is sparse. Compared with existing Pre-trained Language Models (PLMs), our approach achieves new state-of-the-art performance in the DID field. The code will be available on GitHub upon the paper’s acceptance.

KW - Arabic dialect identification

KW - bidirectional encoder representations from transformers

KW - gradient reversal layer

KW - natural language processing

KW - pre-trained language models

UR - http://www.scopus.com/inward/record.url?scp=105003321891&partnerID=8YFLogxK

U2 - 10.32604/cmc.2025.059870

DO - 10.32604/cmc.2025.059870

M3 - Article

AN - SCOPUS:105003321891

SN - 1546-2218

VL - 83

SP - 3471

EP - 3491

JO - Computers, Materials and Continua

JF - Computers, Materials and Continua

IS - 2

ER -
