TY - JOUR
T1 - Leveraging Unlabeled Corpus for Arabic Dialect Identification
AU - Abdelmajeed, Mohammed
AU - Zheng, Jiangbin
AU - Murtadha, Ahmed
AU - Nafa, Youcef
AU - Abaker, Mohammed
AU - Akhter, Muhammad Pervez
N1 - Publisher Copyright:
Copyright © 2025 The Authors.
PY - 2025
Y1 - 2025
AB - Arabic Dialect Identification (DID) is a Natural Language Processing (NLP) task that involves determining the dialect of a given piece of Arabic text. State-of-the-art solutions for DID are built on deep neural networks that learn sentence representations tailored to a given dialect. Despite their effectiveness, these solutions' performance relies heavily on the number of labeled examples, which are labor-intensive to obtain and may not be readily available in real-world scenarios. To alleviate the burden of labeling data, this paper introduces a novel solution that leverages unlabeled corpora to boost performance on the DID task. Specifically, we design an architecture that learns the information shared between labeled and unlabeled texts through a gradient reversal layer. The key idea is to penalize the model for learning source dataset-specific features, thus enabling it to capture common knowledge regardless of the label. Finally, we evaluate the proposed solution on benchmark DID datasets. Extensive experiments show that it performs significantly better, especially with sparse labeled data. Compared with existing Pre-trained Language Models (PLMs), our approach achieves new state-of-the-art performance in the DID field. The code will be made available on GitHub upon the paper's acceptance.
KW - Arabic dialect identification
KW - bidirectional encoder representations from transformers
KW - gradient reversal layer
KW - natural language processing
KW - pre-trained language models
UR - http://www.scopus.com/inward/record.url?scp=105003321891&partnerID=8YFLogxK
U2 - 10.32604/cmc.2025.059870
DO - 10.32604/cmc.2025.059870
M3 - Article
AN - SCOPUS:105003321891
SN - 1546-2218
VL - 83
SP - 3471
EP - 3491
JO - Computers, Materials & Continua
JF - Computers, Materials & Continua
IS - 2
ER -
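
Note: the abstract describes an adversarial architecture in which a gradient reversal layer (GRL) penalizes the encoder for learning source dataset-specific features. The paper's own code is only promised on GitHub upon acceptance, so the following is a minimal, hypothetical PyTorch sketch of that idea in the style of Ganin & Lempitsky's GRL, not the authors' implementation. All names (GradReverse, AdversarialDIDModel, dialect_head, dataset_head, lambd) are illustrative, and the encoder is assumed to be a HuggingFace-style BERT model exposing last_hidden_state.

```python
import torch
import torch.nn as nn
from torch.autograd import Function


class GradReverse(Function):
    """Identity map in the forward pass; scales and negates the gradient
    in the backward pass (the gradient reversal layer)."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The negated gradient flows back into the encoder, penalizing
        # features that help the dataset discriminator; None is the
        # (non-existent) gradient for the lambd scalar.
        return -ctx.lambd * grad_output, None


class AdversarialDIDModel(nn.Module):
    """Hypothetical two-head model: a dialect classifier trained on labeled
    text, and a dataset discriminator (labeled vs. unlabeled corpus) placed
    behind the gradient reversal layer."""

    def __init__(self, encoder, hidden_size, num_dialects, lambd=1.0):
        super().__init__()
        self.encoder = encoder  # assumption: HuggingFace-style BERT encoder
        self.lambd = lambd
        self.dialect_head = nn.Linear(hidden_size, num_dialects)
        self.dataset_head = nn.Linear(hidden_size, 2)  # labeled vs. unlabeled

    def forward(self, input_ids, attention_mask):
        # [CLS] token representation as the sentence embedding.
        h = self.encoder(
            input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]
        dialect_logits = self.dialect_head(h)
        dataset_logits = self.dataset_head(GradReverse.apply(h, self.lambd))
        return dialect_logits, dataset_logits
```

In such a setup, the dialect head's cross-entropy loss would be computed only on labeled examples, while the dataset head's loss would be computed on both labeled and unlabeled batches; because the GRL reverses the discriminator's gradient before it reaches the encoder, minimizing the combined loss pushes the encoder toward representations that are useful for dialect identification yet indistinguishable across the two corpora.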