Loanword Identification in Low-Resource Languages with Minimal Supervision

Chenggang Mi, Lei Xie, Yanning Zhang

Research output: Contribution to journal, Article, peer-reviewed

11 citations (Scopus)

Abstract

Bilingual resources play an important role in many natural language processing tasks, especially in cross-lingual scenarios. However, building such resources is expensive and time-consuming. Lexical borrowing happens in almost every language. This motivates us to detect loanwords effectively and to use "loanword (in the recipient language)"-"donor word (in the donor language)" pairs to extend the bilingual resources available for NLP tasks in low-resource languages. In this article, we propose a novel method to identify loanwords in Uyghur. The most important advantage of this method is that the model relies only on large monolingual corpora and a small amount of annotated data. Our loanword identification model has two parts: loanword candidate generation and loanword prediction. In the first part, we use two large-scale monolingual corpora and a small bilingual dictionary to train a cross-lingual embedding model. Since semantically unrelated words usually cannot be treated as loanword pairs, a loanword candidate list is generated from this model and a Uyghur word list. In the second part, we predict loanwords from these candidates with a log-linear model that integrates several features, such as pronunciation similarity, part-of-speech tags, and hybrid language modeling. To evaluate the effectiveness of the proposed method, we conduct two types of experiments: loanword identification and out-of-vocabulary (OOV) translation. Experimental results show that (1) the proposed method achieves significant F1 improvements over other models in all four loanword identification tasks in Uyghur, and (2) after the existing translation models are extended with the loanword identification results, OOV rates in several language pairs are reduced significantly and translation performance improves.
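The two-part pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy embedding vectors, the romanized word strings, the feature names (`semantic`, `phonetic`), the weights, the similarity threshold, and the use of normalized edit distance as a stand-in for pronunciation similarity are all assumptions made for illustration. Part-of-speech and language-model features are omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def edit_distance(a, b):
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def phonetic_similarity(a, b):
    """Assumed proxy for pronunciation similarity: 1 - normalized edit distance."""
    longest = max(len(a), len(b))
    return 1.0 - edit_distance(a, b) / longest if longest else 1.0

def generate_candidates(word_vec, donor_vecs, threshold=0.5):
    """Part 1: keep donor words that are semantically close in the shared space."""
    return [w for w, v in donor_vecs.items() if cosine(word_vec, v) >= threshold]

def loglinear_probs(features_by_candidate, weights):
    """Part 2: log-linear model, p(c) proportional to exp(sum_k w_k * f_k(c))."""
    scores = {c: sum(weights[k] * f[k] for k in weights)
              for c, f in features_by_candidate.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

# Toy data (hypothetical): one recipient-language word and three donor words
# assumed to live in a shared cross-lingual embedding space.
word, word_vec = "kompyuter", [0.9, 0.1, 0.0]
donor_vecs = {
    "computer": [0.8, 0.2, 0.1],
    "keyboard": [0.6, 0.5, 0.1],
    "river": [0.0, 0.1, 0.9],
}
weights = {"semantic": 1.0, "phonetic": 2.0}  # assumed, not learned

candidates = generate_candidates(word_vec, donor_vecs)
features = {
    c: {"semantic": cosine(word_vec, donor_vecs[c]),
        "phonetic": phonetic_similarity(word, c)}
    for c in candidates
}
probs = loglinear_probs(features, weights)
best = max(probs, key=probs.get)
print(candidates, best)
```

In this sketch the embedding filter removes semantically unrelated donor words before any string comparison is made, which mirrors the abstract's rationale for generating candidates first. In a real system the weights would be estimated from the small annotated set instead of being fixed by hand.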

Original language: English
Article number: 3374212
Journal: ACM Transactions on Asian and Low-Resource Language Information Processing
Volume: 19
Issue number: 3
DOI
Publication status: Published - 20 Feb 2020
