Learning distributed representations of RNA and protein sequences and its application for predicting lncRNA-protein interactions

Hai Cheng Yi; Zhu Hong You; Li Cheng; Xi Zhou; Tong Hai Jiang; Xiao Li; Yan Bin Wang

doi:10.1016/j.csbj.2019.11.004

Research output: Contribution to journal › Article › peer-review

37 Scopus citations

Abstract

Long noncoding RNAs (lncRNAs) are ubiquitous in organisms and play crucial roles in a variety of biological processes and complex diseases. Emerging evidence suggests that lncRNAs interact with corresponding proteins to perform their regulatory functions. Therefore, identifying interacting lncRNA-protein pairs is the first step in understanding the function and mechanism of lncRNAs. Since it is time-consuming and expensive to determine lncRNA-protein interactions by high-throughput experiments, more robust and accurate computational methods need to be developed. In this study, we developed a new method, based on learning distributed representations of sequences, for potential lncRNA-Protein Interactions Prediction, named LPI-Pred, which is inspired by the similarity between natural language and biological sequences. More specifically, lncRNA and protein sequences were segmented into k-mers, which can be regarded as “words” in natural language processing. Then, we trained the RNA2vec and Pro2vec models using word2vec on human genome-wide lncRNA and protein sequences to learn distributed representations of RNA and protein. Next, the dimensionality of the resulting features was reduced by feature selection based on the Gini impurity measure. Finally, these discriminative features were used to train a Random Forest classifier to predict lncRNA-protein interactions. Five-fold cross-validation was adopted to evaluate the performance of LPI-Pred on three benchmark datasets: RPI369, RPI488 and RPI2241. The results demonstrate that LPI-Pred can be a useful tool to provide reliable guidance for biological research.
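
Two of the steps named in the abstract, k-mer segmentation and the Gini impurity measure, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `kmer_words` and `gini_impurity`, the overlapping stride of 1, the values of k, and the toy sequences are all assumptions made for illustration, since the abstract does not specify them.

```python
from __future__ import annotations

from collections import Counter


def kmer_words(sequence: str, k: int = 3, stride: int = 1) -> list[str]:
    """Split a sequence into k-mer "words".

    k and stride are illustrative assumptions; the abstract does not
    state which values LPI-Pred uses.
    """
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def gini_impurity(labels: list[int]) -> float:
    """Gini impurity, 1 - sum over classes of p_c squared, for a set of labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


# Toy fragments for illustration only, not real lncRNA or protein data.
rna_words = kmer_words("AUGGCUA", k=3)
protein_words = kmer_words("MKVLA", k=2)
```

The resulting word lists are the kind of "sentences" a word2vec-style model could be trained on. A tree-based classifier uses impurity decreases of this form to rank features, which is one common way a Gini-based feature selection can be realized.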

Original language	English
Pages (from-to)	20-26
Number of pages	7
Journal	Computational and Structural Biotechnology Journal
Volume	18
DOIs	https://doi.org/10.1016/j.csbj.2019.11.004
State	Published - 2020
Externally published	Yes

Keywords

Distribution representation
Natural language processing
RNA-protein interaction
Word2vec

Cite this

@article{22080cfe6be04c3c82529cb73fc2f460,

title = "Learning distributed representations of RNA and protein sequences and its application for predicting lncRNA-protein interactions",

abstract = "Long noncoding RNAs (lncRNAs) are ubiquitous in organisms and play crucial roles in a variety of biological processes and complex diseases. Emerging evidence suggests that lncRNAs interact with corresponding proteins to perform their regulatory functions. Therefore, identifying interacting lncRNA-protein pairs is the first step in understanding the function and mechanism of lncRNAs. Since it is time-consuming and expensive to determine lncRNA-protein interactions by high-throughput experiments, more robust and accurate computational methods need to be developed. In this study, we developed a new method, based on learning distributed representations of sequences, for potential lncRNA-Protein Interactions Prediction, named LPI-Pred, which is inspired by the similarity between natural language and biological sequences. More specifically, lncRNA and protein sequences were segmented into k-mers, which can be regarded as “words” in natural language processing. Then, we trained the RNA2vec and Pro2vec models using word2vec on human genome-wide lncRNA and protein sequences to learn distributed representations of RNA and protein. Next, the dimensionality of the resulting features was reduced by feature selection based on the Gini impurity measure. Finally, these discriminative features were used to train a Random Forest classifier to predict lncRNA-protein interactions. Five-fold cross-validation was adopted to evaluate the performance of LPI-Pred on three benchmark datasets: RPI369, RPI488 and RPI2241. The results demonstrate that LPI-Pred can be a useful tool to provide reliable guidance for biological research.",

keywords = "Distribution representation, Natural language processing, RNA-protein interaction, Word2vec",

author = "Yi, {Hai Cheng} and You, {Zhu Hong} and Cheng, Li and Zhou, Xi and Jiang, {Tong Hai} and Li, Xiao and Wang, {Yan Bin}",

note = "Publisher Copyright: {\textcopyright} 2019 The Authors",

year = "2020",

doi = "10.1016/j.csbj.2019.11.004",

language = "English",

volume = "18",

pages = "20--26",

journal = "Computational and Structural Biotechnology Journal",

issn = "2001-0370",

publisher = "Elsevier B.V.",

}

TY - JOUR

T1 - Learning distributed representations of RNA and protein sequences and its application for predicting lncRNA-protein interactions

AU - Yi, Hai Cheng

AU - You, Zhu Hong

AU - Cheng, Li

AU - Zhou, Xi

AU - Jiang, Tong Hai

AU - Li, Xiao

AU - Wang, Yan Bin

PY - 2020

Y1 - 2020

N2 - Long noncoding RNAs (lncRNAs) are ubiquitous in organisms and play crucial roles in a variety of biological processes and complex diseases. Emerging evidence suggests that lncRNAs interact with corresponding proteins to perform their regulatory functions. Therefore, identifying interacting lncRNA-protein pairs is the first step in understanding the function and mechanism of lncRNAs. Since it is time-consuming and expensive to determine lncRNA-protein interactions by high-throughput experiments, more robust and accurate computational methods need to be developed. In this study, we developed a new method, based on learning distributed representations of sequences, for potential lncRNA-Protein Interactions Prediction, named LPI-Pred, which is inspired by the similarity between natural language and biological sequences. More specifically, lncRNA and protein sequences were segmented into k-mers, which can be regarded as “words” in natural language processing. Then, we trained the RNA2vec and Pro2vec models using word2vec on human genome-wide lncRNA and protein sequences to learn distributed representations of RNA and protein. Next, the dimensionality of the resulting features was reduced by feature selection based on the Gini impurity measure. Finally, these discriminative features were used to train a Random Forest classifier to predict lncRNA-protein interactions. Five-fold cross-validation was adopted to evaluate the performance of LPI-Pred on three benchmark datasets: RPI369, RPI488 and RPI2241. The results demonstrate that LPI-Pred can be a useful tool to provide reliable guidance for biological research.

KW - Distribution representation

KW - Natural language processing

KW - RNA-protein interaction

KW - Word2vec

UR - http://www.scopus.com/inward/record.url?scp=85075967985&partnerID=8YFLogxK

U2 - 10.1016/j.csbj.2019.11.004

DO - 10.1016/j.csbj.2019.11.004

M3 - Article

AN - SCOPUS:85075967985

SN - 2001-0370

VL - 18

SP - 20

EP - 26

JO - Computational and Structural Biotechnology Journal

JF - Computational and Structural Biotechnology Journal

ER -
