A fast and high performance multiple data integration algorithm for identifying human disease genes

Bolin Chen; Min Li; Jianxin Wang; Xuequn Shang; Fang Xiang Wu

doi:10.1186/1755-8794-8-S3-S2

School of Computer Science

Research output: Contribution to journal › Article › peer-review

42 citations (Scopus)

Abstract

Background: Integrating multiple data sources is indispensable for improving disease gene identification, both because disease genes associated with similar genetic diseases tend to lie close to each other in various biological networks and because gene-disease associations are complex. Although various algorithms have been proposed to identify disease genes, their prediction performance and computational time still need to be improved. Results: In this study, we propose a fast and high performance multiple data integration algorithm for identifying human disease genes. A posterior probability of each candidate gene being associated with individual diseases is calculated using a Bayesian analysis method and a binary logistic regression model. Two prior probability estimation strategies and two feature vector construction methods are developed to test the performance of the proposed algorithm. Conclusions: The proposed algorithm not only generates predictions with high AUC scores but also runs very fast. When only a single PPI network is employed, the AUC score is 0.769 using F₂ as feature vectors, and the average running time for each leave-one-out experiment is only around 1.5 seconds. When three biological networks are integrated, the AUC score using F₃ as feature vectors increases to 0.830, and the average running time for each leave-one-out experiment is only about 12.54 seconds. This is better than many existing algorithms.
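As a rough illustration of how a prior probability and a binary logistic regression model can be combined into a posterior probability, the following minimal Python sketch adds the log prior odds to a linear score that plays the role of a log-likelihood ratio. This is a generic textbook construction, not the formulation used in the paper: the function names, the feature values, and the weights are all hypothetical, and the abstract does not specify how the features F₂ and F₃ or the priors are actually defined.

```python
import math


def logistic(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def posterior_probability(prior, features, weights, bias=0.0):
    """Combine a prior with a logistic-regression style evidence term.

    Assumption (not from the paper): the linear score
    bias + sum(w_i * x_i) is treated as a log-likelihood ratio, so that
    logit(posterior) = logit(prior) + score.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    log_prior_odds = math.log(prior / (1.0 - prior))
    score = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(log_prior_odds + score)


# Hypothetical candidate gene: prior 0.2, two made-up network features.
p = posterior_probability(0.2, [1.0, 0.5], [2.0, 1.0])
print(f"posterior = {p:.3f}")
```

With all features set to zero and no bias, the sketch returns the prior unchanged, which is a convenient sanity check; positive evidence moves the posterior above the prior.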

Original language	English
Article number	S2
Journal	BMC Medical Genomics
Volume	8
Issue number	3
DOI	https://doi.org/10.1186/1755-8794-8-S3-S2
Publication status	Published - 23 Sep 2015

Cite this

@article{a29a468b48cb43f9be1287da79c4cf33,

title = "A fast and high performance multiple data integration algorithm for identifying human disease genes",

abstract = "Background: Integrating multiple data sources is indispensable in improving disease gene identification. It is not only due to the fact that disease genes associated with similar genetic diseases tend to lie close with each other in various biological networks, but also due to the fact that gene-disease associations are complex. Although various algorithms have been proposed to identify disease genes, their prediction performances and the computational time still should be further improved. Results: In this study, we propose a fast and high performance multiple data integration algorithm for identifying human disease genes. A posterior probability of each candidate gene associated with individual diseases is calculated by using a Bayesian analysis method and a binary logistic regression model. Two prior probability estimation strategies and two feature vector construction methods are developed to test the performance of the proposed algorithm. Conclusions: The proposed algorithm is not only generated predictions with high AUC scores, but also runs very fast. When only a single PPI network is employed, the AUC score is 0.769 by using F 2 as feature vectors. The average running time for each leave-one-out experiment is only around 1.5 seconds. When three biological networks are integrated, the AUC score using F 3 as feature vectors increases to 0.830, and the average running time for each leave-one-out experiment takes only about 12.54 seconds. It is better than many existing algorithms.",

keywords = "Bayesian analysis, disease gene, feature vector, logistic regression, multiple data integration",

author = "Bolin Chen and Min Li and Jianxin Wang and Xuequn Shang and Wu, {Fang Xiang}",

note = "Publisher Copyright: {\textcopyright} 2015 Chen et al.",

year = "2015",

month = sep,

day = "23",

doi = "10.1186/1755-8794-8-S3-S2",

language = "English",

volume = "8",

journal = "BMC Medical Genomics",

issn = "1755-8794",

publisher = "BioMed Central Ltd",

number = "3",

}

TY - JOUR

T1 - A fast and high performance multiple data integration algorithm for identifying human disease genes

AU - Chen, Bolin

AU - Li, Min

AU - Wang, Jianxin

AU - Shang, Xuequn

AU - Wu, Fang Xiang

PY - 2015/9/23

Y1 - 2015/9/23

AB - Background: Integrating multiple data sources is indispensable in improving disease gene identification. It is not only due to the fact that disease genes associated with similar genetic diseases tend to lie close with each other in various biological networks, but also due to the fact that gene-disease associations are complex. Although various algorithms have been proposed to identify disease genes, their prediction performances and the computational time still should be further improved. Results: In this study, we propose a fast and high performance multiple data integration algorithm for identifying human disease genes. A posterior probability of each candidate gene associated with individual diseases is calculated by using a Bayesian analysis method and a binary logistic regression model. Two prior probability estimation strategies and two feature vector construction methods are developed to test the performance of the proposed algorithm. Conclusions: The proposed algorithm is not only generated predictions with high AUC scores, but also runs very fast. When only a single PPI network is employed, the AUC score is 0.769 by using F 2 as feature vectors. The average running time for each leave-one-out experiment is only around 1.5 seconds. When three biological networks are integrated, the AUC score using F 3 as feature vectors increases to 0.830, and the average running time for each leave-one-out experiment takes only about 12.54 seconds. It is better than many existing algorithms.

KW - Bayesian analysis

KW - disease gene

KW - feature vector

KW - logistic regression

KW - multiple data integration

UR - http://www.scopus.com/inward/record.url?scp=84962383376&partnerID=8YFLogxK

U2 - 10.1186/1755-8794-8-S3-S2

DO - 10.1186/1755-8794-8-S3-S2

M3 - Article

C2 - 26399620

AN - SCOPUS:84962383376

SN - 1755-8794

VL - 8

JO - BMC Medical Genomics

JF - BMC Medical Genomics

IS - 3

M1 - S2

ER -
