TY - JOUR
T1 - Using feature selection and Bayesian network identify cancer subtypes based on proteomic data
AU - Wang, Yangyang
AU - Gao, Xiaoguang
AU - Ru, Xinxin
AU - Sun, Pengzhan
AU - Wang, Jihan
N1 - Publisher Copyright:
© 2023
PY - 2023/5/30
Y1 - 2023/5/30
N2 - The Cancer Proteome Atlas (TCPA) project collects reverse-phase protein array (RPPA)-based proteome datasets from nearly 8000 samples across 32 cancer types. This study aims to investigate the pan-cancer proteome signature and identify cancer subtypes of glioma, kidney cancer, and lung cancer based on TCPA data. We first visualized the tumor clustering models using t-distributed stochastic neighbor embedding (t-SNE) and a bi-clustering heatmap. Then, three feature selection methods (pyHSICLasso, XGBoost, and Random Forest) were applied to select protein features for classifying cancer subtypes in the training dataset, and the LibSVM algorithm was employed to test classification accuracy in the validation dataset. Clustering analysis revealed that different kinds of tumors have relatively distinct proteomic profiles based on tissue of origin. We identified 20, 10, and 20 protein features with the highest accuracies in classifying subtypes of glioma, kidney cancer, and lung cancer, respectively. The predictive abilities of the selected proteins were confirmed by receiver operating characteristic (ROC) analysis. Finally, a Bayesian network was used to explore the protein biomarkers that have direct causal relationships with cancer subtypes. Overall, we highlight the theoretical and technical applications of machine learning-based feature selection approaches in the analysis of high-throughput biological data, particularly for cancer biomarker research. Significance: Functional proteomics is a powerful approach for characterizing cell signaling pathways and understanding their phenotypic effects on cancer development. The TCPA database provides a platform to explore and analyze TCGA pan-cancer RPPA-based protein expression. With the advent of RPPA technology, the availability of high-throughput data on the TCPA platform has made it possible to use machine learning methods to identify protein biomarkers and further differentiate cancer subtypes based on proteomic data. In this study, we highlight the role of feature selection and Bayesian networks in discovering protein biomarkers for classifying cancer subtypes based on functional proteomic data. The application of machine learning methods to the analysis of high-throughput biological data, particularly in cancer biomarker research, has potential clinical value for developing individualized treatment strategies.
AB - The Cancer Proteome Atlas (TCPA) project collects reverse-phase protein array (RPPA)-based proteome datasets from nearly 8000 samples across 32 cancer types. This study aims to investigate the pan-cancer proteome signature and identify cancer subtypes of glioma, kidney cancer, and lung cancer based on TCPA data. We first visualized the tumor clustering models using t-distributed stochastic neighbor embedding (t-SNE) and a bi-clustering heatmap. Then, three feature selection methods (pyHSICLasso, XGBoost, and Random Forest) were applied to select protein features for classifying cancer subtypes in the training dataset, and the LibSVM algorithm was employed to test classification accuracy in the validation dataset. Clustering analysis revealed that different kinds of tumors have relatively distinct proteomic profiles based on tissue of origin. We identified 20, 10, and 20 protein features with the highest accuracies in classifying subtypes of glioma, kidney cancer, and lung cancer, respectively. The predictive abilities of the selected proteins were confirmed by receiver operating characteristic (ROC) analysis. Finally, a Bayesian network was used to explore the protein biomarkers that have direct causal relationships with cancer subtypes. Overall, we highlight the theoretical and technical applications of machine learning-based feature selection approaches in the analysis of high-throughput biological data, particularly for cancer biomarker research. Significance: Functional proteomics is a powerful approach for characterizing cell signaling pathways and understanding their phenotypic effects on cancer development. The TCPA database provides a platform to explore and analyze TCGA pan-cancer RPPA-based protein expression. With the advent of RPPA technology, the availability of high-throughput data on the TCPA platform has made it possible to use machine learning methods to identify protein biomarkers and further differentiate cancer subtypes based on proteomic data. In this study, we highlight the role of feature selection and Bayesian networks in discovering protein biomarkers for classifying cancer subtypes based on functional proteomic data. The application of machine learning methods to the analysis of high-throughput biological data, particularly in cancer biomarker research, has potential clinical value for developing individualized treatment strategies.
KW - Bayesian network (BN)
KW - Cancer subtype
KW - Feature selection
KW - The Cancer Proteome Atlas (TCPA)
UR - http://www.scopus.com/inward/record.url?scp=85151746411&partnerID=8YFLogxK
U2 - 10.1016/j.jprot.2023.104895
DO - 10.1016/j.jprot.2023.104895
M3 - Article
C2 - 37024076
AN - SCOPUS:85151746411
SN - 1874-3919
VL - 280
JO - Journal of Proteomics
JF - Journal of Proteomics
M1 - 104895
ER -