TY - JOUR
T1 - Comparative Study of Disease Classification Using Multiple Machine Learning Models Based on Landmark and Non-Landmark Gene Expression Data
AU - Huang, Xiaoqin
AU - Sun, Jian
AU - Srinivasan, Satish Mahadevan
AU - Sangwan, Raghvinder S.
N1 - Publisher Copyright:
© 2021 Elsevier B.V. All rights reserved.
PY - 2021
Y1 - 2021
N2 - This study compares disease classification based on landmark and non-landmark gene expression data and clinical variables using multiple machine-learning models. The influence of the number of principal components (PCs) and the number of genes was also investigated. The results indicate that the artificial neural network (ANN) model has the best accuracy for disease type prediction among all the models, that models using 95 PCs are more accurate than those using 25 PCs, and that the more genes used, the higher the prediction accuracy. Models using landmark genes were more accurate than models using non-landmark genes, especially with 95 PCs, across all models except decision trees. The optimal model used landmark genes with 95 PCs as features for an ANN classifier. The AUC values obtained on the test set were 0.98, 0.98, 1, and 0.96 for the Autoimmune, Bacteremia, Cancer, and Healthy classes, respectively, and the accuracies for the respective classes were 97.56%, 95.65%, 95.65%, and 58.82%. The ANN model distinguished well between true positives and false positives and achieved high prediction accuracy for the three disease classes (Autoimmune, Bacteremia, Cancer), but it misclassified some instances of the Healthy class as Autoimmune or Bacteremia, likely because of the wide range of gene expression levels in the Healthy class.
AB - This study compares disease classification based on landmark and non-landmark gene expression data and clinical variables using multiple machine-learning models. The influence of the number of principal components (PCs) and the number of genes was also investigated. The results indicate that the artificial neural network (ANN) model has the best accuracy for disease type prediction among all the models, that models using 95 PCs are more accurate than those using 25 PCs, and that the more genes used, the higher the prediction accuracy. Models using landmark genes were more accurate than models using non-landmark genes, especially with 95 PCs, across all models except decision trees. The optimal model used landmark genes with 95 PCs as features for an ANN classifier. The AUC values obtained on the test set were 0.98, 0.98, 1, and 0.96 for the Autoimmune, Bacteremia, Cancer, and Healthy classes, respectively, and the accuracies for the respective classes were 97.56%, 95.65%, 95.65%, and 58.82%. The ANN model distinguished well between true positives and false positives and achieved high prediction accuracy for the three disease classes (Autoimmune, Bacteremia, Cancer), but it misclassified some instances of the Healthy class as Autoimmune or Bacteremia, likely because of the wide range of gene expression levels in the Healthy class.
UR - http://www.scopus.com/inward/record.url?scp=85112705773&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85112705773&partnerID=8YFLogxK
U2 - 10.1016/j.procs.2021.05.028
DO - 10.1016/j.procs.2021.05.028
M3 - Conference article
AN - SCOPUS:85112705773
VL - 185
SP - 264
EP - 273
JO - Procedia Computer Science
JF - Procedia Computer Science
SN - 1877-0509
T2 - 2021 Complex Adaptive Systems Conference
Y2 - 16 June 2021 through 18 June 2021
ER -