This study compares disease classification based on landmark and non-landmark gene expression data and clinical variables, using multiple machine-learning models. The influence of the number of principal components (PCs) and the number of genes was also investigated. The results indicate that the artificial neural network (ANN) model achieves the best accuracy for disease-type prediction among all the models, that models using 95 PCs are more accurate than those using 25 PCs, and that prediction accuracy increases with the number of genes used. Models using landmark genes were more accurate than models using non-landmark genes, especially with 95 PCs, across all models except decision trees. The optimal model used landmark genes with 95 PCs as features for an ANN classifier. On the test set, the AUC values were 0.98, 0.98, 1, and 0.96 for the Autoimmune, Bacteremia, Cancer, and Healthy classes, respectively, and the corresponding per-class accuracies were 97.56%, 95.65%, 95.65%, and 58.82%. The ANN model distinguished true positives from false positives well and achieved high prediction accuracy for the three disease classes (Autoimmune, Bacteremia, and Cancer), but it misclassified some Healthy instances as Autoimmune or Bacteremia, likely because of the wide range of gene expression levels in the Healthy class.
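The gap between the Healthy AUC (0.96) and the Healthy accuracy (58.82%) is easier to see with a small worked example. The sketch below is illustrative only: it assumes that the per-class AUC is computed one-vs-rest from the classifier's class scores and that per-class accuracy means the fraction of samples of a class that are assigned to that class, neither of which the abstract states explicitly. The function names and the toy probabilities are hypothetical and are not taken from the study.

```python
from typing import Dict, List, Sequence, Tuple

CLASSES = ["Autoimmune", "Bacteremia", "Cancer", "Healthy"]

# Hypothetical toy data: (true label, predicted class probabilities in the
# order of CLASSES). These numbers are invented for illustration.
SAMPLES: List[Tuple[str, List[float]]] = [
    ("Autoimmune", [0.70, 0.10, 0.10, 0.10]),
    ("Autoimmune", [0.60, 0.20, 0.10, 0.10]),
    ("Bacteremia", [0.10, 0.70, 0.10, 0.10]),
    ("Bacteremia", [0.20, 0.60, 0.10, 0.10]),
    ("Cancer",     [0.05, 0.05, 0.85, 0.05]),
    ("Cancer",     [0.10, 0.10, 0.75, 0.05]),
    ("Healthy",    [0.20, 0.10, 0.10, 0.60]),
    ("Healthy",    [0.45, 0.15, 0.05, 0.35]),  # ranked well, but argmax is Autoimmune
]


def one_vs_rest_auc(y_true: Sequence[str], scores: Sequence[float], positive: str) -> float:
    """Probability that a random sample of `positive` scores higher than a
    random sample of any other class (ties count as one half)."""
    pos = [s for y, s in zip(y_true, scores) if y == positive]
    neg = [s for y, s in zip(y_true, scores) if y != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def per_class_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Fraction of each true class that was assigned to that class."""
    result: Dict[str, float] = {}
    for c in CLASSES:
        idx = [i for i, y in enumerate(y_true) if y == c]
        if idx:
            result[c] = sum(y_pred[i] == c for i in idx) / len(idx)
    return result


y_true = [label for label, _ in SAMPLES]
probs = [p for _, p in SAMPLES]
# Hard prediction: the class with the highest probability.
y_pred = [CLASSES[max(range(len(CLASSES)), key=lambda j: p[j])] for p in probs]

auc = {c: one_vs_rest_auc(y_true, [p[k] for p in probs], c) for k, c in enumerate(CLASSES)}
acc = per_class_accuracy(y_true, y_pred)

for c in CLASSES:
    print(f"{c}: AUC={auc[c]:.2f}, accuracy={acc[c]:.2%}")
```

In this toy data, every Healthy sample receives a higher Healthy score than any non-Healthy sample, so the Healthy AUC is high, yet one of the two Healthy samples is still assigned to Autoimmune because another class has the largest probability. Under the assumptions above, this is one way a class can combine a high AUC with a low per-class accuracy.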
All Science Journal Classification (ASJC) codes
- Computer Science(all)