TY - JOUR
T1 - Emergent unsupervised clustering paradigms with potential application to bioinformatics
AU - Miller, David J.
AU - Wang, Yue
AU - Kesidis, George
N1 - Copyright 2008 Elsevier B.V., All rights reserved.
PY - 2008
Y1 - 2008
AB - In recent years, there has been a great upsurge in the application of data clustering, statistical classification, and related machine learning techniques to the field of molecular biology, in particular analysis of DNA microarray expression data. Clustering methods can be used to group co-expressed genes, shedding light on gene function and co-regulation. Alternatively, they can group samples or conditions to identify phenotypical groups, disease subgroups, or to help identify disease pathways. A rich variety of unsupervised techniques have been applied, including partitional, hierarchical, graph-based, model-based, and biclustering methods. While a number of machine learning problems and tools have found mainstream applications in bioinformatics, in this article we identify some challenging problems which, though clearly relevant to bioinformatics, have not been extensively investigated in this domain. These include i) unsupervised clustering with unsupervised feature selection, ii) semisupervised learning, iii) unsupervised learning (and supervised learning) in the presence of confounding variables, and iv) stability of clustering solutions. We review recent methods which address these problems and take the position that these methods are well-suited to addressing some common scenarios that occur in bioinformatics.
UR - http://www.scopus.com/inward/record.url?scp=37549070609&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=37549070609&partnerID=8YFLogxK
U2 - 10.2741/2711
DO - 10.2741/2711
M3 - Review article
C2 - 17981579
AN - SCOPUS:37549070609
VL - 13
SP - 677
EP - 690
JO - Frontiers in Bioscience - Landmark
JF - Frontiers in Bioscience - Landmark
SN - 1093-9946
IS - 2
ER -