Assessing the performance of macromolecular sequence classifiers

Cornelia Caragea, Jivko Sinapov, Vasant Honavar, Drena Dobbs

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

14 Citations (Scopus)

Abstract

Machine learning offers some of the most cost-effective approaches to building predictive models (e.g., classifiers) in a broad range of applications in computational biology. Comparing the effectiveness of different algorithms requires reliable procedures for accurately assessing the performance (e.g., accuracy, sensitivity, and specificity) of the resulting predictive classifiers. The difficulty of this task is compounded by the use of different data selection and evaluation procedures and, in some cases, even different definitions of the same performance measures. We explore the problem of assessing the performance of predictive classifiers trained on macromolecular sequence data, with an emphasis on cross-validation and data selection methods. Specifically, we compare sequence-based and window-based cross-validation procedures on three sequence-based prediction tasks: identification of glycosylation sites, RNA-Protein interface residues, and Protein-Protein interface residues from amino acid sequence. Our experiments with two representative classifiers (Naive Bayes and Support Vector Machine) show that sequence-based and window-based cross-validation procedures and data selection methods can yield different estimates of commonly used performance measures such as accuracy, Matthews correlation coefficient, and area under the Receiver Operating Characteristic curve. We argue that sequence-based cross-validation provides more realistic estimates of performance than window-based cross-validation.
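The difference between the two cross-validation procedures named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the window half-width, and the toy sequences and labels are all assumptions made for illustration, and the labels carry no biological meaning. The sketch splits the same set of residue windows two ways, then checks whether any source sequence contributes windows to both the training and the test side of a fold. It also includes the standard confusion-matrix formula for the Matthews correlation coefficient.

```python
import random
from math import sqrt


def extract_windows(sequences, labels, half_width=2, pad="-"):
    """Return (seq_id, window, label) for every residue of every sequence."""
    examples = []
    for seq_id, (seq, lab) in enumerate(zip(sequences, labels)):
        padded = pad * half_width + seq + pad * half_width
        for i, y in enumerate(lab):
            # Window of 2*half_width + 1 residues centred on residue i.
            window = padded[i:i + 2 * half_width + 1]
            examples.append((seq_id, window, int(y)))
    return examples


def window_based_folds(examples, k, seed=0):
    """Shuffle individual windows into k folds, ignoring their source sequence."""
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def sequence_based_folds(examples, k, seed=0):
    """Assign whole sequences to folds so their windows always stay together."""
    seq_ids = sorted({e[0] for e in examples})
    random.Random(seed).shuffle(seq_ids)
    fold_of = {s: n % k for n, s in enumerate(seq_ids)}
    folds = [[] for _ in range(k)]
    for i, e in enumerate(examples):
        folds[fold_of[e[0]]].append(i)
    return folds


def shared_sequences(examples, folds, test_fold):
    """Sequence ids with windows in both the test fold and the training folds."""
    test = {examples[i][0] for i in folds[test_fold]}
    train = {examples[i][0]
             for f, fold in enumerate(folds) if f != test_fold
             for i in fold}
    return test & train


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is 0."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy data (invented): four short sequences with one labelled residue each.
toy_sequences = ["MKTNASLL", "GNVTRRAQ", "PLNGSAAK", "TTNCSMKV"]
toy_labels = ["00010000", "01000000", "00100000", "00100000"]
toy_examples = extract_windows(toy_sequences, toy_labels)

for name, make_folds in [("window-based", window_based_folds),
                         ("sequence-based", sequence_based_folds)]:
    folds = make_folds(toy_examples, k=3)
    overlap = [sorted(shared_sequences(toy_examples, folds, t)) for t in range(3)]
    print(name, "sequences shared between train and test, per fold:", overlap)
```

In the window-based split, overlapping windows cut from one sequence can land on both sides of a fold, so the test windows are not independent of the training data. The sequence-based split keeps every sequence entirely in training or entirely in test, which is the situation a classifier faces on a previously unseen sequence.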

Original language: English (US)
Title of host publication: Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, BIBE
Pages: 320-326
Number of pages: 7
DOIs: https://doi.org/10.1109/BIBE.2007.4375583
State: Published - Dec 1 2007
Event: 7th IEEE International Conference on Bioinformatics and Bioengineering, BIBE - Boston, MA, United States
Duration: Jan 14 2007 to Jan 17 2007

Publication series

Name: Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, BIBE


Fingerprint

Classifiers
Proteins
Glycosylation
RNA
Support vector machines
Learning systems
Amino acids
Computational Biology
ROC Curve
Amino Acid Sequence
Costs
Costs and Cost Analysis
Sensitivity and Specificity
Experiments

All Science Journal Classification (ASJC) codes

  • Biotechnology
  • Genetics
  • Bioengineering

Cite this

Caragea, C., Sinapov, J., Honavar, V., & Dobbs, D. (2007). Assessing the performance of macromolecular sequence classifiers. In Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, BIBE (pp. 320-326). [4375583] (Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, BIBE). https://doi.org/10.1109/BIBE.2007.4375583
Caragea, Cornelia ; Sinapov, Jivko ; Honavar, Vasant ; Dobbs, Drena. / Assessing the performance of macromolecular sequence classifiers. Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, BIBE. 2007. pp. 320-326 (Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, BIBE).
@inproceedings{a33f677d275a492bb24a89dc6830f635,
title = "Assessing the performance of macromolecular sequence classifiers",
abstract = "Machine learning offers some of the most cost-effective approaches to building predictive models (e.g., classifiers) in a broad range of applications in computational biology. Comparing the effectiveness of different algorithms requires reliable procedures for accurately assessing the performance (e.g., accuracy, sensitivity, and specificity) of the resulting predictive classifiers. The difficulty of this task is compounded by the use of different data selection and evaluation procedures and, in some cases, even different definitions of the same performance measures. We explore the problem of assessing the performance of predictive classifiers trained on macromolecular sequence data, with an emphasis on cross-validation and data selection methods. Specifically, we compare sequence-based and window-based cross-validation procedures on three sequence-based prediction tasks: identification of glycosylation sites, RNA-Protein interface residues, and Protein-Protein interface residues from amino acid sequence. Our experiments with two representative classifiers (Naive Bayes and Support Vector Machine) show that sequence-based and window-based cross-validation procedures and data selection methods can yield different estimates of commonly used performance measures such as accuracy, Matthews correlation coefficient, and area under the Receiver Operating Characteristic curve. We argue that sequence-based cross-validation provides more realistic estimates of performance than window-based cross-validation.",
author = "Cornelia Caragea and Jivko Sinapov and Vasant Honavar and Drena Dobbs",
year = "2007",
month = "12",
day = "1",
doi = "10.1109/BIBE.2007.4375583",
language = "English (US)",
isbn = "1424415098",
series = "Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, BIBE",
pages = "320--326",
booktitle = "Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, BIBE",
}

Caragea, C, Sinapov, J, Honavar, V & Dobbs, D 2007, Assessing the performance of macromolecular sequence classifiers. in Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, BIBE., 4375583, Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, BIBE, pp. 320-326, 7th IEEE International Conference on Bioinformatics and Bioengineering, BIBE, Boston, MA, United States, 1/14/07. https://doi.org/10.1109/BIBE.2007.4375583

Assessing the performance of macromolecular sequence classifiers. / Caragea, Cornelia; Sinapov, Jivko; Honavar, Vasant; Dobbs, Drena.

Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, BIBE. 2007. p. 320-326 4375583 (Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, BIBE).

TY - GEN

T1 - Assessing the performance of macromolecular sequence classifiers

AU - Caragea, Cornelia

AU - Sinapov, Jivko

AU - Honavar, Vasant

AU - Dobbs, Drena

PY - 2007/12/1

Y1 - 2007/12/1

AB - Machine learning offers some of the most cost-effective approaches to building predictive models (e.g., classifiers) in a broad range of applications in computational biology. Comparing the effectiveness of different algorithms requires reliable procedures for accurately assessing the performance (e.g., accuracy, sensitivity, and specificity) of the resulting predictive classifiers. The difficulty of this task is compounded by the use of different data selection and evaluation procedures and, in some cases, even different definitions of the same performance measures. We explore the problem of assessing the performance of predictive classifiers trained on macromolecular sequence data, with an emphasis on cross-validation and data selection methods. Specifically, we compare sequence-based and window-based cross-validation procedures on three sequence-based prediction tasks: identification of glycosylation sites, RNA-Protein interface residues, and Protein-Protein interface residues from amino acid sequence. Our experiments with two representative classifiers (Naive Bayes and Support Vector Machine) show that sequence-based and window-based cross-validation procedures and data selection methods can yield different estimates of commonly used performance measures such as accuracy, Matthews correlation coefficient, and area under the Receiver Operating Characteristic curve. We argue that sequence-based cross-validation provides more realistic estimates of performance than window-based cross-validation.

UR - http://www.scopus.com/inward/record.url?scp=47649094232&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=47649094232&partnerID=8YFLogxK

U2 - 10.1109/BIBE.2007.4375583

DO - 10.1109/BIBE.2007.4375583

M3 - Conference contribution

AN - SCOPUS:47649094232

SN - 1424415098

SN - 9781424415090

T3 - Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, BIBE

SP - 320

EP - 326

BT - Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, BIBE

ER -

Caragea C, Sinapov J, Honavar V, Dobbs D. Assessing the performance of macromolecular sequence classifiers. In Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, BIBE. 2007. p. 320-326. 4375583. (Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, BIBE). https://doi.org/10.1109/BIBE.2007.4375583