Data-Driven Discovery of Protein Function Classifiers: Decision Trees Based on Meme Motifs Outperform Prosite Patterns and Profiles on Peptidase Families

Xiangyun Wang, Diane Schroeder, Drena Dobbs, Vasant Honavar

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

This paper describes an approach to data-driven discovery of decision trees or rules for assigning protein sequences to functional families using sequence motifs. The method captures regularities that can be described in terms of the presence or absence of arbitrary combinations of motifs. A training set of peptidase sequences labeled with the corresponding MEROPS functional families or clans is used to automatically construct decision trees that capture regularities sufficient to assign the sequences to their respective functional families. The performance of the resulting decision tree classifiers is then evaluated on an independent test set. We compared the rules constructed using motifs generated by a multiple-sequence-alignment-based motif discovery tool (MEME) with rules constructed using expert-annotated PROSITE motifs (patterns and profiles). Our results indicate that the former provide a potentially powerful high-throughput technique for constructing protein function classifiers when adequate training data are available. Examination of the generated rules in the case of the caspase (C14) family suggests that the proposed technique may be able to identify combinations of sequence motifs that characterize functionally significant three-dimensional structural features of proteins.
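
The idea of classifying sequences by the presence or absence of motif combinations can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the tree-induction algorithm, so a plain ID3-style entropy split is assumed here, and the motif IDs (m1, m2, m3) and the six training rows are invented purely for illustration. Only the family names (C14, S1, M10) are real MEROPS identifiers.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def build_tree(rows, labels, motifs):
    """ID3-style induction over binary motif-presence features (assumed algorithm).

    rows   : list of sets, each holding the motif IDs found in one sequence
    labels : family label for each row
    motifs : list of motif IDs still available for splitting
    """
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not motifs:
        return majority  # leaf: a single family, or nothing left to split on

    def gain(motif):
        present = [lab for row, lab in zip(rows, labels) if motif in row]
        absent = [lab for row, lab in zip(rows, labels) if motif not in row]
        if not present or not absent:
            return 0.0  # motif does not separate anything
        weighted = (len(present) * entropy(present)
                    + len(absent) * entropy(absent)) / len(labels)
        return entropy(labels) - weighted

    best = max(motifs, key=gain)
    if gain(best) <= 1e-12:
        return majority  # no motif is informative here

    rest = [m for m in motifs if m != best]
    yes = [(r, lab) for r, lab in zip(rows, labels) if best in r]
    no = [(r, lab) for r, lab in zip(rows, labels) if best not in r]
    return {
        "motif": best,
        "present": build_tree([r for r, _ in yes], [lab for _, lab in yes], rest),
        "absent": build_tree([r for r, _ in no], [lab for _, lab in no], rest),
    }

def classify(tree, motifs_found):
    """Walk the tree, branching on whether each tested motif is present."""
    while isinstance(tree, dict):
        tree = tree["present"] if tree["motif"] in motifs_found else tree["absent"]
    return tree

# Toy training data (hypothetical): motifs found per sequence, and its family.
train_rows = [{"m1", "m2"}, {"m1", "m2", "m3"}, {"m1"}, {"m1", "m3"}, {"m3"}, set()]
train_labels = ["C14", "C14", "S1", "S1", "M10", "M10"]

tree = build_tree(train_rows, train_labels, ["m1", "m2", "m3"])
prediction = classify(tree, {"m1", "m2"})
```

In this toy setting the learned rule reads as a conjunction of motif tests (for example, m1 present and m2 present implies C14), which is the kind of combination the abstract refers to. A real pipeline would first obtain the motif sets by scanning each sequence against MEME or PROSITE motifs and would evaluate the tree on a held-out test set.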

Original language: English (US)
Title of host publication: Proceedings of the 6th Joint Conference on Information Sciences, JCIS 2002
Editors: J.H. Caulfield, S.H. Chen, H.D. Cheng, R. Duro, V. Honavar
Pages: 1193-1199
Number of pages: 7
State: Published - Dec 1 2002
Event: Proceedings of the 6th Joint Conference on Information Sciences, JCIS 2002 - Research Triangle Park, NC, United States
Duration: Mar 8 2002 to Mar 13 2002

Publication series

Name: Proceedings of the Joint Conference on Information Sciences
Volume: 6

Fingerprint

Decision trees
Classifiers
Proteins
Throughput
Peptide Hydrolases

All Science Journal Classification (ASJC) codes

  • Computer Science (all)

Cite this

Wang, X., Schroeder, D., Dobbs, D., & Honavar, V. (2002). Data-Driven Discovery of Protein Function Classifiers: Decision Trees Based on Meme Motifs Outperform Prosite Patterns and Profiles on Peptidase Families. In J. H. Caulfield, S. H. Chen, H. D. Cheng, R. Duro, & V. Honavar (Eds.), Proceedings of the 6th Joint Conference on Information Sciences, JCIS 2002 (pp. 1193-1199). (Proceedings of the Joint Conference on Information Sciences; Vol. 6).
@inproceedings{d614968edf144db89a32f6e2a32766e8,
title = "Data-Driven Discovery of Protein Function Classifiers: Decision Trees Based on Meme Motifs Outperform Prosite Patterns and Profiles on Peptidase Families",
author = "Xiangyun Wang and Diane Schroeder and Drena Dobbs and Vasant Honavar",
year = "2002",
month = "12",
day = "1",
language = "English (US)",
isbn = "0970789017",
series = "Proceedings of the Joint Conference on Information Sciences",
pages = "1193--1199",
editor = "J.H. Caulfield and S.H. Chen and H.D. Cheng and R. Duro and V. Honavar",
booktitle = "Proceedings of the 6th Joint Conference on Information Sciences, JCIS 2002",

}
