Automated data-driven discovery of motif-based protein function classifiers

Xiangyun Wang, Diane Schroeder, Drena Dobbs, Vasant Honavar

Research output: Contribution to journal › Article

24 Citations (Scopus)

Abstract

This paper describes an approach to data-driven discovery of decision trees or rules for assigning protein sequences to functional families using sequence motifs. The method can capture regularities that are expressible as the presence or absence of arbitrary combinations of motifs. A training set of peptidase sequences labeled with the corresponding MEROPS functional families or clans is used to automatically construct decision trees that capture regularities sufficient to assign the sequences to their respective functional families. The performance of the resulting decision tree classifiers is then evaluated on an independent test set. We compared the rules constructed using motifs generated by a multiple-sequence-alignment-based motif discovery tool (MEME) with rules constructed using expert-annotated PROSITE motifs (patterns and profiles). Our results indicate that the former (rules built from MEME-generated motifs) provide a potentially powerful high-throughput technique for constructing protein function classifiers when adequate training data are available. Examination of the generated rules in relation to known three-dimensional structures of members of two families (MEROPS families C14 and M12) suggests that the proposed technique may be able to identify combinations of sequence motifs that characterize functionally significant three-dimensional structural features of proteins.
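The idea in the abstract, representing each sequence by the presence or absence of motifs and inducing a decision tree over those binary features, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the three regular-expression motifs, the toy sequences, and the labels `family_A` to `family_C` are invented rather than taken from MEME, PROSITE, or MEROPS, and the ID3-style entropy split stands in for whatever tree induction algorithm the paper actually uses, which the abstract does not specify.

```python
import math
import re
from collections import Counter

# Hypothetical toy motifs as regular expressions (not real MEME or PROSITE output).
MOTIFS = {
    "m1": re.compile(r"HE..H"),
    "m2": re.compile(r"QAC[RK]G"),
    "m3": re.compile(r"G.S.G"),
}

def motif_features(sequence):
    """Map a sequence to {motif name: True if the motif occurs in it}."""
    return {name: bool(rx.search(sequence)) for name, rx in MOTIFS.items()}

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def build_tree(rows, labels, motifs):
    """ID3-style induction over binary motif features.

    Returns a class label (leaf) or a tuple
    (motif, subtree_if_present, subtree_if_absent).
    """
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not motifs:
        return majority

    def split_entropy(motif):
        present = [l for r, l in zip(rows, labels) if r[motif]]
        absent = [l for r, l in zip(rows, labels) if not r[motif]]
        if not present or not absent:
            return float("inf")  # this motif does not separate anything
        return (len(present) * entropy(present) + len(absent) * entropy(absent)) / len(labels)

    best = min(motifs, key=split_entropy)
    if split_entropy(best) == float("inf"):
        return majority
    rest = [m for m in motifs if m != best]
    yes = [(r, l) for r, l in zip(rows, labels) if r[best]]
    no = [(r, l) for r, l in zip(rows, labels) if not r[best]]
    return (best,
            build_tree([r for r, _ in yes], [l for _, l in yes], rest),
            build_tree([r for r, _ in no], [l for _, l in no], rest))

def classify(tree, features):
    """Walk the tree using motif presence/absence until a leaf label is reached."""
    while isinstance(tree, tuple):
        motif, present_branch, absent_branch = tree
        tree = present_branch if features[motif] else absent_branch
    return tree

def tree_to_rules(tree, conditions=()):
    """Flatten the tree into (list of (motif, present?) conditions, label) rules."""
    if not isinstance(tree, tuple):
        return [(list(conditions), tree)]
    motif, present_branch, absent_branch = tree
    return (tree_to_rules(present_branch, conditions + ((motif, True),))
            + tree_to_rules(absent_branch, conditions + ((motif, False),)))

# Invented training data: (sequence, family label).
TRAINING = [
    ("AAHEMGHAA", "family_A"), ("TTHELLHTT", "family_A"),
    ("KKQACRGKK", "family_B"), ("LLQACKGLL", "family_B"),
    ("PPGDSGGPP", "family_C"), ("VVGASAGVV", "family_C"),
]

rows = [motif_features(seq) for seq, _ in TRAINING]
labels = [label for _, label in TRAINING]
tree = build_tree(rows, labels, list(MOTIFS))

for conditions, label in tree_to_rules(tree):
    print(conditions, "->", label)
```

Each printed rule is a conjunction of motif presence or absence tests, which is the kind of regularity the abstract says the method captures. A realistic setup would replace the regular expressions with motif matches reported by a motif discovery or annotation tool and would evaluate the tree on a held-out test set.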

Original language: English (US)
Pages (from-to): 1-18
Number of pages: 18
Journal: Information Sciences
Volume: 155
Issue number: 1-2
DOI: 10.1016/S0020-0255(03)00067-7
State: Published - Oct 1 2003

All Science Journal Classification (ASJC) codes

  • Software
  • Control and Systems Engineering
  • Theoretical Computer Science
  • Computer Science Applications
  • Information Systems and Management
  • Artificial Intelligence

Cite this

Wang, Xiangyun ; Schroeder, Diane ; Dobbs, Drena ; Honavar, Vasant. / Automated data-driven discovery of motif-based protein function classifiers. In: Information Sciences. 2003 ; Vol. 155, No. 1-2. pp. 1-18.
@article{ce91d6b4da334bb0ace647b7d4f4a0ab,
title = "Automated data-driven discovery of motif-based protein function classifiers",
author = "Xiangyun Wang and Diane Schroeder and Drena Dobbs and Vasant Honavar",
year = "2003",
month = "10",
day = "1",
doi = "10.1016/S0020-0255(03)00067-7",
language = "English (US)",
volume = "155",
pages = "1--18",
journal = "Information Sciences",
issn = "0020-0255",
publisher = "Elsevier Inc.",
number = "1-2"
}

TY - JOUR
T1 - Automated data-driven discovery of motif-based protein function classifiers
AU - Wang, Xiangyun
AU - Schroeder, Diane
AU - Dobbs, Drena
AU - Honavar, Vasant
PY - 2003/10/1
Y1 - 2003/10/1
UR - http://www.scopus.com/inward/record.url?scp=0042235504&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0042235504&partnerID=8YFLogxK
U2 - 10.1016/S0020-0255(03)00067-7
DO - 10.1016/S0020-0255(03)00067-7
M3 - Article
AN - SCOPUS:0042235504
VL - 155
SP - 1
EP - 18
JO - Information Sciences
JF - Information Sciences
SN - 0020-0255
IS - 1-2
ER -