A top-down method for mining most specific frequent patterns in biological sequence data

Martin Ester, Xiang Zhang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

14 Citations (Scopus)

Abstract

The emergence of automated high-throughput sequencing technologies has resulted in a huge increase in the number of DNA and protein sequences available in public databases. A promising approach for mining such biological sequence data is mining frequent subsequences. One way to limit the number of patterns discovered is to determine only the most specific frequent subsequences, which subsume a large number of more general patterns. In the biological domain, a wealth of knowledge has been acquired on the relationships between the symbols of the underlying alphabets of the sequences (in particular, amino acids), and this knowledge can be represented in concept graphs. Using such concept graphs, much longer frequent patterns can be discovered which are more meaningful from a biological point of view. In this paper, we introduce the problem of mining most specific frequent patterns in biological data in the presence of concept graphs. While the well-known methods for frequent sequence mining typically follow the paradigm of bottom-up pattern generation, we present a novel top-down method (ToMMS) for mining such patterns. ToMMS (1) always generates more specific patterns before more general ones and (2) performs only minimal generalizations of infrequent candidate sequences. Due to these properties, the number of patterns generated and tested is minimized. Our experimental results demonstrate that ToMMS clearly outperforms state-of-the-art methods from the bioinformatics community as well as from the data mining community for reasonably low minimum support thresholds.
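The ideas named in the abstract (a concept graph over amino acid symbols, the support of a pattern, and a minimal generalization of an infrequent candidate) can be illustrated with a small Python sketch. This is not the ToMMS algorithm, whose details the abstract does not give. The concept graph `PARENTS`, the concept names `aliphatic`, `positive` and `any`, the toy sequences, and the choice to match patterns against contiguous windows are all assumptions made for illustration only.

```python
# Hypothetical toy concept graph: symbol or concept -> its direct parent concepts.
# The groupings below are illustrative and are not taken from the paper.
PARENTS = {
    "I": ["aliphatic"],
    "L": ["aliphatic"],
    "V": ["aliphatic"],
    "K": ["positive"],
    "R": ["positive"],
    "aliphatic": ["any"],
    "positive": ["any"],
}


def ancestors(symbol):
    """Return the symbol together with every concept reachable above it."""
    seen = {symbol}
    stack = [symbol]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def matches_at(pattern, sequence, start):
    """A pattern element matches a residue if it equals it or generalizes it."""
    return all(p in ancestors(sequence[start + i]) for i, p in enumerate(pattern))


def support(pattern, sequences):
    """Fraction of sequences containing the pattern as a contiguous window (assumption)."""
    hits = sum(
        any(matches_at(pattern, s, k) for k in range(len(s) - len(pattern) + 1))
        for s in sequences
    )
    return hits / len(sequences)


def minimal_generalizations(pattern):
    """Patterns obtained by lifting exactly one element one step up the concept graph."""
    result = []
    for i, symbol in enumerate(pattern):
        for parent in PARENTS.get(symbol, []):
            result.append(pattern[:i] + (parent,) + pattern[i + 1:])
    return result


sequences = ["MKLA", "GKIA", "TRVA"]  # made-up toy data
specific = ("K", "L", "A")
print(specific, support(specific, sequences))
for candidate in minimal_generalizations(specific):
    print(candidate, support(candidate, sequences))
```

In this toy setting the fully specific pattern occurs in only one sequence, while lifting a single element to a parent concept can raise its support. That is the intuition behind generalizing an infrequent candidate one minimal step at a time rather than growing patterns bottom-up.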

Original language: English (US)
Title of host publication: Proceedings of the Fourth SIAM International Conference on Data Mining
Editors: M.W. Berry, U. Dayal, C. Kamath, D. Skillicorn
Pages: 90-101
Number of pages: 12
State: Published - 2004
Event: Proceedings of the Fourth SIAM International Conference on Data Mining - Lake Buena Vista, FL, United States
Duration: Apr 22 2004 to Apr 24 2004

Other

Other: Proceedings of the Fourth SIAM International Conference on Data Mining
Country: United States
City: Lake Buena Vista, FL
Period: 4/22/04 to 4/24/04

Fingerprint

Frequent Pattern
Mining
Subsequence
Graph in graph theory
Protein Sequence
Bottom-up
DNA Sequence
Sequencing
High Throughput
Amino Acids
Bioinformatics
Data Mining
Paradigm
Experimental Results
Demonstrate
Concepts

All Science Journal Classification (ASJC) codes

  • Mathematics(all)

Cite this

Ester, M., & Zhang, X. (2004). A top-down method for mining most specific frequent patterns in biological sequence data. In M. W. Berry, U. Dayal, C. Kamath, & D. Skillicorn (Eds.), Proceedings of the Fourth SIAM International Conference on Data Mining (pp. 90-101)
Ester, Martin ; Zhang, Xiang. / A top-down method for mining most specific frequent patterns in biological sequence data. Proceedings of the Fourth SIAM International Conference on Data Mining. editor / M.W. Berry ; U. Dayal ; C. Kamath ; D. Skillicorn. 2004. pp. 90-101
@inproceedings{ba7e4d570e474c7cb31c7e57e570bc18,
title = "A top-down method for mining most specific frequent patterns in biological sequence data",
abstract = "The emergence of automated high-throughput sequencing technologies has resulted in a huge increase of the amount of DNA and protein sequences available in public databases. A promising approach for mining such biological sequence data is mining frequent subsequences. One way to limit the number of patterns discovered is to determine only the most specific frequent subsequences which subsume a large number of more general patterns. In the biological domain, a wealth of knowledge on the relationships between the symbols of the underlying alphabets (in particular, amino acids) of the sequences has been acquired, which can be represented in concept graphs. Using such concept graphs, much longer frequent patterns can be discovered which are more meaningful from a biological point of view. In this paper, we introduce the problem of mining most specific frequent patterns in biological data in the presence of concept graphs. While the well-known methods for frequent sequence mining typically follow the paradigm of bottom-up pattern generation, we present a novel top-down method (ToMMS) for mining such patterns. ToMMS (1) always generates more specific patterns before more general ones and (2) performs only minimal generalizations of infrequent candidate sequences. Due to these properties, the number of patterns generated and tested is minimized. Our experimental results demonstrate that ToMMS clearly outperforms state-of-the-art methods from the bioinformatics community as well as from the data mining community for reasonably low minimum support thresholds.",
author = "Martin Ester and Xiang Zhang",
year = "2004",
language = "English (US)",
pages = "90--101",
editor = "M.W. Berry and U. Dayal and C. Kamath and D. Skillicorn",
booktitle = "Proceedings of the Fourth SIAM International Conference on Data Mining",

}

Ester, M & Zhang, X 2004, A top-down method for mining most specific frequent patterns in biological sequence data. in MW Berry, U Dayal, C Kamath & D Skillicorn (eds), Proceedings of the Fourth SIAM International Conference on Data Mining. pp. 90-101, Proceedings of the Fourth SIAM International Conference on Data Mining, Lake Buena Vista, FL, United States, 4/22/04.

TY - GEN

T1 - A top-down method for mining most specific frequent patterns in biological sequence data

AU - Ester, Martin

AU - Zhang, Xiang

PY - 2004

Y1 - 2004

N2 - The emergence of automated high-throughput sequencing technologies has resulted in a huge increase of the amount of DNA and protein sequences available in public databases. A promising approach for mining such biological sequence data is mining frequent subsequences. One way to limit the number of patterns discovered is to determine only the most specific frequent subsequences which subsume a large number of more general patterns. In the biological domain, a wealth of knowledge on the relationships between the symbols of the underlying alphabets (in particular, amino acids) of the sequences has been acquired, which can be represented in concept graphs. Using such concept graphs, much longer frequent patterns can be discovered which are more meaningful from a biological point of view. In this paper, we introduce the problem of mining most specific frequent patterns in biological data in the presence of concept graphs. While the well-known methods for frequent sequence mining typically follow the paradigm of bottom-up pattern generation, we present a novel top-down method (ToMMS) for mining such patterns. ToMMS (1) always generates more specific patterns before more general ones and (2) performs only minimal generalizations of infrequent candidate sequences. Due to these properties, the number of patterns generated and tested is minimized. Our experimental results demonstrate that ToMMS clearly outperforms state-of-the-art methods from the bioinformatics community as well as from the data mining community for reasonably low minimum support thresholds.

UR - http://www.scopus.com/inward/record.url?scp=2942618802&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=2942618802&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:2942618802

SP - 90

EP - 101

BT - Proceedings of the Fourth SIAM International Conference on Data Mining

A2 - Berry, M.W.

A2 - Dayal, U.

A2 - Kamath, C.

A2 - Skillicorn, D.

ER -

Ester M, Zhang X. A top-down method for mining most specific frequent patterns in biological sequence data. In Berry MW, Dayal U, Kamath C, Skillicorn D, editors, Proceedings of the Fourth SIAM International Conference on Data Mining. 2004. p. 90-101