Scalable, updatable predictive models for sequence data

Neeraj Koul, Ngot Bui, Vasant Honavar

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

The emergence of data-rich domains has led to an exponential growth in the size and number of data repositories, offering exciting opportunities to learn from the data using machine learning algorithms. In particular, sequence data is being made available at a rapid rate. In many applications, the learning algorithm may not have direct access to the entire dataset for a variety of reasons, such as massive data size or bandwidth limitations. In such settings, there is a need for techniques that can learn predictive models (e.g., classifiers) from large datasets without direct access to the data. We describe an approach to learning from massive sequence datasets using statistical queries. Specifically, we show how Markov Models and Probabilistic Suffix Trees (PSTs) can be constructed from sequence databases that answer only a class of count queries. We analyze the query complexity (a measure of the number of queries needed) for constructing classifiers in such settings and outline some techniques to minimize the query complexity. We also show how some of the models can be updated in response to addition or deletion of subsets of sequences from the underlying sequence database.
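
As a rough illustration of the idea in the abstract, the sketch below estimates a fixed-order Markov model using nothing but count queries. It is not the authors' implementation: the class `CountQueryDB`, the function `build_markov_model`, the substring-count query form, and the pseudocount smoothing are all assumptions made for this example.

```python
from itertools import product


class CountQueryDB:
    """Hypothetical sequence database that answers only count queries."""

    def __init__(self, sequences):
        self._sequences = list(sequences)  # never exposed to the learner
        self.queries_answered = 0          # simple query-complexity counter

    def count(self, pattern):
        """Return the number of (overlapping) occurrences of pattern."""
        self.queries_answered += 1
        n = len(pattern)
        return sum(
            1
            for seq in self._sequences
            for i in range(len(seq) - n + 1)
            if seq[i:i + n] == pattern
        )


def build_markov_model(db, alphabet, order, pseudocount=1.0):
    """Estimate P(symbol | context) for every context of length `order`,
    touching the data only through db.count()."""
    model = {}
    for context in map("".join, product(alphabet, repeat=order)):
        # One count query per (context, next symbol) pair.
        counts = {a: db.count(context + a) for a in alphabet}
        total = sum(counts.values()) + pseudocount * len(alphabet)
        model[context] = {
            a: (counts[a] + pseudocount) / total for a in alphabet
        }
    return model


db = CountQueryDB(["ACGTAC", "ACGGAC"])
model = build_markov_model(db, "ACGT", order=1)
print(model["A"]["C"], db.queries_answered)
```

In this sketch the number of queries is |alphabet|^(order + 1), which is the kind of quantity a query-complexity analysis would track. Because the estimates depend on the data only through counts, adding or removing sequences could in principle be handled by adding or subtracting the counts for the changed subset, provided raw counts are stored alongside the probabilities.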

Original language: English (US)
Title of host publication: Proceedings - 2010 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2010
Pages: 681-685
Number of pages: 5
DOIs: https://doi.org/10.1109/BIBM.2010.5706652
State: Published - Dec 1 2010
Event: 2010 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2010 - Hong Kong, China
Duration: Dec 18 2010 → Dec 21 2010

Publication series

Name: Proceedings - 2010 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2010

Other

Other: 2010 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2010
Country: China
City: Hong Kong
Period: 12/18/10 → 12/21/10

Fingerprint

Learning algorithms
Classifiers
Databases
Sequence Deletion
Statistical Models
Systems Analysis
Learning systems
Learning
Bandwidth
Growth
Datasets
Machine Learning
benzoylprop-ethyl

All Science Journal Classification (ASJC) codes

  • Biomedical Engineering
  • Health Informatics

Cite this

Koul, N., Bui, N., & Honavar, V. (2010). Scalable, updatable predictive models for sequence data. In Proceedings - 2010 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2010 (pp. 681-685). [5706652] (Proceedings - 2010 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2010). https://doi.org/10.1109/BIBM.2010.5706652
Koul, Neeraj ; Bui, Ngot ; Honavar, Vasant. / Scalable, updatable predictive models for sequence data. Proceedings - 2010 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2010. 2010. pp. 681-685 (Proceedings - 2010 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2010).
@inproceedings{946df141b7f54e6bbc3cfd9b5a741d01,
title = "Scalable, updatable predictive models for sequence data",
abstract = "The emergence of data-rich domains has led to an exponential growth in the size and number of data repositories, offering exciting opportunities to learn from the data using machine learning algorithms. In particular, sequence data is being made available at a rapid rate. In many applications, the learning algorithm may not have direct access to the entire dataset for a variety of reasons, such as massive data size or bandwidth limitations. In such settings, there is a need for techniques that can learn predictive models (e.g., classifiers) from large datasets without direct access to the data. We describe an approach to learning from massive sequence datasets using statistical queries. Specifically, we show how Markov Models and Probabilistic Suffix Trees (PSTs) can be constructed from sequence databases that answer only a class of count queries. We analyze the query complexity (a measure of the number of queries needed) for constructing classifiers in such settings and outline some techniques to minimize the query complexity. We also show how some of the models can be updated in response to addition or deletion of subsets of sequences from the underlying sequence database.",
author = "Neeraj Koul and Ngot Bui and Vasant Honavar",
year = "2010",
month = "12",
day = "1",
doi = "10.1109/BIBM.2010.5706652",
language = "English (US)",
isbn = "9781424483075",
series = "Proceedings - 2010 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2010",
pages = "681--685",
booktitle = "Proceedings - 2010 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2010",

}

Koul, N, Bui, N & Honavar, V 2010, Scalable, updatable predictive models for sequence data. in Proceedings - 2010 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2010., 5706652, Proceedings - 2010 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2010, pp. 681-685, 2010 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2010, Hong Kong, China, 12/18/10. https://doi.org/10.1109/BIBM.2010.5706652

Scalable, updatable predictive models for sequence data. / Koul, Neeraj; Bui, Ngot; Honavar, Vasant.

Proceedings - 2010 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2010. 2010. p. 681-685 5706652 (Proceedings - 2010 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2010).

TY - GEN
T1 - Scalable, updatable predictive models for sequence data
AU - Koul, Neeraj
AU - Bui, Ngot
AU - Honavar, Vasant
PY - 2010/12/1
Y1 - 2010/12/1
N2 - The emergence of data-rich domains has led to an exponential growth in the size and number of data repositories, offering exciting opportunities to learn from the data using machine learning algorithms. In particular, sequence data is being made available at a rapid rate. In many applications, the learning algorithm may not have direct access to the entire dataset for a variety of reasons, such as massive data size or bandwidth limitations. In such settings, there is a need for techniques that can learn predictive models (e.g., classifiers) from large datasets without direct access to the data. We describe an approach to learning from massive sequence datasets using statistical queries. Specifically, we show how Markov Models and Probabilistic Suffix Trees (PSTs) can be constructed from sequence databases that answer only a class of count queries. We analyze the query complexity (a measure of the number of queries needed) for constructing classifiers in such settings and outline some techniques to minimize the query complexity. We also show how some of the models can be updated in response to addition or deletion of subsets of sequences from the underlying sequence database.
UR - http://www.scopus.com/inward/record.url?scp=79952430490&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79952430490&partnerID=8YFLogxK
U2 - 10.1109/BIBM.2010.5706652
DO - 10.1109/BIBM.2010.5706652
M3 - Conference contribution
AN - SCOPUS:79952430490
SN - 9781424483075
T3 - Proceedings - 2010 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2010
SP - 681
EP - 685
BT - Proceedings - 2010 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2010
ER -

Koul N, Bui N, Honavar V. Scalable, updatable predictive models for sequence data. In Proceedings - 2010 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2010. 2010. p. 681-685. 5706652. (Proceedings - 2010 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2010). https://doi.org/10.1109/BIBM.2010.5706652