ESPERR: Learning strong and weak signals in genomic sequence alignments to identify functional elements

James Taylor, Svitlana Tyekucheva, David C. King, Ross C. Hardison, Webb Miller, Francesca Chiaromonte

Research output: Contribution to journal › Article

94 Citations (Scopus)

Abstract

Genomic sequence signals - such as base composition, presence of particular motifs, or evolutionary constraint - have been used effectively to identify functional elements. However, approaches based only on specific signals known to correlate with function can be quite limiting. When training data are available, application of computational learning algorithms to multispecies alignments has the potential to capture broader and more informative sequence and evolutionary patterns that better characterize a class of elements. However, effective exploitation of patterns in multispecies alignments is impeded by the vast number of possible alignment columns and by a limited understanding of which particular strings of columns may characterize a given class. We have developed a computational method, called ESPERR (evolutionary and sequence pattern extraction through reduced representations), which uses training examples to learn encodings of multispecies alignments into reduced forms tailored for the prediction of chosen classes of functional elements. ESPERR produces a greatly improved Regulatory Potential score, which can discriminate regulatory regions from neutral sites with excellent accuracy (∼94%). This score captures strong signals (GC content and conservation), as well as subtler signals (with small contributions from many different alignment patterns) that characterize the regulatory elements in our training set. ESPERR is also effective for predicting other classes of functional elements, as we show for DNaseI hypersensitive sites and highly conserved regions with developmental enhancer activity. Our software, training data, and genome-wide predictions are available from our Web site (http://www.bx.psu.edu/projects/esperr).
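
The idea of encoding alignment columns into a reduced form and scoring the result can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ESPERR implementation: the three-symbol encoding (`S`, `W`, `V`) is fixed by hand here, whereas the abstract describes learning the encoding from training examples, and the use of a log-odds ratio between two first-order Markov models is an assumption that the abstract does not specify. All function names are hypothetical.

```python
import math

# Hypothetical reduced alphabet (an assumption for illustration only):
#   S = column conserved as G or C, W = column conserved as A or T,
#   V = anything else (mismatch or gap in at least one species).
ALPHABET = "SWV"


def encode_column(column):
    """Map one alignment column (one base per species) to a reduced symbol."""
    ref = column[0]
    if ref in "ACGT" and all(base == ref for base in column):
        return "S" if ref in "GC" else "W"
    return "V"


def encode_alignment(rows):
    """Encode an alignment given as equal-length rows, one row per species."""
    return "".join(encode_column(col) for col in zip(*rows))


def train_markov(sequences, pseudocount=1.0):
    """Estimate first-order transition probabilities with additive smoothing."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    probs = {}
    for prev, row in counts.items():
        total = sum(row.values())
        probs[prev] = {cur: n / total for cur, n in row.items()}
    return probs


def log_odds_score(seq, pos_model, neg_model):
    """Mean per-transition log-odds of the positive vs. the negative model."""
    if len(seq) < 2:
        return 0.0
    total = sum(
        math.log(pos_model[prev][cur] / neg_model[prev][cur])
        for prev, cur in zip(seq, seq[1:])
    )
    return total / (len(seq) - 1)


# Toy data, invented for this example: three aligned species.
encoded = encode_alignment(["ACGT-", "ACGTA", "ACTTA"])  # "WSVWV"

# Toy training sets of already-encoded sequences (also invented).
pos_model = train_markov(["SSSSWS", "SSWSSS"])
neg_model = train_markov(["VVVVWV", "VWVVVV"])
print(encoded, log_odds_score(encoded, pos_model, neg_model))
```

A positive score means the encoded window looks more like the toy positive set than the toy negative set. The real method's encodings, model order, and training data differ from this sketch.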

Original language: English (US)
Pages (from-to): 1596-1604
Number of pages: 9
Journal: Genome Research
Volume: 16
Issue number: 12
DOI: 10.1101/gr.4537706
State: Published - Dec 1 2006

Fingerprint

Sequence Alignment
Base Composition
Learning
Nucleic Acid Regulatory Sequences
Protein Sorting Signals
Software
Genome

All Science Journal Classification (ASJC) codes

  • Genetics
  • Genetics (clinical)

Cite this

Taylor, James; Tyekucheva, Svitlana; King, David C.; Hardison, Ross C.; Miller, Webb; Chiaromonte, Francesca. / ESPERR: Learning strong and weak signals in genomic sequence alignments to identify functional elements. In: Genome Research. 2006; Vol. 16, No. 12. pp. 1596-1604.
@article{37c82fa64c274be59fea29a2ccdb1e88,
title = "ESPERR: Learning strong and weak signals in genomic sequence alignments to identify functional elements",
abstract = "Genomic sequence signals - such as base composition, presence of particular motifs, or evolutionary constraint - have been used effectively to identify functional elements. However, approaches based only on specific signals known to correlate with function can be quite limiting. When training data are available, application of computational learning algorithms to multispecies alignments has the potential to capture broader and more informative sequence and evolutionary patterns that better characterize a class of elements. However, effective exploitation of patterns in multispecies alignments is impeded by the vast number of possible alignment columns and by a limited understanding of which particular strings of columns may characterize a given class. We have developed a computational method, called ESPERR (evolutionary and sequence pattern extraction through reduced representations), which uses training examples to learn encodings of multispecies alignments into reduced forms tailored for the prediction of chosen classes of functional elements. ESPERR produces a greatly improved Regulatory Potential score, which can discriminate regulatory regions from neutral sites with excellent accuracy (∼94{\%}). This score captures strong signals (GC content and conservation), as well as subtler signals (with small contributions from many different alignment patterns) that characterize the regulatory elements in our training set. ESPERR is also effective for predicting other classes of functional elements, as we show for DNaseI hypersensitive sites and highly conserved regions with developmental enhancer activity. Our software, training data, and genome-wide predictions are available from our Web site (http://www.bx.psu.edu/projects/esperr).",
author = "Taylor, James and Tyekucheva, Svitlana and King, {David C.} and Hardison, {Ross C.} and Miller, Webb and Chiaromonte, Francesca",
year = "2006",
month = "12",
day = "1",
doi = "10.1101/gr.4537706",
language = "English (US)",
volume = "16",
pages = "1596--1604",
journal = "Genome Research",
issn = "1088-9051",
publisher = "Cold Spring Harbor Laboratory Press",
number = "12",

}

ESPERR: Learning strong and weak signals in genomic sequence alignments to identify functional elements. / Taylor, James; Tyekucheva, Svitlana; King, David C.; Hardison, Ross C.; Miller, Webb; Chiaromonte, Francesca.

In: Genome Research, Vol. 16, No. 12, 01.12.2006, p. 1596-1604.

TY - JOUR

T1 - ESPERR

T2 - Learning strong and weak signals in genomic sequence alignments to identify functional elements

AU - Taylor, James

AU - Tyekucheva, Svitlana

AU - King, David C.

AU - Hardison, Ross C.

AU - Miller, Webb

AU - Chiaromonte, Francesca

PY - 2006/12/1

Y1 - 2006/12/1

N2 - Genomic sequence signals - such as base composition, presence of particular motifs, or evolutionary constraint - have been used effectively to identify functional elements. However, approaches based only on specific signals known to correlate with function can be quite limiting. When training data are available, application of computational learning algorithms to multispecies alignments has the potential to capture broader and more informative sequence and evolutionary patterns that better characterize a class of elements. However, effective exploitation of patterns in multispecies alignments is impeded by the vast number of possible alignment columns and by a limited understanding of which particular strings of columns may characterize a given class. We have developed a computational method, called ESPERR (evolutionary and sequence pattern extraction through reduced representations), which uses training examples to learn encodings of multispecies alignments into reduced forms tailored for the prediction of chosen classes of functional elements. ESPERR produces a greatly improved Regulatory Potential score, which can discriminate regulatory regions from neutral sites with excellent accuracy (∼94%). This score captures strong signals (GC content and conservation), as well as subtler signals (with small contributions from many different alignment patterns) that characterize the regulatory elements in our training set. ESPERR is also effective for predicting other classes of functional elements, as we show for DNaseI hypersensitive sites and highly conserved regions with developmental enhancer activity. Our software, training data, and genome-wide predictions are available from our Web site (http://www.bx.psu.edu/projects/esperr).

UR - http://www.scopus.com/inward/record.url?scp=33845303175&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33845303175&partnerID=8YFLogxK

U2 - 10.1101/gr.4537706

DO - 10.1101/gr.4537706

M3 - Article

C2 - 17053093

AN - SCOPUS:33845303175

VL - 16

SP - 1596

EP - 1604

JO - Genome Research

JF - Genome Research

SN - 1088-9051

IS - 12

ER -