Beyond the E-Value: Stratified Statistics for Protein Domain Prediction

Alejandro Ochoa, John D. Storey, Manuel Llinás, Mona Singh

Research output: Contribution to journalArticle

6 Citations (Scopus)

Abstract

E-values have been the dominant statistic for protein sequence analysis for the past two decades: from identifying statistically significant local sequence alignments to evaluating matches to hidden Markov models describing protein domain families. Here we formally show that for “stratified” multiple hypothesis testing problems—that is, those in which statistical tests can be partitioned naturally—controlling the local False Discovery Rate (lFDR) per stratum, or partition, yields the most predictions across the data at any given threshold on the FDR or E-value over all strata combined. For the important problem of protein domain prediction, a key step in characterizing protein structure, function and evolution, we show that stratifying statistical tests by domain family yields excellent results. We develop the first FDR-estimating algorithms for domain prediction, and evaluate how well thresholds based on q-values, E-values and lFDRs perform in domain prediction using five complementary approaches for estimating empirical FDRs in this context. We show that stratified q-value thresholds substantially outperform E-values. Contradicting our theoretical results, q-values also outperform lFDRs; however, our tests reveal a small but coherent subset of domain families, biased towards models for specific repetitive patterns, for which weaknesses in random sequence models yield notably inaccurate statistical significance measures. Usage of lFDR thresholds outperform q-values for the remaining families, which have as-expected noise, suggesting that further improvements in domain predictions can be achieved with improved modeling of random sequences. Overall, our theoretical and empirical findings suggest that the use of stratified q-values and lFDRs could result in improvements in a host of structured multiple hypothesis testing problems arising in bioinformatics, including genome-wide association studies, orthology prediction, and motif scanning.

Original languageEnglish (US)
Article numbere1004509
JournalPLoS computational biology
Volume11
Issue number11
DOIs
StatePublished - Nov 2015

Fingerprint

statistics
Statistics
Proteins
Protein
protein
prediction
Sequence Alignment
Prediction
Genome-Wide Association Study
Protein Sequence Analysis
Computational Biology
Noise
proteins
Multiple Hypothesis Testing
Statistical tests
hypothesis testing
Random Sequence
Statistical test
statistical analysis
bioinformatics

All Science Journal Classification (ASJC) codes

  • Ecology, Evolution, Behavior and Systematics
  • Modeling and Simulation
  • Ecology
  • Molecular Biology
  • Genetics
  • Cellular and Molecular Neuroscience
  • Computational Theory and Mathematics

Cite this

Ochoa, Alejandro ; Storey, John D. ; Llinás, Manuel ; Singh, Mona. / Beyond the E-Value : Stratified Statistics for Protein Domain Prediction. In: PLoS computational biology. 2015 ; Vol. 11, No. 11.
@article{778d6a885bed46988a0eb11f4eeed03c,
title = "Beyond the E-Value: Stratified Statistics for Protein Domain Prediction",
abstract = "E-values have been the dominant statistic for protein sequence analysis for the past two decades: from identifying statistically significant local sequence alignments to evaluating matches to hidden Markov models describing protein domain families. Here we formally show that for “stratified” multiple hypothesis testing problems—that is, those in which statistical tests can be partitioned naturally—controlling the local False Discovery Rate (lFDR) per stratum, or partition, yields the most predictions across the data at any given threshold on the FDR or E-value over all strata combined. For the important problem of protein domain prediction, a key step in characterizing protein structure, function and evolution, we show that stratifying statistical tests by domain family yields excellent results. We develop the first FDR-estimating algorithms for domain prediction, and evaluate how well thresholds based on q-values, E-values and lFDRs perform in domain prediction using five complementary approaches for estimating empirical FDRs in this context. We show that stratified q-value thresholds substantially outperform E-values. Contradicting our theoretical results, q-values also outperform lFDRs; however, our tests reveal a small but coherent subset of domain families, biased towards models for specific repetitive patterns, for which weaknesses in random sequence models yield notably inaccurate statistical significance measures. Usage of lFDR thresholds outperform q-values for the remaining families, which have as-expected noise, suggesting that further improvements in domain predictions can be achieved with improved modeling of random sequences. Overall, our theoretical and empirical findings suggest that the use of stratified q-values and lFDRs could result in improvements in a host of structured multiple hypothesis testing problems arising in bioinformatics, including genome-wide association studies, orthology prediction, and motif scanning.",
author = "Alejandro Ochoa and Storey, {John D.} and Manuel Llin{\'a}s and Mona Singh",
year = "2015",
month = "11",
doi = "10.1371/journal.pcbi.1004509",
language = "English (US)",
volume = "11",
journal = "PLoS Computational Biology",
issn = "1553-734X",
publisher = "Public Library of Science",
number = "11",

}

Beyond the E-Value : Stratified Statistics for Protein Domain Prediction. / Ochoa, Alejandro; Storey, John D.; Llinás, Manuel; Singh, Mona.

In: PLoS computational biology, Vol. 11, No. 11, e1004509, 11.2015.

Research output: Contribution to journalArticle

TY - JOUR

T1 - Beyond the E-Value

T2 - Stratified Statistics for Protein Domain Prediction

AU - Ochoa, Alejandro

AU - Storey, John D.

AU - Llinás, Manuel

AU - Singh, Mona

PY - 2015/11

Y1 - 2015/11

N2 - E-values have been the dominant statistic for protein sequence analysis for the past two decades: from identifying statistically significant local sequence alignments to evaluating matches to hidden Markov models describing protein domain families. Here we formally show that for “stratified” multiple hypothesis testing problems—that is, those in which statistical tests can be partitioned naturally—controlling the local False Discovery Rate (lFDR) per stratum, or partition, yields the most predictions across the data at any given threshold on the FDR or E-value over all strata combined. For the important problem of protein domain prediction, a key step in characterizing protein structure, function and evolution, we show that stratifying statistical tests by domain family yields excellent results. We develop the first FDR-estimating algorithms for domain prediction, and evaluate how well thresholds based on q-values, E-values and lFDRs perform in domain prediction using five complementary approaches for estimating empirical FDRs in this context. We show that stratified q-value thresholds substantially outperform E-values. Contradicting our theoretical results, q-values also outperform lFDRs; however, our tests reveal a small but coherent subset of domain families, biased towards models for specific repetitive patterns, for which weaknesses in random sequence models yield notably inaccurate statistical significance measures. Usage of lFDR thresholds outperform q-values for the remaining families, which have as-expected noise, suggesting that further improvements in domain predictions can be achieved with improved modeling of random sequences. Overall, our theoretical and empirical findings suggest that the use of stratified q-values and lFDRs could result in improvements in a host of structured multiple hypothesis testing problems arising in bioinformatics, including genome-wide association studies, orthology prediction, and motif scanning.

AB - E-values have been the dominant statistic for protein sequence analysis for the past two decades: from identifying statistically significant local sequence alignments to evaluating matches to hidden Markov models describing protein domain families. Here we formally show that for “stratified” multiple hypothesis testing problems—that is, those in which statistical tests can be partitioned naturally—controlling the local False Discovery Rate (lFDR) per stratum, or partition, yields the most predictions across the data at any given threshold on the FDR or E-value over all strata combined. For the important problem of protein domain prediction, a key step in characterizing protein structure, function and evolution, we show that stratifying statistical tests by domain family yields excellent results. We develop the first FDR-estimating algorithms for domain prediction, and evaluate how well thresholds based on q-values, E-values and lFDRs perform in domain prediction using five complementary approaches for estimating empirical FDRs in this context. We show that stratified q-value thresholds substantially outperform E-values. Contradicting our theoretical results, q-values also outperform lFDRs; however, our tests reveal a small but coherent subset of domain families, biased towards models for specific repetitive patterns, for which weaknesses in random sequence models yield notably inaccurate statistical significance measures. Usage of lFDR thresholds outperform q-values for the remaining families, which have as-expected noise, suggesting that further improvements in domain predictions can be achieved with improved modeling of random sequences. Overall, our theoretical and empirical findings suggest that the use of stratified q-values and lFDRs could result in improvements in a host of structured multiple hypothesis testing problems arising in bioinformatics, including genome-wide association studies, orthology prediction, and motif scanning.

UR - http://www.scopus.com/inward/record.url?scp=84949239520&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84949239520&partnerID=8YFLogxK

U2 - 10.1371/journal.pcbi.1004509

DO - 10.1371/journal.pcbi.1004509

M3 - Article

C2 - 26575353

AN - SCOPUS:84949239520

VL - 11

JO - PLoS Computational Biology

JF - PLoS Computational Biology

SN - 1553-734X

IS - 11

M1 - e1004509

ER -