TY - JOUR
T1 - Bayesian large-scale multiple regression with summary statistics from genome-wide association studies
AU - Zhu, Xiang
AU - Stephens, Matthew
N1 - Funding Information:
Received March 2016; revised April 2017. Supported by the Grant GBMF #4559 from the Gordon and Betty Moore Foundation and the NIH Grant HG02585. Key words and phrases. Summary statistics, Bayesian regression, genome wide, association study, multiple-SNP analysis, variable selection, heritability, explained variation, Markov chain Monte Carlo.
Acknowledgments. We thank the Editor, Associate Editor and two anonymous referees for their constructive comments. We thank Xin He, Rina Foygel Barber, Peter Carbonetto, Yongtao Guan, Xiaoquan Wen and Xiang Zhou for helpful discussions. We thank Raman Shah and John Zekos for technical support. This study makes use of data generated by the Wellcome Trust Case Control Consortium. A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk. Data on adult human height have been contributed by investigators of the Genetic Investigation of Anthropometric Traits (GIANT) consortium. This work was completed in part with resources provided by the University of Chicago Research Computing Center.
Publisher Copyright:
© Institute of Mathematical Statistics, 2017.
PY - 2017/9
Y1 - 2017/9
AB - Bayesian methods for large-scale multiple regression provide attractive approaches to the analysis of genome-wide association studies (GWAS). For example, they can estimate heritability of complex traits, allowing for both polygenic and sparse models; and by incorporating external genomic data into the priors, they can increase power and yield new biological insights. However, these methods require access to individual genotypes and phenotypes, which are often not easily available. Here we provide a framework for performing these analyses without individual-level data. Specifically, we introduce a “Regression with Summary Statistics” (RSS) likelihood, which relates the multiple regression coefficients to univariate regression results that are often easily available. The RSS likelihood requires estimates of correlations among covariates (SNPs), which also can be obtained from public databases. We perform Bayesian multiple regression analysis by combining the RSS likelihood with previously proposed prior distributions, sampling posteriors by Markov chain Monte Carlo. In a wide range of simulations RSS performs similarly to analyses using the individual data, both for estimating heritability and detecting associations. We apply RSS to a GWAS of human height that contains 253,288 individuals typed at 1.06 million SNPs, for which analyses of individual-level data are practically impossible. Estimates of heritability (52%) are consistent with, but more precise than, previous results using subsets of these data. We also identify many previously unreported loci that show evidence for association with height in our analyses. Software is available at https://github.com/stephenslab/rss.
UR - http://www.scopus.com/inward/record.url?scp=85031498677&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85031498677&partnerID=8YFLogxK
U2 - 10.1214/17-AOAS1046
DO - 10.1214/17-AOAS1046
M3 - Article
AN - SCOPUS:85031498677
VL - 11
SP - 1561
EP - 1592
JO - Annals of Applied Statistics
JF - Annals of Applied Statistics
SN - 1932-6157
IS - 3
ER -