SEQSpark

A Complete Analysis Tool for Large-Scale Rare Variant Association Studies Using Whole-Genome and Exome Sequence Data

Di Zhang, Linhai Zhao, Biao Li, Zongxiao He, Gao T. Wang, Dajiang Liu, Suzanne M. Leal

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

Massively parallel sequencing technologies provide great opportunities for discovering rare susceptibility variants involved in complex disease etiology via large-scale imputation and exome and whole-genome sequence-based association studies. Due to modest effect sizes, large sample sizes of tens to hundreds of thousands of individuals are required for adequately powered studies. Current analytical tools are obsolete when it comes to handling these large datasets. To facilitate the analysis of large-scale sequence-based studies, we developed SEQSpark which implements parallel processing based on Spark to increase the speed and efficiency of performing data quality control, annotation, and association analysis. To demonstrate the versatility and speed of SEQSpark, we analyzed whole-genome sequence data from the UK10K, testing for associations with waist-to-hip ratios. The analysis, which was completed in 1.5 hr, included loading data, annotation, principal component analysis, and single variant and rare variant aggregate association analysis of >9 million variants. For rare variant aggregate analysis, an exome-wide significant association (p < 2.5 × 10⁻⁶) was observed with CCDC62 (SKAT-O [p = 6.89 × 10⁻⁷], combined multivariate collapsing [p = 1.48 × 10⁻⁶], and burden of rare variants [p = 1.48 × 10⁻⁶]). SEQSpark was also used to analyze 50,000 simulated exomes and it required 1.75 hr for the analysis of a quantitative trait using several rare variant aggregate association methods. Additionally, the performance of SEQSpark was compared to Variant Association Tools and PLINK/SEQ. SEQSpark was always faster and in some situations computation was reduced to a hundredth of the time. SEQSpark will empower large sequence-based epidemiological studies to quickly elucidate genetic variation involved in the etiology of complex traits.
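
The exome-wide threshold quoted in the abstract matches a common convention: a Bonferroni correction of a 0.05 significance level over roughly 20,000 gene-based tests. The abstract does not say how the threshold was derived, so the gene count below is an assumption, and the function and variable names are illustrative only. This sketch compares the p-values reported for CCDC62 against that threshold.

```python
# Minimal sketch: Bonferroni-style exome-wide threshold.
# ASSUMPTION: about 20,000 gene-based tests; the abstract only quotes
# the resulting threshold (2.5e-6), not how it was obtained.
ALPHA = 0.05
N_GENE_TESTS = 20_000  # assumed round number of genes tested


def exome_wide_threshold(alpha: float = ALPHA, n_tests: int = N_GENE_TESTS) -> float:
    """Return the per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


# p-values for CCDC62 as reported in the abstract
reported_p = {
    "SKAT-O": 6.89e-7,
    "combined multivariate collapsing": 1.48e-6,
    "burden of rare variants": 1.48e-6,
}

threshold = exome_wide_threshold()
significant = {method: p < threshold for method, p in reported_p.items()}

for method, passed in significant.items():
    print(f"{method}: p = {reported_p[method]:.2e}, below threshold: {passed}")
```

Under this assumption, all three reported p-values fall below the threshold, consistent with the abstract's description of the association as exome-wide significant.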

Original language: English (US)
Pages (from-to): 115-122
Number of pages: 8
Journal: American Journal of Human Genetics
Volume: 101
Issue number: 1
DOIs: 10.1016/j.ajhg.2017.05.017
State: Published - Jul 6 2017

Fingerprint

Exome
Genome-Wide Association Study
Genome
High-Throughput Nucleotide Sequencing
Waist-Hip Ratio
Principal Component Analysis
Quality Control
Sample Size
Epidemiologic Studies
Technology

All Science Journal Classification (ASJC) codes

  • Genetics
  • Genetics (clinical)

Cite this

Zhang, Di; Zhao, Linhai; Li, Biao; He, Zongxiao; Wang, Gao T.; Liu, Dajiang; Leal, Suzanne M. / SEQSpark: A Complete Analysis Tool for Large-Scale Rare Variant Association Studies Using Whole-Genome and Exome Sequence Data. In: American Journal of Human Genetics. 2017; Vol. 101, No. 1. pp. 115-122.
@article{dd3d2f29e5df44b8b95626c8d1b5a686,
title = "SEQSpark: A Complete Analysis Tool for Large-Scale Rare Variant Association Studies Using Whole-Genome and Exome Sequence Data",
author = "Di Zhang and Linhai Zhao and Biao Li and Zongxiao He and Wang, {Gao T.} and Dajiang Liu and Leal, {Suzanne M.}",
year = "2017",
month = "7",
day = "6",
doi = "10.1016/j.ajhg.2017.05.017",
language = "English (US)",
volume = "101",
pages = "115--122",
journal = "American Journal of Human Genetics",
issn = "0002-9297",
publisher = "Cell Press",
number = "1",

}

TY - JOUR
T1 - SEQSpark
T2 - A Complete Analysis Tool for Large-Scale Rare Variant Association Studies Using Whole-Genome and Exome Sequence Data
AU - Zhang, Di
AU - Zhao, Linhai
AU - Li, Biao
AU - He, Zongxiao
AU - Wang, Gao T.
AU - Liu, Dajiang
AU - Leal, Suzanne M.
PY - 2017/7/6
Y1 - 2017/7/6
UR - http://www.scopus.com/inward/record.url?scp=85021372906&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85021372906&partnerID=8YFLogxK
U2 - 10.1016/j.ajhg.2017.05.017
DO - 10.1016/j.ajhg.2017.05.017
M3 - Article
VL - 101
SP - 115
EP - 122
JO - American Journal of Human Genetics
JF - American Journal of Human Genetics
SN - 0002-9297
IS - 1
ER -