Imputation and quality control steps for combining multiple genome-wide datasets

Shefali S. Verma, Mariza de Andrade, Gerard Tromp, Helena Kuivaniemi, Elizabeth Pugh, Bahram Namjou-Khales, Shubhabrata Mukherjee, Gail P. Jarvik, Leah C. Kottyan, Amber Burt, Yuki Bradford, Gretta D. Armstrong, Kimberly Derr, Dana C. Crawford, Jonathan L. Haines, Rongling Li, David Crosslin, Marylyn Deriggi Ritchie

Research output: Contribution to journal (Article)

47 Citations (Scopus)

Abstract

The electronic MEdical Records and GEnomics (eMERGE) network brings together DNA biobanks linked to electronic health records (EHRs) from multiple institutions. Approximately 51,000 DNA samples from distinct individuals have been genotyped using genome-wide SNP arrays across the nine sites of the network. The eMERGE Coordinating Center and the Genomics Workgroup developed a pipeline to impute and merge genomic data across the different SNP arrays to maximize sample size and power to detect associations with a variety of clinical endpoints. The 1000 Genomes cosmopolitan reference panel was used for imputation. Imputation results were evaluated using the following metrics: accuracy of imputation, allelic R² (estimated correlation between the imputed and true genotypes), and the relationship between allelic R² and minor allele frequency. Computation time and memory resources required by two different software packages (BEAGLE and IMPUTE2) were also evaluated. A number of challenges were encountered due to the complexity of using two different imputation software packages, multiple ancestral populations, and many different genotyping platforms. We present lessons learned and describe the pipeline implemented here to impute and merge genomic data sets. The eMERGE imputed dataset will serve as a valuable resource for discovery, leveraging the clinical data that can be mined from the EHR.
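The evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes that true genotypes are known for a set of markers (for example, markers masked before imputation) and computes a squared Pearson correlation directly against them. This is an illustration only, not the eMERGE pipeline code: the function names and the toy data are hypothetical, and the imputation packages report their own estimated quality scores rather than this direct comparison.

```python
from statistics import mean


def minor_allele_frequency(genotypes):
    """Return the minor allele frequency for one biallelic marker.

    `genotypes` holds per-sample counts (0, 1 or 2) of one designated allele.
    """
    allele_freq = sum(genotypes) / (2 * len(genotypes))
    return min(allele_freq, 1 - allele_freq)


def squared_correlation(imputed_dosages, true_genotypes):
    """Squared Pearson correlation between imputed dosages and true genotypes."""
    mean_imp = mean(imputed_dosages)
    mean_true = mean(true_genotypes)
    cov = sum((x - mean_imp) * (y - mean_true)
              for x, y in zip(imputed_dosages, true_genotypes))
    var_imp = sum((x - mean_imp) ** 2 for x in imputed_dosages)
    var_true = sum((y - mean_true) ** 2 for y in true_genotypes)
    if var_imp == 0 or var_true == 0:
        # Correlation is undefined for a monomorphic marker.
        return float("nan")
    return cov * cov / (var_imp * var_true)


def concordance(called_genotypes, true_genotypes):
    """Fraction of samples whose called genotype equals the true genotype."""
    matches = sum(1 for c, t in zip(called_genotypes, true_genotypes) if c == t)
    return matches / len(true_genotypes)


if __name__ == "__main__":
    # Toy data for a single marker (hypothetical values).
    true_genotypes = [0, 1, 2, 0, 1, 2, 0, 0]
    imputed_dosages = [0.1, 0.9, 1.8, 0.0, 1.2, 2.0, 0.2, 0.4]
    called_genotypes = [0, 1, 2, 0, 1, 2, 0, 0]
    print("MAF:", minor_allele_frequency(true_genotypes))
    print("r^2:", squared_correlation(imputed_dosages, true_genotypes))
    print("concordance:", concordance(called_genotypes, true_genotypes))
```

Computing both values per marker and grouping markers into frequency bins is one way to examine the relationship between R² and minor allele frequency that the abstract describes.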

Original language: English (US)
Article number: 370
Journal: Frontiers in Genetics
Volume: 5
Issue number: DEC
DOI: https://doi.org/10.3389/fgene.2014.00370
State: Published - Jan 1 2014

Fingerprint

Electronic Health Records
Quality Control
Genomics
Genome
Single Nucleotide Polymorphism
Software
DNA
Gene Frequency
Sample Size
Genotype
Datasets
Population

All Science Journal Classification (ASJC) codes

  • Molecular Medicine
  • Genetics
  • Genetics (clinical)

Cite this

Verma, S. S., de Andrade, M., Tromp, G., Kuivaniemi, H., Pugh, E., Namjou-Khales, B., ... Ritchie, M. D. (2014). Imputation and quality control steps for combining multiple genome-wide datasets. Frontiers in Genetics, 5(DEC), [370]. https://doi.org/10.3389/fgene.2014.00370
@article{2dbab588b38b4a2d996967eed5398378,
title = "Imputation and quality control steps for combining multiple genome-wide datasets",
author = "Verma, {Shefali S.} and {de Andrade}, Mariza and Gerard Tromp and Helena Kuivaniemi and Elizabeth Pugh and Bahram Namjou-Khales and Shubhabrata Mukherjee and Jarvik, {Gail P.} and Kottyan, {Leah C.} and Amber Burt and Yuki Bradford and Armstrong, {Gretta D.} and Kimberly Derr and Crawford, {Dana C.} and Haines, {Jonathan L.} and Rongling Li and David Crosslin and Ritchie, {Marylyn Deriggi}",
year = "2014",
month = "1",
day = "1",
doi = "10.3389/fgene.2014.00370",
language = "English (US)",
volume = "5",
journal = "Frontiers in Genetics",
issn = "1664-8021",
publisher = "Frontiers Media S. A.",
number = "DEC",

}

TY - JOUR

T1 - Imputation and quality control steps for combining multiple genome-wide datasets

AU - Verma, Shefali S.

AU - de Andrade, Mariza

AU - Tromp, Gerard

AU - Kuivaniemi, Helena

AU - Pugh, Elizabeth

AU - Namjou-Khales, Bahram

AU - Mukherjee, Shubhabrata

AU - Jarvik, Gail P.

AU - Kottyan, Leah C.

AU - Burt, Amber

AU - Bradford, Yuki

AU - Armstrong, Gretta D.

AU - Derr, Kimberly

AU - Crawford, Dana C.

AU - Haines, Jonathan L.

AU - Li, Rongling

AU - Crosslin, David

AU - Ritchie, Marylyn Deriggi

PY - 2014/1/1

Y1 - 2014/1/1

UR - http://www.scopus.com/inward/record.url?scp=84917732232&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84917732232&partnerID=8YFLogxK

U2 - 10.3389/fgene.2014.00370

DO - 10.3389/fgene.2014.00370

M3 - Article

C2 - 25566314

AN - SCOPUS:84917732232

VL - 5

JO - Frontiers in Genetics

JF - Frontiers in Genetics

SN - 1664-8021

IS - DEC

M1 - 370

ER -
