The origin, evolution, and functional impact of short insertion-deletion variants identified in 179 human genomes

Stephen B. Montgomery, David L. Goode, Erika Kvikstad, Cornelis A. Albers, Zhengdong D. Zhang, Xinmeng Jasmine Mu, Guruprasad Ananda, Bryan Howie, Konrad J. Karczewski, Kevin S. Smith, Vanessa Anaya, Rhea Richardson, Joe Davis, Daniel G. MacArthur, Arend Sidow, Laurent Duret, Mark Gerstein, Kateryna Dmytrivna Makova, Jonathan Marchini, Gil McVean, Gerton Lunter

Research output: Contribution to journal, Article

102 Citations (Scopus)

Abstract

Short insertions and deletions (indels) are the second most abundant form of human genetic variation, but our understanding of their origins and functional effects lags behind that of other types of variants. Using population-scale sequencing, we have identified a high-quality set of 1.6 million indels from 179 individuals representing three diverse human populations. We show that rates of indel mutagenesis are highly heterogeneous, with 43%-48% of indels occurring in 4.03% of the genome, whereas in the remaining 96% their prevalence is 16 times lower than SNPs. Polymerase slippage can explain upwards of three-fourths of all indels, with the remainder being mostly simple deletions in complex sequence. However, insertions do occur and are significantly associated with pseudo-palindromic sequence features compatible with the fork stalling and template switching (FoSTeS) mechanism more commonly associated with large structural variations. We introduce a quantitative model of polymerase slippage, which enables us to identify indel-hypermutagenic protein-coding genes, some of which are associated with recurrent mutations leading to disease. Accounting for mutational rate heterogeneity due to sequence context, we find that indels across functional sequence are generally subject to stronger purifying selection than SNPs. We find that indel length modulates selection strength, and that indels affecting multiple functionally constrained nucleotides undergo stronger purifying selection. We further find that indels are enriched in associations with gene expression and find evidence for a contribution of nonsense-mediated decay. Finally, we show that indels can be integrated in existing genome-wide association studies (GWAS); although we do not find direct evidence that potentially causal protein-coding indels are enriched with associations to known disease-associated SNPs, our findings suggest that the causal variant underlying some of these associations may be indels.

Original language: English (US)
Pages (from-to): 749-761
Number of pages: 13
Journal: Genome Research
Volume: 23
Issue number: 5
DOI: https://doi.org/10.1101/gr.148718.112
State: Published - May 1 2013

Fingerprint

Human Genome
Single Nucleotide Polymorphism
Genome-Wide Association Study
Medical Genetics
Mutagenesis
Population
Proteins
Nucleotides
Genome
Gene Expression
Mutation

All Science Journal Classification (ASJC) codes

  • Genetics
  • Genetics (clinical)

Cite this

Montgomery, S. B., Goode, D. L., Kvikstad, E., Albers, C. A., Zhang, Z. D., Mu, X. J., ... Lunter, G. (2013). The origin, evolution, and functional impact of short insertion-deletion variants identified in 179 human genomes. Genome research, 23(5), 749-761. https://doi.org/10.1101/gr.148718.112
@article{1df994c3ff494c459cb6193c76f45950,
title = "The origin, evolution, and functional impact of short insertion-deletion variants identified in 179 human genomes",
abstract = "Short insertions and deletions (indels) are the second most abundant form of human genetic variation, but our understanding of their origins and functional effects lags behind that of other types of variants. Using population-scale sequencing, we have identified a high-quality set of 1.6 million indels from 179 individuals representing three diverse human populations. We show that rates of indel mutagenesis are highly heterogeneous, with 43{\%}-48{\%} of indels occurring in 4.03{\%} of the genome, whereas in the remaining 96{\%} their prevalence is 16 times lower than SNPs. Polymerase slippage can explain upwards of three-fourths of all indels, with the remainder being mostly simple deletions in complex sequence. However, insertions do occur and are significantly associated with pseudo-palindromic sequence features compatible with the fork stalling and template switching (FoSTeS) mechanism more commonly associated with large structural variations. We introduce a quantitative model of polymerase slippage, which enables us to identify indel-hypermutagenic protein-coding genes, some of which are associated with recurrent mutations leading to disease. Accounting for mutational rate heterogeneity due to sequence context, we find that indels across functional sequence are generally subject to stronger purifying selection than SNPs. We find that indel length modulates selection strength, and that indels affecting multiple functionally constrained nucleotides undergo stronger purifying selection. We further find that indels are enriched in associations with gene expression and find evidence for a contribution of nonsense-mediated decay. 
Finally, we show that indels can be integrated in existing genome-wide association studies (GWAS); although we do not find direct evidence that potentially causal protein-coding indels are enriched with associations to known disease-associated SNPs, our findings suggest that the causal variant underlying some of these associations may be indels.",
author = "Montgomery, {Stephen B.} and Goode, {David L.} and Erika Kvikstad and Albers, {Cornelis A.} and Zhang, {Zhengdong D.} and Mu, {Xinmeng Jasmine} and Guruprasad Ananda and Bryan Howie and Karczewski, {Konrad J.} and Smith, {Kevin S.} and Vanessa Anaya and Rhea Richardson and Joe Davis and MacArthur, {Daniel G.} and Arend Sidow and Laurent Duret and Mark Gerstein and Makova, {Kateryna Dmytrivna} and Jonathan Marchini and Gil McVean and Gerton Lunter",
year = "2013",
month = "5",
day = "1",
doi = "10.1101/gr.148718.112",
language = "English (US)",
volume = "23",
pages = "749--761",
journal = "Genome Research",
issn = "1088-9051",
publisher = "Cold Spring Harbor Laboratory Press",
number = "5",

}

TY - JOUR

T1 - The origin, evolution, and functional impact of short insertion-deletion variants identified in 179 human genomes

AU - Montgomery, Stephen B.

AU - Goode, David L.

AU - Kvikstad, Erika

AU - Albers, Cornelis A.

AU - Zhang, Zhengdong D.

AU - Mu, Xinmeng Jasmine

AU - Ananda, Guruprasad

AU - Howie, Bryan

AU - Karczewski, Konrad J.

AU - Smith, Kevin S.

AU - Anaya, Vanessa

AU - Richardson, Rhea

AU - Davis, Joe

AU - MacArthur, Daniel G.

AU - Sidow, Arend

AU - Duret, Laurent

AU - Gerstein, Mark

AU - Makova, Kateryna Dmytrivna

AU - Marchini, Jonathan

AU - McVean, Gil

AU - Lunter, Gerton

PY - 2013/5/1

Y1 - 2013/5/1

N2 - Short insertions and deletions (indels) are the second most abundant form of human genetic variation, but our understanding of their origins and functional effects lags behind that of other types of variants. Using population-scale sequencing, we have identified a high-quality set of 1.6 million indels from 179 individuals representing three diverse human populations. We show that rates of indel mutagenesis are highly heterogeneous, with 43%-48% of indels occurring in 4.03% of the genome, whereas in the remaining 96% their prevalence is 16 times lower than SNPs. Polymerase slippage can explain upwards of three-fourths of all indels, with the remainder being mostly simple deletions in complex sequence. However, insertions do occur and are significantly associated with pseudo-palindromic sequence features compatible with the fork stalling and template switching (FoSTeS) mechanism more commonly associated with large structural variations. We introduce a quantitative model of polymerase slippage, which enables us to identify indel-hypermutagenic protein-coding genes, some of which are associated with recurrent mutations leading to disease. Accounting for mutational rate heterogeneity due to sequence context, we find that indels across functional sequence are generally subject to stronger purifying selection than SNPs. We find that indel length modulates selection strength, and that indels affecting multiple functionally constrained nucleotides undergo stronger purifying selection. We further find that indels are enriched in associations with gene expression and find evidence for a contribution of nonsense-mediated decay. Finally, we show that indels can be integrated in existing genome-wide association studies (GWAS); although we do not find direct evidence that potentially causal protein-coding indels are enriched with associations to known disease-associated SNPs, our findings suggest that the causal variant underlying some of these associations may be indels.

UR - http://www.scopus.com/inward/record.url?scp=84876523427&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84876523427&partnerID=8YFLogxK

U2 - 10.1101/gr.148718.112

DO - 10.1101/gr.148718.112

M3 - Article

C2 - 23478400

AN - SCOPUS:84876523427

VL - 23

SP - 749

EP - 761

JO - Genome Research

JF - Genome Research

SN - 1088-9051

IS - 5

ER -