A machine-learning approach for accurate detection of copy number variants from exome sequencing

Vijay Kumar Pounraja, Gopal Jayakar, Matthew Jensen, Neil Kelkar, Santhosh Girirajan

Research output: Contribution to journal › Article

3 Citations (Scopus)

Abstract

Copy number variants (CNVs) are a major cause of several genetic disorders, making their detection an essential component of genetic analysis pipelines. Current methods for detecting CNVs from exome-sequencing data are limited by high false-positive rates and low concordance because of inherent biases of individual algorithms. To overcome these issues, calls generated by two or more algorithms are often intersected using Venn diagram approaches to identify “high-confidence” CNVs. However, this approach is inadequate, because it misses potentially true calls that do not have consensus from multiple callers. Here, we present CN-Learn, a machine-learning framework that integrates calls from multiple CNV detection algorithms and learns to accurately identify true CNVs using caller-specific and genomic features from a small subset of validated CNVs. Using CNVs predicted by four exome-based CNV callers (CANOES, CODEX, XHMM, and CLAMMS) from 503 samples, we demonstrate that CN-Learn identifies true CNVs at higher precision (∼90%) and recall (∼85%) rates while maintaining robust performance even when trained with minimal data (∼30 samples). CN-Learn recovers twice as many CNVs compared to individual callers or Venn diagram-based approaches, with features such as exome capture probe count, caller concordance, and GC content providing the most discriminatory power. In fact, ∼58% of all true CNVs recovered by CN-Learn were either singletons or calls that lacked support from at least one caller. Our study underscores the limitations of current approaches for CNV identification and provides an effective method that yields high-quality CNVs for application in clinical diagnostics.
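
The limitation of Venn diagram-style consensus described in the abstract can be illustrated with a minimal Python sketch. This is not the CN-Learn implementation: the call identifiers, the `CALLS` dictionary, and the helper names `caller_support` and `consensus_calls` are hypothetical, and calls are matched by exact identifier rather than by genomic overlap. It only shows how requiring agreement across callers discards calls with partial or single-caller support.

```python
from collections import Counter

# Hypothetical toy calls: each caller reports a set of CNV identifiers.
# Assumption: calls are matched by exact identifier; real exome pipelines
# compare genomic intervals by overlap instead.
CALLS = {
    "CANOES": {"cnv1", "cnv2", "cnv3"},
    "CODEX": {"cnv1", "cnv2", "cnv4"},
    "XHMM": {"cnv1", "cnv3"},
    "CLAMMS": {"cnv1", "cnv2", "cnv5"},
}


def caller_support(calls):
    """Count how many callers report each CNV (a simple concordance measure)."""
    return Counter(cnv for reported in calls.values() for cnv in reported)


def consensus_calls(calls, min_callers):
    """Keep CNVs reported by at least `min_callers` callers (Venn-style filter)."""
    support = caller_support(calls)
    return {cnv for cnv, n in support.items() if n >= min_callers}


support = caller_support(CALLS)
strict = consensus_calls(CALLS, min_callers=len(CALLS))  # all four callers agree
singletons = consensus_calls(CALLS, 1) - consensus_calls(CALLS, 2)
discarded = set(support) - strict  # calls a strict intersection would drop

print("strict consensus:", sorted(strict))
print("singletons:", sorted(singletons))
print("discarded by strict consensus:", sorted(discarded))
```

In this toy data only one of five distinct calls survives the strict intersection. A learned classifier, as the abstract describes, would instead score every call using features such as the per-call support count computed here.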

Original language: English (US)
Pages (from-to): 1134-1143
Number of pages: 10
Journal: Genome Research
Volume: 29
Issue number: 7
DOI: 10.1101/gr.245928.118
State: Published - Jan 1 2019

Fingerprint

Exome
Inborn Genetic Diseases
Base Composition
Machine Learning

All Science Journal Classification (ASJC) codes

  • Genetics
  • Genetics (clinical)

Cite this

Pounraja, Vijay Kumar; Jayakar, Gopal; Jensen, Matthew; Kelkar, Neil; Girirajan, Santhosh. / A machine-learning approach for accurate detection of copy number variants from exome sequencing. In: Genome Research. 2019; Vol. 29, No. 7. pp. 1134-1143.
@article{fdb1657740464ae2880b72cea4d59b08,
title = "A machine-learning approach for accurate detection of copy number variants from exome sequencing",
abstract = "Copy number variants (CNVs) are a major cause of several genetic disorders, making their detection an essential component of genetic analysis pipelines. Current methods for detecting CNVs from exome-sequencing data are limited by high false-positive rates and low concordance because of inherent biases of individual algorithms. To overcome these issues, calls generated by two or more algorithms are often intersected using Venn diagram approaches to identify “high-confidence” CNVs. However, this approach is inadequate, because it misses potentially true calls that do not have consensus from multiple callers. Here, we present CN-Learn, a machine-learning framework that integrates calls from multiple CNV detection algorithms and learns to accurately identify true CNVs using caller-specific and genomic features from a small subset of validated CNVs. Using CNVs predicted by four exome-based CNV callers (CANOES, CODEX, XHMM, and CLAMMS) from 503 samples, we demonstrate that CN-Learn identifies true CNVs at higher precision (∼90{\%}) and recall (∼85{\%}) rates while maintaining robust performance even when trained with minimal data (∼30 samples). CN-Learn recovers twice as many CNVs compared to individual callers or Venn diagram-based approaches, with features such as exome capture probe count, caller concordance, and GC content providing the most discriminatory power. In fact, ∼58{\%} of all true CNVs recovered by CN-Learn were either singletons or calls that lacked support from at least one caller. Our study underscores the limitations of current approaches for CNV identification and provides an effective method that yields high-quality CNVs for application in clinical diagnostics.",
author = "Pounraja, Vijay Kumar and Jayakar, Gopal and Jensen, Matthew and Kelkar, Neil and Girirajan, Santhosh",
year = "2019",
month = "1",
day = "1",
doi = "10.1101/gr.245928.118",
language = "English (US)",
volume = "29",
pages = "1134--1143",
journal = "Genome Research",
issn = "1088-9051",
publisher = "Cold Spring Harbor Laboratory Press",
number = "7",

}

TY - JOUR

T1 - A machine-learning approach for accurate detection of copy number variants from exome sequencing

AU - Pounraja, Vijay Kumar

AU - Jayakar, Gopal

AU - Jensen, Matthew

AU - Kelkar, Neil

AU - Girirajan, Santhosh

PY - 2019/1/1

Y1 - 2019/1/1

AB - Copy number variants (CNVs) are a major cause of several genetic disorders, making their detection an essential component of genetic analysis pipelines. Current methods for detecting CNVs from exome-sequencing data are limited by high false-positive rates and low concordance because of inherent biases of individual algorithms. To overcome these issues, calls generated by two or more algorithms are often intersected using Venn diagram approaches to identify “high-confidence” CNVs. However, this approach is inadequate, because it misses potentially true calls that do not have consensus from multiple callers. Here, we present CN-Learn, a machine-learning framework that integrates calls from multiple CNV detection algorithms and learns to accurately identify true CNVs using caller-specific and genomic features from a small subset of validated CNVs. Using CNVs predicted by four exome-based CNV callers (CANOES, CODEX, XHMM, and CLAMMS) from 503 samples, we demonstrate that CN-Learn identifies true CNVs at higher precision (∼90%) and recall (∼85%) rates while maintaining robust performance even when trained with minimal data (∼30 samples). CN-Learn recovers twice as many CNVs compared to individual callers or Venn diagram-based approaches, with features such as exome capture probe count, caller concordance, and GC content providing the most discriminatory power. In fact, ∼58% of all true CNVs recovered by CN-Learn were either singletons or calls that lacked support from at least one caller. Our study underscores the limitations of current approaches for CNV identification and provides an effective method that yields high-quality CNVs for application in clinical diagnostics.

UR - http://www.scopus.com/inward/record.url?scp=85068723872&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85068723872&partnerID=8YFLogxK

U2 - 10.1101/gr.245928.118

DO - 10.1101/gr.245928.118

M3 - Article

C2 - 31171634

AN - SCOPUS:85068723872

VL - 29

SP - 1134

EP - 1143

JO - Genome Research

JF - Genome Research

SN - 1088-9051

IS - 7

ER -