DiscoverY

A classifier for identifying Y chromosome sequences in male assemblies

Samarth Rangavittal, Natasha Stopa, Marta Hoover, Kristoffer Sahlin, Kateryna Dmytrivna Makova, Paul Medvedev

Research output: Contribution to journal › Article

Abstract

Background: Although the Y chromosome plays an important role in male sex determination and fertility, it is currently understudied due to its haploid and repetitive nature. Methods to isolate Y-specific contigs from a whole-genome assembly broadly fall into two categories. The first involves retrieving Y-contigs using proportion sharing with a female, but such a strategy is prone to false positives in the absence of a high-quality, complete female reference. A second strategy uses the ratio of depth of coverage from male and female reads to select Y-contigs, but such a method requires high-depth sequencing of a female and cannot utilize existing female references.

Results: We develop a k-mer-based method called DiscoverY, which combines proportion sharing with a female with depth of coverage from male reads to classify contigs as Y-chromosomal. We evaluate the performance of DiscoverY on human and gorilla genomes, across different sequencing platforms including Illumina, 10X, and PacBio. In the cases where the male and female data are of high quality, DiscoverY has high precision and recall and outperforms existing methods. For cases in which a high-quality female reference is not available, we quantify the effect of using a draft reference or even just raw sequencing reads from a female.

Conclusion: DiscoverY is an effective method to isolate Y-specific contigs from a whole-genome assembly. However, regions homologous to the X chromosome remain difficult to detect.
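The combination of the two signals named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the use of a median as the depth summary, and the threshold parameters (`max_shared`, `min_depth`, `max_depth`) are all assumptions made for illustration. The sketch computes, for each contig, the fraction of its k-mers that also occur in a female sequence set and the median abundance of its k-mers in male reads, then flags the contig as a Y candidate when the shared fraction is low and the depth falls inside a chosen window.

```python
from statistics import median


def kmers(seq, k):
    """Yield every k-mer of seq, uppercased, without canonicalisation."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers(seqs, k):
    """Count k-mer occurrences across an iterable of sequences."""
    counts = {}
    for s in seqs:
        for km in kmers(s, k):
            counts[km] = counts.get(km, 0) + 1
    return counts


def contig_features(contig, female_kmers, male_counts, k):
    """Return (proportion of k-mers shared with female, median male k-mer depth)."""
    kms = list(kmers(contig, k))
    if not kms:
        return 0.0, 0.0
    shared = sum(1 for km in kms if km in female_kmers) / len(kms)
    depth = median(male_counts.get(km, 0) for km in kms)
    return shared, depth


def is_y_candidate(contig, female_kmers, male_counts, k,
                   max_shared, min_depth, max_depth):
    """Hypothetical rule: low sharing with female AND depth inside a window."""
    shared, depth = contig_features(contig, female_kmers, male_counts, k)
    return shared <= max_shared and min_depth <= depth <= max_depth
```

In this sketch `female_kmers` can be built with `count_kmers` from either a female reference or raw female reads, mirroring the two scenarios the abstract evaluates. Suitable thresholds would depend on sequencing depth and data quality and are not given in the abstract.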

Original language: English (US)
Article number: 641
Journal: BMC Genomics
Volume: 20
Issue number: 1
DOI: 10.1186/s12864-019-5996-3
State: Published - Aug 9 2019

Fingerprint

Chromosomes
Genome
Gorilla gorilla
Y Chromosome
Haploidy
X Chromosome
Human Genome
Fertility

All Science Journal Classification (ASJC) codes

  • Biotechnology
  • Genetics

Cite this

Rangavittal, Samarth ; Stopa, Natasha ; Hoover, Marta ; Sahlin, Kristoffer ; Makova, Kateryna Dmytrivna ; Medvedev, Paul. / DiscoverY : A classifier for identifying y chromosome sequences in male assemblies. In: BMC genomics. 2019 ; Vol. 20, No. 1.
@article{fec4bf1c04a44260bf039e5109e4a0b0,
title = "DiscoverY: A classifier for identifying y chromosome sequences in male assemblies",
author = "Samarth Rangavittal and Natasha Stopa and Marta Hoover and Kristoffer Sahlin and Makova, {Kateryna Dmytrivna} and Paul Medvedev",
year = "2019",
month = "8",
day = "9",
doi = "10.1186/s12864-019-5996-3",
language = "English (US)",
volume = "20",
journal = "BMC Genomics",
issn = "1471-2164",
publisher = "BioMed Central",
number = "1",

}

TY - JOUR

T1 - DiscoverY

T2 - A classifier for identifying y chromosome sequences in male assemblies

AU - Rangavittal, Samarth

AU - Stopa, Natasha

AU - Hoover, Marta

AU - Sahlin, Kristoffer

AU - Makova, Kateryna Dmytrivna

AU - Medvedev, Paul

PY - 2019/8/9

Y1 - 2019/8/9

UR - http://www.scopus.com/inward/record.url?scp=85070546695&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85070546695&partnerID=8YFLogxK

U2 - 10.1186/s12864-019-5996-3

DO - 10.1186/s12864-019-5996-3

M3 - Article

VL - 20

JO - BMC Genomics

JF - BMC Genomics

SN - 1471-2164

IS - 1

M1 - 641

ER -