RecoverY: K-mer-based read classification for Y-chromosome-specific sequencing and assembly

Samarth Rangavittal, Robert S. Harris, Monika Cechova, Marta Tomaszkiewicz, Rayan Chikhi, Kateryna D. Makova, Paul Medvedev

Research output: Contribution to journal › Article

4 Citations (Scopus)

Abstract

Motivation: The haploid mammalian Y chromosome is usually under-represented in genome assemblies because of its high repeat content and the low sequencing depth that results from its haploid nature. One strategy to ameliorate the low coverage of Y sequences is to experimentally enrich Y-specific material before assembly. As the enrichment process is imperfect, algorithms are needed to identify putative Y-specific reads prior to downstream assembly. A strategy that uses k-mer abundances to identify such reads was used to assemble the gorilla Y. However, the strategy required the manual setting of key parameters, a time-consuming process leading to sub-optimal assemblies.

Results: We develop a method, RecoverY, that selects Y-specific reads by automatically choosing the abundance level at which a k-mer is deemed to originate from the Y. This algorithm uses prior knowledge about the Y chromosome of a related species or known Y transcript sequences. We evaluate RecoverY on both simulated and real data, for human and gorilla, and investigate its robustness to important parameters. We show that RecoverY leads to a vastly superior assembly compared to alternative strategies of filtering the reads or contigs. Compared to the preliminary strategy used by Tomaszkiewicz et al., we achieve a 33% improvement in assembly size and a 20% improvement in the NG50, demonstrating the power of automatic parameter selection.

Availability and implementation: Our tool RecoverY is freely available at https://github.com/makovalab-psu/RecoverY.

Contact: kmakova@bx.psu.edu or pashadag@cse.psu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
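The general idea of selecting reads by k-mer abundance can be illustrated with a small sketch. This is not the RecoverY implementation: the function names, the quantile rule in choose_threshold, and the min_fraction cutoff are all assumptions made for illustration, and the abstract does not specify how the tool derives its threshold from prior Y knowledge. The sketch also ignores reverse complements and sequencing errors.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def choose_threshold(counts, known_y_seqs, k, quantile=0.05):
    """Hypothetical rule: take a low quantile of the abundances of k-mers
    that also occur in known Y sequences (e.g. a related species' Y)."""
    abundances = sorted(
        counts[km]
        for seq in known_y_seqs
        for km in kmers(seq, k)
        if km in counts
    )
    if not abundances:
        raise ValueError("no k-mer from the known Y sequences occurs in the reads")
    return abundances[int(quantile * (len(abundances) - 1))]


def classify_reads(reads, counts, k, threshold, min_fraction=0.5):
    """Keep reads in which at least min_fraction of the k-mers reach the
    abundance threshold (min_fraction is an illustrative assumption)."""
    selected = []
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read shorter than k
        abundant = sum(1 for km in read_kmers if counts[km] >= threshold)
        if abundant / len(read_kmers) >= min_fraction:
            selected.append(read)
    return selected


# Toy data: an "enriched" sequence seen five times and one contaminant read.
reads = ["ACGTTGCA"] * 5 + ["GGGCCCTT"]
counts = count_kmers(reads, k=4)
threshold = choose_threshold(counts, known_y_seqs=["ACGTTGCA"], k=4)
y_reads = classify_reads(reads, counts, k=4, threshold=threshold)
```

In this toy input the enriched sequence contributes k-mers with abundance 5 while the contaminant's k-mers occur once, so a threshold taken from the known Y k-mers separates the two groups. Real data would need a dedicated k-mer counter and canonical (strand-independent) k-mers.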

Original language: English (US)
Pages (from-to): 1125-1131
Number of pages: 7
Journal: Bioinformatics
Volume: 34
Issue number: 7
DOI: 10.1093/bioinformatics/btx771
State: Published - Apr 1 2018

Fingerprint

Gorilla gorilla
Y Chromosome
Haploidy
Chromosomes
Sequencing
Chromosome
Recovery
Mammalian Chromosomes
Computational Biology
Genome
Parameter Selection
Bioinformatics
Prior Knowledge
Imperfect
Alternate
Coverage
Availability
Filtering
Genes
Strategy

All Science Journal Classification (ASJC) codes

  • Statistics and Probability
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Computational Theory and Mathematics
  • Computational Mathematics

Cite this

Rangavittal, Samarth; Harris, Robert S.; Cechova, Monika; Tomaszkiewicz, Marta; Chikhi, Rayan; Makova, Kateryna D.; Medvedev, Paul. / RecoverY: K-mer-based read classification for Y-chromosome-specific sequencing and assembly. In: Bioinformatics. 2018; Vol. 34, No. 7. pp. 1125-1131.

@article{a5b9c22f02c344438dbd901814a38c15,
title = "RecoverY: K-mer-based read classification for Y-chromosome-specific sequencing and assembly",
author = "Samarth Rangavittal and Harris, {Robert S.} and Monika Cechova and Marta Tomaszkiewicz and Rayan Chikhi and Makova, {Kateryna D.} and Paul Medvedev",
year = "2018",
month = "4",
day = "1",
doi = "10.1093/bioinformatics/btx771",
language = "English (US)",
volume = "34",
pages = "1125--1131",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "7",

}

RecoverY: K-mer-based read classification for Y-chromosome-specific sequencing and assembly. / Rangavittal, Samarth; Harris, Robert S.; Cechova, Monika; Tomaszkiewicz, Marta; Chikhi, Rayan; Makova, Kateryna D.; Medvedev, Paul.

In: Bioinformatics, Vol. 34, No. 7, 01.04.2018, p. 1125-1131.


TY - JOUR

T1 - RecoverY

T2 - K-mer-based read classification for Y-chromosome-specific sequencing and assembly

AU - Rangavittal, Samarth

AU - Harris, Robert S.

AU - Cechova, Monika

AU - Tomaszkiewicz, Marta

AU - Chikhi, Rayan

AU - Makova, Kateryna D.

AU - Medvedev, Paul

PY - 2018/4/1

Y1 - 2018/4/1

AB - Motivation The haploid mammalian Y chromosome is usually under-represented in genome assemblies due to high repeat content and low depth due to its haploid nature. One strategy to ameliorate the low coverage of Y sequences is to experimentally enrich Y-specific material before assembly. As the enrichment process is imperfect, algorithms are needed to identify putative Y-specific reads prior to downstream assembly. A strategy that uses k-mer abundances to identify such reads was used to assemble the gorilla Y. However, the strategy required the manual setting of key parameters, a time-consuming process leading to sub-optimal assemblies. Results We develop a method, RecoverY, that selects Y-specific reads by automatically choosing the abundance level at which a k-mer is deemed to originate from the Y. This algorithm uses prior knowledge about the Y chromosome of a related species or known Y transcript sequences. We evaluate RecoverY on both simulated and real data, for human and gorilla, and investigate its robustness to important parameters. We show that RecoverY leads to a vastly superior assembly compared to alternate strategies of filtering the reads or contigs. Compared to the preliminary strategy used by Tomaszkiewicz et al., we achieve a 33% improvement in assembly size and a 20% improvement in the NG50, demonstrating the power of automatic parameter selection. Availability and implementation Our tool RecoverY is freely available at https://github.com/makovalab-psu/RecoverY. Contact kmakova@bx.psu.edu or pashadag@cse.psu.edu Supplementary informationSupplementary data are available at Bioinformatics online.

UR - http://www.scopus.com/inward/record.url?scp=85045830143&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85045830143&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btx771

DO - 10.1093/bioinformatics/btx771

M3 - Article

C2 - 29194476

AN - SCOPUS:85045830143

VL - 34

SP - 1125

EP - 1131

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 7

ER -