TY - JOUR
T1 - RecoverY
T2 - K-mer-based read classification for Y-chromosome-specific sequencing and assembly
AU - Rangavittal, Samarth
AU - Harris, Robert S.
AU - Cechova, Monika
AU - Tomaszkiewicz, Marta
AU - Chikhi, Rayan
AU - Makova, Kateryna D.
AU - Medvedev, Paul
N1 - Funding Information:
National Science Foundation (NSF)
This work was supported by National Science Foundation (NSF) awards DBI-ABI 0965596 (to K.D.M.), DBI-1356529, IIS-1453527, IIS-1421908 and CCF-1439057 (to P.M.). Additionally, this study was supported by the funds made available through the Eberly College of Sciences at Penn State, the Penn State Clinical and Translational Sciences Institute and through the Pennsylvania Department of Health (Tobacco Settlement Funds). The Department specifically disclaims responsibility for any analysis, interpretations or conclusions. M.C. was supported by the National Institutes of Health (NIH)-PSU funded Computation, Bioinformatics and Statistics (CBIOS) Predoctoral Training Program (1T32GM102057-0A1).
Publisher Copyright:
© The Author 2017. Published by Oxford University Press. All rights reserved.
PY - 2018/4/1
Y1 - 2018/4/1
N2 - Motivation: The haploid mammalian Y chromosome is usually under-represented in genome assemblies due to high repeat content and low depth due to its haploid nature. One strategy to ameliorate the low coverage of Y sequences is to experimentally enrich Y-specific material before assembly. As the enrichment process is imperfect, algorithms are needed to identify putative Y-specific reads prior to downstream assembly. A strategy that uses k-mer abundances to identify such reads was used to assemble the gorilla Y. However, the strategy required the manual setting of key parameters, a time-consuming process leading to sub-optimal assemblies. Results: We develop a method, RecoverY, that selects Y-specific reads by automatically choosing the abundance level at which a k-mer is deemed to originate from the Y. This algorithm uses prior knowledge about the Y chromosome of a related species or known Y transcript sequences. We evaluate RecoverY on both simulated and real data, for human and gorilla, and investigate its robustness to important parameters. We show that RecoverY leads to a vastly superior assembly compared to alternate strategies of filtering the reads or contigs. Compared to the preliminary strategy used by Tomaszkiewicz et al., we achieve a 33% improvement in assembly size and a 20% improvement in the NG50, demonstrating the power of automatic parameter selection. Availability and implementation: Our tool RecoverY is freely available at https://github.com/makovalab-psu/RecoverY. Contact: kmakova@bx.psu.edu or pashadag@cse.psu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
UR - http://www.scopus.com/inward/record.url?scp=85045830143&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85045830143&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btx771
DO - 10.1093/bioinformatics/btx771
M3 - Article
C2 - 29194476
AN - SCOPUS:85045830143
SN - 1367-4803
VL - 34
SP - 1125
EP - 1131
JO - Bioinformatics
JF - Bioinformatics
IS - 7
ER -