TY - JOUR
T1 - Progressive Cactus is a multiple-genome aligner for the thousand-genome era
AU - Armstrong, Joel
AU - Hickey, Glenn
AU - Diekhans, Mark
AU - Fiddes, Ian T.
AU - Novak, Adam M.
AU - Deran, Alden
AU - Fang, Qi
AU - Xie, Duo
AU - Feng, Shaohong
AU - Stiller, Josefin
AU - Genereux, Diane
AU - Johnson, Jeremy
AU - Marinescu, Voichita Dana
AU - Alföldi, Jessica
AU - Harris, Robert S.
AU - Lindblad-Toh, Kerstin
AU - Haussler, David
AU - Karlsson, Elinor
AU - Jarvis, Erich D.
AU - Zhang, Guojie
AU - Paten, Benedict
N1 - Funding Information:
Acknowledgements: The research reported in this publication was supported by the National Institutes of Health (NIH) under award numbers U01HG010961, U41HG010972, R01HG010485, 2U41HG007234, 5U54HG007990, 5T32HG008345-04 and U01HL137183. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. Parts of this work and its text were also included in J.A.’s PhD thesis [63].
Publisher Copyright:
© 2020, The Author(s).
PY - 2020/11/12
Y1 - 2020/11/12
N2 - New genome assemblies have been arriving at a rapidly increasing pace, thanks to decreases in sequencing costs and improvements in third-generation sequencing technologies [1–3]. For example, the number of vertebrate genome assemblies currently in the NCBI (National Center for Biotechnology Information) database [4] increased by more than 50% to 1,485 assemblies in the year from July 2018 to July 2019. In addition to this influx of assemblies from different species, new human de novo assemblies [5] are being produced, which enable the analysis of not only small polymorphisms, but also complex, large-scale structural differences between human individuals and haplotypes. This coming era and its unprecedented amount of data offer the opportunity to uncover many insights into genome evolution but also present challenges in how to adapt current analysis methods to meet the increased scale. Cactus [6], a reference-free multiple genome alignment program, has been shown to be highly accurate, but the existing implementation scales poorly with increasing numbers of genomes, and struggles in regions of highly duplicated sequences. Here we describe progressive extensions to Cactus to create Progressive Cactus, which enables the reference-free alignment of tens to thousands of large vertebrate genomes while maintaining high alignment quality. We describe results from an alignment of more than 600 amniote genomes, which is to our knowledge the largest multiple vertebrate genome alignment created so far.
AB - New genome assemblies have been arriving at a rapidly increasing pace, thanks to decreases in sequencing costs and improvements in third-generation sequencing technologies [1–3]. For example, the number of vertebrate genome assemblies currently in the NCBI (National Center for Biotechnology Information) database [4] increased by more than 50% to 1,485 assemblies in the year from July 2018 to July 2019. In addition to this influx of assemblies from different species, new human de novo assemblies [5] are being produced, which enable the analysis of not only small polymorphisms, but also complex, large-scale structural differences between human individuals and haplotypes. This coming era and its unprecedented amount of data offer the opportunity to uncover many insights into genome evolution but also present challenges in how to adapt current analysis methods to meet the increased scale. Cactus [6], a reference-free multiple genome alignment program, has been shown to be highly accurate, but the existing implementation scales poorly with increasing numbers of genomes, and struggles in regions of highly duplicated sequences. Here we describe progressive extensions to Cactus to create Progressive Cactus, which enables the reference-free alignment of tens to thousands of large vertebrate genomes while maintaining high alignment quality. We describe results from an alignment of more than 600 amniote genomes, which is to our knowledge the largest multiple vertebrate genome alignment created so far.
UR - http://www.scopus.com/inward/record.url?scp=85095806143&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85095806143&partnerID=8YFLogxK
U2 - 10.1038/s41586-020-2871-y
DO - 10.1038/s41586-020-2871-y
M3 - Article
C2 - 33177663
AN - SCOPUS:85095806143
SN - 0028-0836
VL - 587
SP - 246
EP - 251
JO - Nature
JF - Nature
IS - 7833
ER -