TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes

Ilia Minkin, Son Pham, Paul Medvedev

Research output: Contribution to journal, Article

10 Scopus citations

Abstract

Motivation: de Bruijn graphs have been proposed as a data structure to facilitate the analysis of related whole genome sequences, in both population and comparative genomic settings. However, current approaches do not scale well to many genomes of large size (such as mammalian genomes).

Results: In this article, we present TwoPaCo, a simple and scalable low-memory algorithm for the direct construction of the compacted de Bruijn graph from a set of complete genomes. We demonstrate that it can construct the graph for 100 simulated human genomes in less than a day and for eight real primate genomes in < 2 h, on a typical shared-memory machine. We believe that this progress will enable novel biological analyses of hundreds of mammalian-sized genomes.
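
For readers unfamiliar with the data structure, the following is a minimal Python sketch of what a compacted de Bruijn graph contains. It is a naive in-memory illustration, not the TwoPaCo algorithm, whose internals the abstract does not describe. It assumes that vertices are k-mers, that a "junction" is a k-mer with more than one distinct successor or predecessor or one that starts or ends an input sequence, and that reverse complements are ignored. The function names `junction_kmers` and `compacted_edges` are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def junction_kmers(genomes, k):
    """Return k-mers that branch or that start/end an input sequence.

    Assumption for this sketch: reverse complements are ignored.
    """
    successors = defaultdict(set)    # k-mer -> characters that follow it
    predecessors = defaultdict(set)  # k-mer -> characters that precede it
    junctions = set()
    for seq in genomes:
        if len(seq) < k:
            continue
        # The first and last k-mers of every sequence are kept as vertices.
        junctions.add(seq[:k])
        junctions.add(seq[-k:])
        for i in range(len(seq) - k):
            left = seq[i:i + k]
            right = seq[i + 1:i + k + 1]
            successors[left].add(right[-1])
            predecessors[right].add(left[0])
    # A k-mer with more than one way in or out is a branching vertex.
    for kmer, chars in successors.items():
        if len(chars) > 1:
            junctions.add(kmer)
    for kmer, chars in predecessors.items():
        if len(chars) > 1:
            junctions.add(kmer)
    return junctions


def compacted_edges(genomes, k):
    """Split each sequence at junction k-mers.

    Each returned string spells one non-branching path between two
    consecutive junctions, i.e. one edge of the compacted graph.
    """
    junctions = junction_kmers(genomes, k)
    edges = set()
    for seq in genomes:
        if len(seq) < k:
            continue
        start = 0
        for i in range(1, len(seq) - k + 1):
            if seq[i:i + k] in junctions:
                edges.add(seq[start:i + k])
                start = i
    return edges


if __name__ == "__main__":
    toy_genomes = ["ACGTACC", "ACGTTCC"]  # made-up sequences, not real data
    print(sorted(junction_kmers(toy_genomes, 3)))
    print(sorted(compacted_edges(toy_genomes, 3)))
```

In the toy input the two sequences share the prefix `ACGT` and then diverge, so the shared part collapses into a single edge and each divergent tail becomes its own edge. This sketch keeps every k-mer in memory at once, which is exactly the kind of cost that limits naive approaches on many large genomes.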

Availability and Implementation: Our code and data are available for download from github.com/medvedevgroup/TwoPaCo.

Contact: ium125@psu.edu.

Supplementary information: Supplementary data are available at Bioinformatics online.

Original language: English (US)
Pages (from-to): 4024-4032
Number of pages: 9
Journal: Bioinformatics (Oxford, England)
Volume: 33
Issue number: 24
DOIs
State: Published - Dec 15 2017

All Science Journal Classification (ASJC) codes

  • Statistics and Probability
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Computational Theory and Mathematics
  • Computational Mathematics

