TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes

Ilia Minkin, Son Pham, Paul Medvedev

Research output: Contribution to journal (article)

7 Citations (Scopus)

Abstract

Motivation: de Bruijn graphs have been proposed as a data structure to facilitate the analysis of related whole genome sequences, in both population and comparative genomic settings. However, current approaches do not scale well to many genomes of large size (such as mammalian genomes).

Results: In this article, we present TwoPaCo, a simple and scalable low-memory algorithm for the direct construction of the compacted de Bruijn graph from a set of complete genomes. We demonstrate that it can construct the graph for 100 simulated human genomes in less than a day and eight real primates in < 2 h, on a typical shared-memory machine. We believe that this progress will enable novel biological analyses of hundreds of mammalian-sized genomes.

Availability and Implementation: Our code and data are available for download from github.com/medvedevgroup/TwoPaCo.

Contact: ium125@psu.edu.

Supplementary information: Supplementary data are available at Bioinformatics online.
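
To make the object named in the abstract concrete, here is a minimal sketch of what a compacted de Bruijn graph is, written for tiny inputs. It is not the TwoPaCo algorithm and makes no attempt at low memory use. The function names `find_junctions` and `compacted_edges` are hypothetical, and the sketch assumes one common definition: a k-mer is a junction if it has more than one distinct successor or predecessor, or if it starts or ends an input sequence. Reverse complements are ignored for simplicity.

```python
from collections import defaultdict


def find_junctions(genomes, k):
    """Return the k-mers that are junctions under the assumed definition."""
    successors = defaultdict(set)    # k-mer -> distinct following characters
    predecessors = defaultdict(set)  # k-mer -> distinct preceding characters
    junctions = set()
    for genome in genomes:
        if len(genome) < k:
            continue  # too short to contain any k-mer
        # The first and last k-mer of every sequence count as junctions.
        junctions.add(genome[:k])
        junctions.add(genome[-k:])
        for i in range(len(genome) - k):
            left = genome[i:i + k]
            right = genome[i + 1:i + k + 1]
            successors[left].add(right[-1])
            predecessors[right].add(left[0])
    # Branching k-mers: more than one distinct way out or in.
    junctions.update(kmer for kmer, nxt in successors.items() if len(nxt) > 1)
    junctions.update(kmer for kmer, prv in predecessors.items() if len(prv) > 1)
    return junctions


def compacted_edges(genomes, k):
    """Split each genome at junction positions and return distinct edge labels."""
    junctions = find_junctions(genomes, k)
    edges = set()
    for genome in genomes:
        positions = [i for i in range(len(genome) - k + 1)
                     if genome[i:i + k] in junctions]
        # Each pair of consecutive junctions spells one compacted edge.
        for start, end in zip(positions, positions[1:]):
            edges.add(genome[start:end + k])
    return edges


if __name__ == "__main__":
    toy_genomes = ["AACGTT", "GACGTC"]
    print(sorted(find_junctions(toy_genomes, 3)))
    print(sorted(compacted_edges(toy_genomes, 3)))
```

In the toy input the two sequences share the path ACG to CGT, so the compacted graph stores that shared edge once. Holding every k-mer in Python dictionaries, as done here, is only practical for small examples.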

Original language: English (US)
Pages (from-to): 4024-4032
Number of pages: 9
Journal: Bioinformatics (Oxford, England)
Volume: 33
Issue number: 24
DOI: 10.1093/bioinformatics/btw609
State: Published - Dec 15 2017

Fingerprint

De Bruijn Graph
Genome
Efficient Algorithms
Genes
Genome Size
Metagenomics
Human Genome
Computational Biology
Primates
Data storage equipment
Comparative Genomics
Bioinformatics
Shared Memory
Data structures
Availability
Contact
Graph in graph theory
Demonstrate

All Science Journal Classification (ASJC) codes

  • Statistics and Probability
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Computational Theory and Mathematics
  • Computational Mathematics

Cite this

Minkin, Ilia; Pham, Son; Medvedev, Paul. TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes. In: Bioinformatics (Oxford, England). 2017; Vol. 33, No. 24. pp. 4024-4032.

@article{fa5cda2eeb5f462e88b18c5840c4db25,
title = "TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes",
abstract = "Motivation: de Bruijn graphs have been proposed as a data structure to facilitate the analysis of related whole genome sequences, in both population and comparative genomic settings. However, current approaches do not scale well to many genomes of large size (such as mammalian genomes). Results: In this article, we present TwoPaCo, a simple and scalable low-memory algorithm for the direct construction of the compacted de Bruijn graph from a set of complete genomes. We demonstrate that it can construct the graph for 100 simulated human genomes in less than a day and eight real primates in < 2 h, on a typical shared-memory machine. We believe that this progress will enable novel biological analyses of hundreds of mammalian-sized genomes. Availability and Implementation: Our code and data are available for download from github.com/medvedevgroup/TwoPaCo. Contact: ium125@psu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.",
author = "Ilia Minkin and Son Pham and Paul Medvedev",
year = "2017",
month = "12",
day = "15",
doi = "10.1093/bioinformatics/btw609",
language = "English (US)",
volume = "33",
pages = "4024--4032",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "24",

}

TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes. / Minkin, Ilia; Pham, Son; Medvedev, Paul.

In: Bioinformatics (Oxford, England), Vol. 33, No. 24, 15.12.2017, p. 4024-4032.

TY - JOUR

T1 - TwoPaCo

T2 - an efficient algorithm to build the compacted de Bruijn graph from many complete genomes

AU - Minkin, Ilia

AU - Pham, Son

AU - Medvedev, Paul

PY - 2017/12/15

Y1 - 2017/12/15

AB - Motivation: de Bruijn graphs have been proposed as a data structure to facilitate the analysis of related whole genome sequences, in both population and comparative genomic settings. However, current approaches do not scale well to many genomes of large size (such as mammalian genomes). Results: In this article, we present TwoPaCo, a simple and scalable low-memory algorithm for the direct construction of the compacted de Bruijn graph from a set of complete genomes. We demonstrate that it can construct the graph for 100 simulated human genomes in less than a day and eight real primates in < 2 h, on a typical shared-memory machine. We believe that this progress will enable novel biological analyses of hundreds of mammalian-sized genomes. Availability and Implementation: Our code and data are available for download from github.com/medvedevgroup/TwoPaCo. Contact: ium125@psu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.

UR - http://www.scopus.com/inward/record.url?scp=85053491509&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85053491509&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btw609

DO - 10.1093/bioinformatics/btw609

M3 - Article

C2 - 27659452

AN - SCOPUS:85053491509

VL - 33

SP - 4024

EP - 4032

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 24

ER -