Compacting de Bruijn graphs from sequencing data quickly and in low memory

Rayan Chikhi, Antoine Limasset, Paul Medvedev

Research output: Contribution to journal › Article

17 Citations (Scopus)

Abstract

Motivation: As the quantity of data per sequencing experiment increases, the challenges of fragment assembly are becoming increasingly computational. The de Bruijn graph is a widely used data structure in fragment assembly algorithms, used to represent the information from a set of reads. Compaction is an important data reduction step in most de Bruijn graph-based algorithms, in which long simple paths are compacted into single vertices. Compaction has recently become the bottleneck in assembly pipelines, and improving its running time and memory usage is an important problem.

Results: We present an algorithm and a tool, bcalm 2, for the compaction of de Bruijn graphs. bcalm 2 is a parallel algorithm that distributes the input based on a minimizer hashing technique, allowing for good balance of memory usage throughout its execution. For human sequencing data, bcalm 2 reduces the computational burden of compacting the de Bruijn graph to roughly an hour and 3 GB of memory. We also applied bcalm 2 to the 22 Gbp loblolly pine and 20 Gbp white spruce sequencing datasets. Compacted graphs were constructed from raw reads in less than 2 days and 40 GB of memory on a single machine. Hence, bcalm 2 is at least an order of magnitude more efficient than other available methods.
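The two ideas named in the abstract, compacting simple paths into single vertices and distributing k-mers by minimizer, can be illustrated with a small Python sketch. This is not the bcalm 2 implementation: it is single-threaded, keeps everything in memory, ignores reverse complements, and uses plain lexicographic order to choose minimizers. All function names (`kmers`, `minimizer`, `partition_by_minimizer`, `compact`) and the parameters `k` and `m` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def kmers(reads, k):
    """Return the set of distinct k-mers in the reads (forward strand only)."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


def minimizer(kmer, m):
    """Smallest m-mer of kmer under lexicographic order (an assumed ordering)."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def partition_by_minimizer(kmer_set, m):
    """Group k-mers into buckets keyed by their minimizer."""
    buckets = defaultdict(set)
    for km in kmer_set:
        buckets[minimizer(km, m)].add(km)
    return dict(buckets)


def compact(kmer_set):
    """Merge maximal simple paths of the node-centric de Bruijn graph."""

    def succ(u):
        return [u[1:] + c for c in "ACGT" if u[1:] + c in kmer_set]

    def pred(v):
        return [c + v[:-1] for c in "ACGT" if c + v[:-1] in kmer_set]

    def unique_link(u):
        # Return v if u -> v is the only edge out of u and the only edge into v.
        s = succ(u)
        if len(s) == 1 and len(pred(s[0])) == 1:
            return s[0]
        return None

    def is_start(v):
        # v starts a path unless it is the unique continuation of its predecessor.
        p = pred(v)
        return not (len(p) == 1 and unique_link(p[0]) == v)

    unitigs, seen = [], set()
    # Pass 1 walks from path starts; pass 2 picks up leftover isolated cycles.
    for starts_only in (True, False):
        for v in sorted(kmer_set):
            if v in seen or (starts_only and not is_start(v)):
                continue
            seq = v
            seen.add(v)
            nxt = unique_link(v)
            while nxt is not None and nxt not in seen:
                seq += nxt[-1]  # each extra k-mer contributes one new base
                seen.add(nxt)
                nxt = unique_link(nxt)
            unitigs.append(seq)
    return unitigs
```

For example, `compact(kmers(["ACGTA", "ACGTC"], 3))` merges `ACG` and `CGT` into the single vertex `ACGT` and leaves the two branch k-mers `GTA` and `GTC` as their own vertices. Bucketing by minimizer means consecutive k-mers on a path tend to share a bucket, which is what makes it plausible to compact buckets independently; how buckets are merged afterwards is not covered by this sketch.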

Original language: English (US)
Pages (from-to): i201-i208
Journal: Bioinformatics
Volume: 32
Issue number: 12
DOI: 10.1093/bioinformatics/btw279
State: Published - Jun 15 2016

Fingerprint

De Bruijn Graph
Sequencing
Compaction
Data storage equipment
Fragment
Pinus taeda
Data Reduction
Hashing
Single Machine
Minimizer
Parallel Algorithms
Running
Data Structures
Pipelines
Path
Graph in graph theory
Experiment

All Science Journal Classification (ASJC) codes

  • Statistics and Probability
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Computational Theory and Mathematics
  • Computational Mathematics

Cite this

Chikhi, Rayan ; Limasset, Antoine ; Medvedev, Paul. / Compacting de Bruijn graphs from sequencing data quickly and in low memory. In: Bioinformatics. 2016 ; Vol. 32, No. 12. pp. i201-i208.
@article{ffd0e3555f5d4b658dc23008b5f570b5,
title = "Compacting de Bruijn graphs from sequencing data quickly and in low memory",
author = "Rayan Chikhi and Antoine Limasset and Paul Medvedev",
year = "2016",
month = "6",
day = "15",
doi = "10.1093/bioinformatics/btw279",
language = "English (US)",
volume = "32",
pages = "i201--i208",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "12",

}

TY - JOUR
T1 - Compacting de Bruijn graphs from sequencing data quickly and in low memory
AU - Chikhi, Rayan
AU - Limasset, Antoine
AU - Medvedev, Paul
PY - 2016/6/15
Y1 - 2016/6/15
UR - http://www.scopus.com/inward/record.url?scp=84976495714&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84976495714&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btw279
DO - 10.1093/bioinformatics/btw279
M3 - Article
C2 - 27307618
AN - SCOPUS:84976495714
VL - 32
SP - i201-i208
JO - Bioinformatics
JF - Bioinformatics
SN - 1367-4803
IS - 12
ER -