Informed and automated k-mer size selection for genome assembly

Rayan Chikhi, Paul Medvedev

Research output: Contribution to journal › Article

251 Citations (Scopus)

Abstract

Motivation: Genome assembly tools based on the de Bruijn graph framework rely on a parameter k, which represents a trade-off between several competing effects that are difficult to quantify. There is currently a lack of tools that would automatically estimate the best k to use and/or quickly generate histograms of k-mer abundances that would allow the user to make an informed decision.

Results: We develop a fast and accurate sampling method that constructs approximate abundance histograms with several orders of magnitude performance improvement over traditional methods. We then present a fast heuristic that uses the generated abundance histograms for putative k values to estimate the best possible value of k. We test the effectiveness of our tool using diverse sequencing datasets and find that its choice of k leads to some of the best assemblies.

Availability: Our tool KmerGenie is freely available at: http://kmergenie.bx.psu.edu/.
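To make the idea of a sampled abundance histogram concrete, the following is a minimal sketch under stated assumptions, and not necessarily the procedure KmerGenie itself implements. It assumes that sampling is done by hashing each k-mer and keeping only those whose hash falls in a fixed 1-in-`rate` bucket, so that every occurrence of a kept k-mer is counted and the resulting histogram can be scaled back up by `rate`. The function names (`kmers`, `keep`, `sampled_histogram`) and the `rate` parameter are hypothetical, and reverse complements are ignored for brevity.

```python
import hashlib
from collections import Counter


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def keep(kmer, rate):
    """Deterministically keep roughly 1 in `rate` distinct k-mers.

    Assumption: a hash-based filter, so all occurrences of a given
    k-mer are either all kept or all dropped.
    """
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % rate == 0


def sampled_histogram(reads, k, rate):
    """Approximate the k-mer abundance histogram from a sample.

    Returns {abundance: estimated number of distinct k-mers}, where
    the sampled counts are scaled up by `rate`.
    """
    counts = Counter()
    for read in reads:
        for km in kmers(read, k):
            if keep(km, rate):
                counts[km] += 1
    hist = Counter(counts.values())
    return {abundance: n * rate for abundance, n in sorted(hist.items())}
```

With `rate=1` every k-mer is kept and the histogram is exact; larger values of `rate` trade accuracy for less memory and time, which is the kind of trade-off the abstract alludes to. Repeating the call for several candidate k values yields one histogram per k for a selection heuristic to compare.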

Original language: English (US)
Pages (from-to): 31-37
Number of pages: 7
Journal: Bioinformatics
Volume: 30
Issue number: 1
DOI: 10.1093/bioinformatics/btt310
State: Published - Jan 1 2014

Fingerprint

Genome Size
Genome
Genes
Histogram
De Bruijn Graph
Sampling Methods
Estimate
Sequencing
Quantify
Availability
Trade-offs
Heuristics
Contact
Sampling
Datasets

All Science Journal Classification (ASJC) codes

  • Statistics and Probability
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Computational Theory and Mathematics
  • Computational Mathematics

Cite this

Chikhi, Rayan ; Medvedev, Paul. / Informed and automated k-mer size selection for genome assembly. In: Bioinformatics. 2014 ; Vol. 30, No. 1. pp. 31-37.
@article{da23c836da9548beb4c23f8dfc68f80d,
title = "Informed and automated k-mer size selection for genome assembly",
abstract = "Motivation: Genome assembly tools based on the de Bruijn graph framework rely on a parameter k, which represents a trade-off between several competing effects that are difficult to quantify. There is currently a lack of tools that would automatically estimate the best k to use and/or quickly generate histograms of k-mer abundances that would allow the user to make an informed decision. Results: We develop a fast and accurate sampling method that constructs approximate abundance histograms with several orders of magnitude performance improvement over traditional methods. We then present a fast heuristic that uses the generated abundance histograms for putative k values to estimate the best possible value of k. We test the effectiveness of our tool using diverse sequencing datasets and find that its choice of k leads to some of the best assemblies. Availability: Our tool KmerGenie is freely available at: http://kmergenie.bx.psu.edu/.",
author = "Rayan Chikhi and Paul Medvedev",
year = "2014",
month = "1",
day = "1",
doi = "10.1093/bioinformatics/btt310",
language = "English (US)",
volume = "30",
pages = "31--37",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "1",

}

TY - JOUR

T1 - Informed and automated k-mer size selection for genome assembly

AU - Chikhi, Rayan

AU - Medvedev, Paul

PY - 2014/1/1

Y1 - 2014/1/1

AB - Motivation: Genome assembly tools based on the de Bruijn graph framework rely on a parameter k, which represents a trade-off between several competing effects that are difficult to quantify. There is currently a lack of tools that would automatically estimate the best k to use and/or quickly generate histograms of k-mer abundances that would allow the user to make an informed decision. Results: We develop a fast and accurate sampling method that constructs approximate abundance histograms with several orders of magnitude performance improvement over traditional methods. We then present a fast heuristic that uses the generated abundance histograms for putative k values to estimate the best possible value of k. We test the effectiveness of our tool using diverse sequencing datasets and find that its choice of k leads to some of the best assemblies. Availability: Our tool KmerGenie is freely available at: http://kmergenie.bx.psu.edu/.

UR - http://www.scopus.com/inward/record.url?scp=84891349005&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84891349005&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btt310

DO - 10.1093/bioinformatics/btt310

M3 - Article

C2 - 23732276

AN - SCOPUS:84891349005

VL - 30

SP - 31

EP - 37

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 1

ER -