TY - JOUR

T1 - The Statistics of k-mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches

AU - Blanca, Antonio

AU - Harris, Robert S.

AU - Koslicki, David

AU - Medvedev, Paul

N1 - Funding Information:
P.M. was supported by NSF awards 1453527 and 1439057. A.B. was supported, in part, by the NSF grant CCF-1850443. This material is based upon work supported by the National Science Foundation under Grant No. 1664803.
Publisher Copyright:
© Copyright 2022, Mary Ann Liebert, Inc., publishers.

PY - 2022/2

Y1 - 2022/2

AB - k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider the simple model where a sequence S (e.g., a genome or a read) undergoes a simple mutation process through which each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers and of the number of islands (a maximal interval of mutated k-mers) and oceans (a maximal interval of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers, or, alternatively, given the Jaccard similarity (with or without MinHash). We demonstrate the usefulness of our results using a few select applications: obtaining a CI to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long-read alignments to a de Bruijn graph by Jabba.

UR - http://www.scopus.com/inward/record.url?scp=85124999827&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85124999827&partnerID=8YFLogxK

U2 - 10.1089/cmb.2021.0431

DO - 10.1089/cmb.2021.0431

M3 - Article

C2 - 35108101

AN - SCOPUS:85124999827

SN - 1066-5277

VL - 29

SP - 155

EP - 168

JO - Journal of Computational Biology

JF - Journal of Computational Biology

IS - 2

ER -