TY - JOUR
T1 - The Statistics of k-mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches
AU - Blanca, Antonio
AU - Harris, Robert S.
AU - Koslicki, David
AU - Medvedev, Paul
N1 - Funding Information:
P.M. was supported by NSF awards 1453527 and 1439057. A.B. was supported, in part, by the NSF grant CCF-1850443. This material is based upon work supported by the National Science Foundation under Grant No. 1664803.
Publisher Copyright:
© Copyright 2022, Mary Ann Liebert, Inc., publishers.
PY - 2022/2
Y1 - 2022/2
N2 - k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider the simple model where a sequence S (e.g., a genome or a read) undergoes a simple mutation process through which each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers and of the number of islands (a maximal interval of mutated k-mers) and oceans (a maximal interval of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers, or, alternatively, given the Jaccard similarity (with or without MinHash). We demonstrate the usefulness of our results using a few select applications: obtaining a CI to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long-read alignments to a de Bruijn graph by Jabba.
AB - k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider the simple model where a sequence S (e.g., a genome or a read) undergoes a simple mutation process through which each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers and of the number of islands (a maximal interval of mutated k-mers) and oceans (a maximal interval of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers, or, alternatively, given the Jaccard similarity (with or without MinHash). We demonstrate the usefulness of our results using a few select applications: obtaining a CI to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long-read alignments to a de Bruijn graph by Jabba.
UR - http://www.scopus.com/inward/record.url?scp=85124999827&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85124999827&partnerID=8YFLogxK
U2 - 10.1089/cmb.2021.0431
DO - 10.1089/cmb.2021.0431
M3 - Article
C2 - 35108101
AN - SCOPUS:85124999827
SN - 1066-5277
VL - 29
SP - 155
EP - 168
JO - Journal of Computational Biology
JF - Journal of Computational Biology
IS - 2
ER -