ARK: Aggregation of reads by k-means for estimation of bacterial community composition

David Koslicki, Saikat Chatterjee, Damon Shahrivar, Alan W. Walker, Suzanna C. Francis, Louise J. Fraser, Mikko Vehkaperä, Yueheng Lan, Jukka Corander

    Research output: Contribution to journal › Article

    2 Citations (Scopus)

    Abstract

    Motivation: Estimation of bacterial community composition from high-throughput sequenced 16S rRNA gene amplicons is a key task in microbial ecology. Because the sequence data from each sample typically consist of a large number of reads and are adversely affected by varying levels of biological and technical noise, accurate analysis of such large datasets is challenging.

    Results: There has been a recent surge of interest in using compressed-sensing-inspired and convex-optimization-based methods to estimate bacterial community composition. These methods typically summarize the sequence data by the frequencies of low-order k-mers and match this information statistically against a taxonomically structured database. Here we show that the accuracy of the resulting community composition estimates can be substantially improved by aggregating the reads from a sample with an unsupervised machine learning approach prior to the estimation phase. The aggregation of reads is a pre-processing step in which a standard K-means clustering algorithm partitions a large set of reads into subsets at reasonable computational cost. This yields several vectors of first-order statistics, one per subset, instead of a single statistical summary of the whole sample in terms of k-mer frequencies. The output of the clustering is then processed further to obtain the final estimate for each sample. The resulting method is called Aggregation of Reads by K-means (ARK), and it is based on a statistical argument via a mixture density formulation. ARK is found to improve the fidelity and robustness of several recently introduced methods, with only a modest increase in computational complexity.

    Availability: An open-source, platform-independent implementation of the method in the Julia programming language is freely available at https://github.com/dkoslicki/ARK. A Matlab implementation is available at http://www.ee.kth.se/ctsoftware.
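
    The pipeline described in the abstract (per-read k-mer frequencies, K-means partitioning of the reads, one estimate per cluster, then a combined estimate) can be illustrated with a minimal Python sketch. This is not the authors' Julia or Matlab implementation. The function names, the toy reads and reference sequences, and the weighting of each cluster's estimate by its share of reads are assumptions made for illustration. In particular, `nearest_reference_estimator` is a hypothetical stand-in for the compressed-sensing and convex-optimization estimators that ARK is meant to wrap.

```python
import random
from itertools import product


def kmer_frequencies(read, k=2, alphabet="ACGT"):
    """Normalized k-mer frequency vector (first-order statistics) of one read."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0.0] * len(index)
    for i in range(len(read) - k + 1):
        j = index.get(read[i:i + k])
        if j is not None:  # skip k-mers containing ambiguous bases such as N
            counts[j] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def kmeans(vectors, n_clusters, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, n_clusters)
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        labels = [min(range(n_clusters), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        updated = []
        for c in range(n_clusters):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            # An empty cluster keeps its previous centroid.
            updated.append(mean_vector(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


def nearest_reference_estimator(reference_profiles):
    """Hypothetical stand-in estimator: all mass goes to the closest reference."""
    def estimate(profile):
        best = min(range(len(reference_profiles)),
                   key=lambda r: sq_dist(profile, reference_profiles[r]))
        return [1.0 if r == best else 0.0 for r in range(len(reference_profiles))]
    return estimate


def ark_estimate(reads, estimate_composition, n_clusters=2, k=2):
    """Cluster reads, estimate per cluster, combine weighted by cluster size."""
    vectors = [kmer_frequencies(r, k) for r in reads]
    centroids, labels = kmeans(vectors, n_clusters)
    combined = None
    for c, centroid in enumerate(centroids):
        weight = labels.count(c) / len(reads)  # assumed mixture weight
        if weight == 0:
            continue
        estimate = estimate_composition(centroid)
        if combined is None:
            combined = [0.0] * len(estimate)
        combined = [t + weight * e for t, e in zip(combined, estimate)]
    return combined


# Toy data (made up): three reads resembling taxon 0, one resembling taxon 1.
reads = ["ACACACAC", "ACACACAC", "CACACACA", "GTGTGTGT"]
references = [kmer_frequencies("ACACACACAC"), kmer_frequencies("GTGTGTGTGT")]
composition = ark_estimate(reads, nearest_reference_estimator(references))
print(composition)
```

    Passing the estimator as a function mirrors the idea that the aggregation step is a pre-processing layer: any method that maps a k-mer frequency vector to a composition vector could be plugged in without changing the clustering code.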

    Original language: English (US)
    Article number: e0140644
    Journal: PloS one
    Volume: 10
    Issue number: 10
    DOI: https://doi.org/10.1371/journal.pone.0140644
    State: Published - Oct 23 2015

    Fingerprint

    bacterial communities
    Agglomeration
    Chemical analysis
    Cluster Analysis
    Compressed sensing
    Convex optimization
    Programming Languages
    Ecology
    methodology
    Clustering algorithms
    Computer programming languages
    artificial intelligence
    microbial ecology
    Learning systems
    Computational complexity
    rRNA Genes
    sampling
    Genes
    Throughput
    Statistics

    All Science Journal Classification (ASJC) codes

    • Biochemistry, Genetics and Molecular Biology (all)
    • Agricultural and Biological Sciences (all)

    Cite this

    Koslicki, D., Chatterjee, S., Shahrivar, D., Walker, A. W., Francis, S. C., Fraser, L. J., ... Corander, J. (2015). ARK: Aggregation of reads by k-means for estimation of bacterial community composition. PloS one, 10(10), [e0140644]. https://doi.org/10.1371/journal.pone.0140644
    Koslicki, David; Chatterjee, Saikat; Shahrivar, Damon; Walker, Alan W.; Francis, Suzanna C.; Fraser, Louise J.; Vehkaperä, Mikko; Lan, Yueheng; Corander, Jukka. / ARK: Aggregation of reads by k-means for estimation of bacterial community composition. In: PloS one. 2015; Vol. 10, No. 10.
    @article{2daef97a0bb0465fb925835701a4876c,
    title = "ARK: Aggregation of reads by k-means for estimation of bacterial community composition",
    author = "David Koslicki and Saikat Chatterjee and Damon Shahrivar and Walker, {Alan W.} and Francis, {Suzanna C.} and Fraser, {Louise J.} and Mikko Vehkaper{\"a} and Yueheng Lan and Jukka Corander",
    year = "2015",
    month = "10",
    day = "23",
    doi = "10.1371/journal.pone.0140644",
    language = "English (US)",
    volume = "10",
    journal = "PLoS One",
    issn = "1932-6203",
    publisher = "Public Library of Science",
    number = "10",

    }

    Koslicki, D, Chatterjee, S, Shahrivar, D, Walker, AW, Francis, SC, Fraser, LJ, Vehkaperä, M, Lan, Y & Corander, J 2015, 'ARK: Aggregation of reads by k-means for estimation of bacterial community composition', PloS one, vol. 10, no. 10, e0140644. https://doi.org/10.1371/journal.pone.0140644

    ARK: Aggregation of reads by k-means for estimation of bacterial community composition. / Koslicki, David; Chatterjee, Saikat; Shahrivar, Damon; Walker, Alan W.; Francis, Suzanna C.; Fraser, Louise J.; Vehkaperä, Mikko; Lan, Yueheng; Corander, Jukka.

    In: PloS one, Vol. 10, No. 10, e0140644, 23.10.2015.

    TY - JOUR

    T1 - ARK

    T2 - Aggregation of reads by k-means for estimation of bacterial community composition

    AU - Koslicki, David

    AU - Chatterjee, Saikat

    AU - Shahrivar, Damon

    AU - Walker, Alan W.

    AU - Francis, Suzanna C.

    AU - Fraser, Louise J.

    AU - Vehkaperä, Mikko

    AU - Lan, Yueheng

    AU - Corander, Jukka

    PY - 2015/10/23

    Y1 - 2015/10/23

    UR - http://www.scopus.com/inward/record.url?scp=84949460421&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=84949460421&partnerID=8YFLogxK

    U2 - 10.1371/journal.pone.0140644

    DO - 10.1371/journal.pone.0140644

    M3 - Article

    C2 - 26496191

    AN - SCOPUS:84949460421

    VL - 10

    JO - PLoS One

    JF - PLoS One

    SN - 1932-6203

    IS - 10

    M1 - e0140644

    ER -

    Koslicki D, Chatterjee S, Shahrivar D, Walker AW, Francis SC, Fraser LJ et al. ARK: Aggregation of reads by k-means for estimation of bacterial community composition. PloS one. 2015 Oct 23;10(10). e0140644. https://doi.org/10.1371/journal.pone.0140644