SEK: Sparsity exploiting k-mer-based estimation of bacterial community composition

Saikat Chatterjee, David Koslicki, Siyuan Dong, Nicolas Innocenti, Lu Cheng, Yueheng Lan, Mikko Vehkaperä, Mikael Skoglund, Lars K. Rasmussen, Erik Aurell, Jukka Corander

    Research output: Contribution to journal (Article)

    6 Citations (Scopus)

    Abstract

    Motivation: Estimation of bacterial community composition from a high-throughput sequenced sample is an important task in metagenomics applications. Because the sample sequence data typically contain reads of variable lengths and different levels of biological and technical noise, accurate statistical analysis of such data is challenging. Currently popular estimation methods are typically time-consuming in a desktop computing environment.

    Results: Using sparsity-enforcing methods from the general sparse signal processing field (such as compressed sensing), we derive a solution to the community composition estimation problem by a simultaneous assignment of all sample reads to a pre-processed reference database. A general statistical model based on kernel density estimation techniques is introduced for the assignment task, and the model solution is obtained using convex optimization tools. Further, we design a greedy algorithm that yields a fast solution. Our approach offers a reasonably fast community composition estimation method, which is shown to be more robust to input data variation than a recently introduced related method.
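
As a rough illustration of the idea summarized in the abstract (k-mer features, a reference database, and a sparse non-negative composition estimate found greedily), the following Python sketch may help. It is not the authors' SEK algorithm: it omits the kernel density estimation model and the convex optimization step, and the function names `kmer_profile` and `greedy_composition`, the toy reference sequences, and the parameters `max_taxa` and `tol` are all assumptions made for this example.

```python
from itertools import product

import numpy as np


def kmer_profile(seq, k):
    """Relative frequency of every DNA k-mer in seq, as a length-4**k vector."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # k-mers with ambiguous bases are skipped
        if j is not None:
            counts[j] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def greedy_composition(A, y, max_taxa=5, tol=1e-9):
    """Greedy sparse estimate of x in y ~ A @ x, with x >= 0 and sum(x) = 1.

    A holds one reference k-mer profile per column; y is the sample profile.
    This is a simple matching-pursuit-style heuristic (an assumption of this
    sketch), not the algorithm from the paper.
    """
    support = []
    x = np.zeros(A.shape[1])
    residual = y.copy()
    for _ in range(max_taxa):
        corr = A.T @ residual
        if support:
            corr[support] = -np.inf  # never pick the same reference twice
        j = int(np.argmax(corr))
        if corr[j] <= tol:
            break  # no remaining reference explains the residual
        support.append(j)
        # Refit all selected references jointly, then clip negative weights.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        x[:] = 0.0
        x[support] = np.clip(coef, 0.0, None)
        residual = y - A @ x
        if np.linalg.norm(residual) <= tol:
            break
    total = x.sum()
    return x / total if total > 0 else x


# Toy reference "database" and a synthetic sample mixing references 0 and 2.
refs = ["ACGTACGTAC", "GGGCCCGGGC", "TTATTATTAT"]
A = np.column_stack([kmer_profile(r, 2) for r in refs])
y = 0.7 * A[:, 0] + 0.3 * A[:, 2]
estimate = greedy_composition(A, y, max_taxa=3)
print(estimate)
```

The sample here is built directly from the reference profiles, so it is noise-free; real reads would first be converted to a k-mer profile with `kmer_profile` and would only approximately match a mixture of reference columns.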

    Original language: English (US)
    Pages (from-to): 2423-2431
    Number of pages: 9
    Journal: Bioinformatics
    Volume: 30
    Issue number: 17
    DOIs: https://doi.org/10.1093/bioinformatics/btu320
    State: Published - Sep 1 2014

    Fingerprint

    Sparsity
    Chemical analysis
    Task Assignment
    Metagenomics
    Statistical Data Interpretation
    Kernel Density Estimation
    Spatial Analysis
    Compressed Sensing
    Statistical Models
    Greedy Algorithm
    Convex Optimization
    High Throughput
    Statistical Analysis
    Signal Processing
    Noise
    Assignment

    All Science Journal Classification (ASJC) codes

    • Statistics and Probability
    • Biochemistry
    • Molecular Biology
    • Computer Science Applications
    • Computational Theory and Mathematics
    • Computational Mathematics

    Cite this

    Chatterjee, S., Koslicki, D., Dong, S., Innocenti, N., Cheng, L., Lan, Y., ... Corander, J. (2014). SEK: Sparsity exploiting k-mer-based estimation of bacterial community composition. Bioinformatics, 30(17), 2423-2431. https://doi.org/10.1093/bioinformatics/btu320
    Chatterjee, Saikat ; Koslicki, David ; Dong, Siyuan ; Innocenti, Nicolas ; Cheng, Lu ; Lan, Yueheng ; Vehkaperä, Mikko ; Skoglund, Mikael ; Rasmussen, Lars K. ; Aurell, Erik ; Corander, Jukka. / SEK : Sparsity exploiting k-mer-based estimation of bacterial community composition. In: Bioinformatics. 2014 ; Vol. 30, No. 17. pp. 2423-2431.
    @article{c08eafb9c7b64c7eb92c04acdeb22a90,
    title = "SEK: Sparsity exploiting k-mer-based estimation of bacterial community composition",
    abstract = "Motivation: Estimation of bacterial community composition from a high-throughput sequenced sample is an important task in metagenomics applications. As the sample sequence data typically harbors reads of variable lengths and different levels of biological and technical noise, accurate statistical analysis of such data is challenging. Currently popular estimation methods are typically time-consuming in a desktop computing environment. Results: Using sparsity enforcing methods from the general sparse signal processing field (such as compressed sensing), we derive a solution to the community composition estimation problem by a simultaneous assignment of all sample reads to a pre-processed reference database. A general statistical model based on kernel density estimation techniques is introduced for the assignment task, and the model solution is obtained using convex optimization tools. Further, we design a greedy algorithm solution for a fast solution. Our approach offers a reasonably fast community composition estimation method, which is shown to be more robust to input data variation than a recently introduced related method.",
    author = "Saikat Chatterjee and David Koslicki and Siyuan Dong and Nicolas Innocenti and Lu Cheng and Yueheng Lan and Mikko Vehkaper{\"a} and Mikael Skoglund and Rasmussen, {Lars K.} and Erik Aurell and Jukka Corander",
    year = "2014",
    month = "9",
    day = "1",
    doi = "10.1093/bioinformatics/btu320",
    language = "English (US)",
    volume = "30",
    pages = "2423--2431",
    journal = "Bioinformatics",
    issn = "1367-4803",
    publisher = "Oxford University Press",
    number = "17",

    }

    Chatterjee, S, Koslicki, D, Dong, S, Innocenti, N, Cheng, L, Lan, Y, Vehkaperä, M, Skoglund, M, Rasmussen, LK, Aurell, E & Corander, J 2014, 'SEK: Sparsity exploiting k-mer-based estimation of bacterial community composition', Bioinformatics, vol. 30, no. 17, pp. 2423-2431. https://doi.org/10.1093/bioinformatics/btu320

    SEK : Sparsity exploiting k-mer-based estimation of bacterial community composition. / Chatterjee, Saikat; Koslicki, David; Dong, Siyuan; Innocenti, Nicolas; Cheng, Lu; Lan, Yueheng; Vehkaperä, Mikko; Skoglund, Mikael; Rasmussen, Lars K.; Aurell, Erik; Corander, Jukka.

    In: Bioinformatics, Vol. 30, No. 17, 01.09.2014, p. 2423-2431.

    TY - JOUR

    T1 - SEK

    T2 - Sparsity exploiting k-mer-based estimation of bacterial community composition

    AU - Chatterjee, Saikat

    AU - Koslicki, David

    AU - Dong, Siyuan

    AU - Innocenti, Nicolas

    AU - Cheng, Lu

    AU - Lan, Yueheng

    AU - Vehkaperä, Mikko

    AU - Skoglund, Mikael

    AU - Rasmussen, Lars K.

    AU - Aurell, Erik

    AU - Corander, Jukka

    PY - 2014/9/1

    Y1 - 2014/9/1

    AB - Motivation: Estimation of bacterial community composition from a high-throughput sequenced sample is an important task in metagenomics applications. As the sample sequence data typically harbors reads of variable lengths and different levels of biological and technical noise, accurate statistical analysis of such data is challenging. Currently popular estimation methods are typically time-consuming in a desktop computing environment. Results: Using sparsity enforcing methods from the general sparse signal processing field (such as compressed sensing), we derive a solution to the community composition estimation problem by a simultaneous assignment of all sample reads to a pre-processed reference database. A general statistical model based on kernel density estimation techniques is introduced for the assignment task, and the model solution is obtained using convex optimization tools. Further, we design a greedy algorithm solution for a fast solution. Our approach offers a reasonably fast community composition estimation method, which is shown to be more robust to input data variation than a recently introduced related method.

    UR - http://www.scopus.com/inward/record.url?scp=84907029456&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=84907029456&partnerID=8YFLogxK

    U2 - 10.1093/bioinformatics/btu320

    DO - 10.1093/bioinformatics/btu320

    M3 - Article

    C2 - 24812337

    AN - SCOPUS:84907029456

    VL - 30

    SP - 2423

    EP - 2431

    JO - Bioinformatics

    JF - Bioinformatics

    SN - 1367-4803

    IS - 17

    ER -