MetaPalette: a k-mer Painting Approach for Metagenomic Taxonomic Profiling and Quantification of Novel Strain Variation

David Koslicki, Daniel Falush

    Research output: Contribution to journalArticle

    12 Citations (Scopus)

    Abstract

    Metagenomic profiling is challenging in part because of the highly uneven sampling of the tree of life by genome sequencing projects and the limitations imposed by performing phylogenetic inference at fixed taxonomic ranks. We present the algorithm MetaPalette, which uses long k-mer sizes (k 30, 50) to fit a k-mer “palette” of a given sample to the k-mer palette of reference organisms. By modeling the k-mer palettes of unknown organisms, the method also gives an indication of the presence, abundance, and evolutionary relatedness of novel organisms present in the sample. The method returns a traditional, fixed-rank taxonomic profile which is shown on independently simulated data to be one of the most accurate to date. Tree figures are also returned that quantify the relatedness of novel organisms to reference sequences, and the accuracy of such figures is demonstrated on simulated spike-ins and a metagenomic soil sample. The software implementing MetaPalette is available at: https://github.com/dkoslicki/MetaPalette. Pretrained databases are included for Archaea, Bacteria, Eukaryota, and viruses. IMPORTANCE Taxonomic profiling is a challenging first step when analyzing a metagenomic sample. This work presents a method that facilitates fine-scale characterization of the presence, abundance, and evolutionary relatedness of organisms present in a given sample but absent from the training database. We calculate a “k-mer palette” which summarizes the information from all reads, not just those in conserved genes or containing taxon-specific markers. The compositions of palettes are easy to model, allowing rapid inference of community composition. In addition to providing strain-level information where applicable, our approach provides taxonomic profiles that are more accurate than those of competing methods.

    Original languageEnglish (US)
    Article numbere00020-16
    JournalmSystems
    Volume1
    Issue number3
    DOIs
    StatePublished - May 1 2016

    Fingerprint

    Metagenomics
    Paintings
    strain differences
    Painting
    Profiling
    Quantification
    Genes
    relatedness
    organisms
    Chemical analysis
    Viruses
    Bacteria
    Databases
    Sampling
    Figure
    Soils
    sampling
    Archaea
    Eukaryota
    Phylogenetics

    All Science Journal Classification (ASJC) codes

    • Microbiology
    • Physiology
    • Biochemistry
    • Ecology, Evolution, Behavior and Systematics
    • Modeling and Simulation
    • Molecular Biology
    • Genetics
    • Computer Science Applications

    Cite this

    @article{1c66ce5fe7e54df09cb9e9857d8275f1,
    title = "MetaPalette: a k-mer Painting Approach for Metagenomic Taxonomic Profiling and Quantification of Novel Strain Variation",
    abstract = "Metagenomic profiling is challenging in part because of the highly uneven sampling of the tree of life by genome sequencing projects and the limitations imposed by performing phylogenetic inference at fixed taxonomic ranks. We present the algorithm MetaPalette, which uses long k-mer sizes (k 30, 50) to fit a k-mer “palette” of a given sample to the k-mer palette of reference organisms. By modeling the k-mer palettes of unknown organisms, the method also gives an indication of the presence, abundance, and evolutionary relatedness of novel organisms present in the sample. The method returns a traditional, fixed-rank taxonomic profile which is shown on independently simulated data to be one of the most accurate to date. Tree figures are also returned that quantify the relatedness of novel organisms to reference sequences, and the accuracy of such figures is demonstrated on simulated spike-ins and a metagenomic soil sample. The software implementing MetaPalette is available at: https://github.com/dkoslicki/MetaPalette. Pretrained databases are included for Archaea, Bacteria, Eukaryota, and viruses. IMPORTANCE Taxonomic profiling is a challenging first step when analyzing a metagenomic sample. This work presents a method that facilitates fine-scale characterization of the presence, abundance, and evolutionary relatedness of organisms present in a given sample but absent from the training database. We calculate a “k-mer palette” which summarizes the information from all reads, not just those in conserved genes or containing taxon-specific markers. The compositions of palettes are easy to model, allowing rapid inference of community composition. In addition to providing strain-level information where applicable, our approach provides taxonomic profiles that are more accurate than those of competing methods.",
    author = "David Koslicki and Daniel Falush",
    year = "2016",
    month = "5",
    day = "1",
    doi = "10.1128/mSystems.00020-16",
    language = "English (US)",
    volume = "1",
    journal = "mSystems",
    issn = "2379-5077",
    publisher = "American Society for Microbiology",
    number = "3",

    }

    MetaPalette : a k-mer Painting Approach for Metagenomic Taxonomic Profiling and Quantification of Novel Strain Variation. / Koslicki, David; Falush, Daniel.

    In: mSystems, Vol. 1, No. 3, e00020-16, 01.05.2016.

    Research output: Contribution to journalArticle

    TY - JOUR

    T1 - MetaPalette

    T2 - a k-mer Painting Approach for Metagenomic Taxonomic Profiling and Quantification of Novel Strain Variation

    AU - Koslicki, David

    AU - Falush, Daniel

    PY - 2016/5/1

    Y1 - 2016/5/1

    N2 - Metagenomic profiling is challenging in part because of the highly uneven sampling of the tree of life by genome sequencing projects and the limitations imposed by performing phylogenetic inference at fixed taxonomic ranks. We present the algorithm MetaPalette, which uses long k-mer sizes (k 30, 50) to fit a k-mer “palette” of a given sample to the k-mer palette of reference organisms. By modeling the k-mer palettes of unknown organisms, the method also gives an indication of the presence, abundance, and evolutionary relatedness of novel organisms present in the sample. The method returns a traditional, fixed-rank taxonomic profile which is shown on independently simulated data to be one of the most accurate to date. Tree figures are also returned that quantify the relatedness of novel organisms to reference sequences, and the accuracy of such figures is demonstrated on simulated spike-ins and a metagenomic soil sample. The software implementing MetaPalette is available at: https://github.com/dkoslicki/MetaPalette. Pretrained databases are included for Archaea, Bacteria, Eukaryota, and viruses. IMPORTANCE Taxonomic profiling is a challenging first step when analyzing a metagenomic sample. This work presents a method that facilitates fine-scale characterization of the presence, abundance, and evolutionary relatedness of organisms present in a given sample but absent from the training database. We calculate a “k-mer palette” which summarizes the information from all reads, not just those in conserved genes or containing taxon-specific markers. The compositions of palettes are easy to model, allowing rapid inference of community composition. In addition to providing strain-level information where applicable, our approach provides taxonomic profiles that are more accurate than those of competing methods.

    AB - Metagenomic profiling is challenging in part because of the highly uneven sampling of the tree of life by genome sequencing projects and the limitations imposed by performing phylogenetic inference at fixed taxonomic ranks. We present the algorithm MetaPalette, which uses long k-mer sizes (k 30, 50) to fit a k-mer “palette” of a given sample to the k-mer palette of reference organisms. By modeling the k-mer palettes of unknown organisms, the method also gives an indication of the presence, abundance, and evolutionary relatedness of novel organisms present in the sample. The method returns a traditional, fixed-rank taxonomic profile which is shown on independently simulated data to be one of the most accurate to date. Tree figures are also returned that quantify the relatedness of novel organisms to reference sequences, and the accuracy of such figures is demonstrated on simulated spike-ins and a metagenomic soil sample. The software implementing MetaPalette is available at: https://github.com/dkoslicki/MetaPalette. Pretrained databases are included for Archaea, Bacteria, Eukaryota, and viruses. IMPORTANCE Taxonomic profiling is a challenging first step when analyzing a metagenomic sample. This work presents a method that facilitates fine-scale characterization of the presence, abundance, and evolutionary relatedness of organisms present in a given sample but absent from the training database. We calculate a “k-mer palette” which summarizes the information from all reads, not just those in conserved genes or containing taxon-specific markers. The compositions of palettes are easy to model, allowing rapid inference of community composition. In addition to providing strain-level information where applicable, our approach provides taxonomic profiles that are more accurate than those of competing methods.

    UR - http://www.scopus.com/inward/record.url?scp=85020831132&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=85020831132&partnerID=8YFLogxK

    U2 - 10.1128/mSystems.00020-16

    DO - 10.1128/mSystems.00020-16

    M3 - Article

    AN - SCOPUS:85020831132

    VL - 1

    JO - mSystems

    JF - mSystems

    SN - 2379-5077

    IS - 3

    M1 - e00020-16

    ER -