A critical task in microbiome data analysis is to explore the association between a scalar response of interest and a large number of microbial taxa that are summarized as compositional data at different taxonomic levels. Motivated by fine-mapping of the microbiome, we propose a two-step compositional knockoff filter to provide the effective finite-sample false discovery rate (FDR) control in high-dimensional linear log-contrast regression analysis of microbiome compositional data. In the first step, we propose a new compositional screening procedure to remove insignificant microbial taxa while retaining the essential sum-to-zero constraint. In the second step, we extend the knockoff filter to identify the significant microbial taxa in the sparse regression model for compositional data. Thereby, a subset of the microbes is selected from the high-dimensional microbial taxa as related to the response under a prespecified FDR threshold. We study the theoretical properties of the proposed two-step procedure, including both sure screening and effective false discovery control. We demonstrate these properties in numerical simulation studies to compare our methods to some existing ones and show power gain of the new method while controlling the nominal FDR. The potential usefulness of the proposed method is also illustrated with application to an inflammatory bowel disease data set to identify microbial taxa that influence host gene expressions.
All Science Journal Classification (ASJC) codes
- Statistics and Probability
- Biochemistry, Genetics and Molecular Biology(all)
- Immunology and Microbiology(all)
- Agricultural and Biological Sciences(all)
- Applied Mathematics