TY - JOUR
T1 - A practical guide to understanding and validating complex models using data simulations
AU - DiRenzo, Graziella V.
AU - Hanks, Ephraim
AU - Miller, David A.W.
N1 - Funding Information:
We thank L. M. Browne for constructive comments on a previous version of this manuscript. Any use of trade, firm or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.
Publisher Copyright:
© 2022 The Authors. Methods in Ecology and Evolution published by John Wiley & Sons Ltd on behalf of British Ecological Society. This article has been contributed to by U.S. Government employees and their work is in the public domain in the USA.
PY - 2023/1
Y1 - 2023/1
N2 - Biologists routinely fit novel and complex statistical models to push the limits of our understanding. Examples include, but are not limited to, flexible Bayesian approaches (e.g. BUGS, Stan), frequentist and likelihood-based approaches (e.g. the package lme4) and machine learning methods. These software packages and programs afford the user greater control and flexibility in tailoring complex hierarchical models. However, this level of control and flexibility places a higher degree of responsibility on the user to evaluate the robustness of their statistical inference. To determine how often biologists are running model diagnostics on hierarchical models, we reviewed 50 papers published in 2021 in the journal Nature Ecology & Evolution, and we found that the majority of published papers did not report any validation of their hierarchical models, making it difficult for the reader to assess the robustness of their inference. This lack of reporting likely stems from a lack of standardized guidance for best practices and standard methods. Here, we provide a guide to understanding and validating complex models using data simulations. To determine how often biologists use data simulation techniques, we also reviewed 50 papers published in 2021 in the journal Methods in Ecology and Evolution. We found that 78% of the papers that proposed a new estimation technique, package or model used simulations or generated data in some capacity (18 of 23 papers), but very few of those papers (5 of 23 papers) included either a demonstration that the code could recover realistic estimates for a dataset with known parameters or a demonstration of the statistical properties of the approach. To distil the variety of simulation techniques and their uses, we provide a taxonomy of simulation studies based on the intended inference. We also encourage authors to include a basic validation study whenever novel statistical models are used, which, in general, is easy to implement.
Simulating data helps a researcher gain a deeper understanding of the models and their assumptions and establish the reliability of their estimation approaches. Wider adoption of data simulations by biologists can improve statistical inference, reliability and open science practices.
AB - Biologists routinely fit novel and complex statistical models to push the limits of our understanding. Examples include, but are not limited to, flexible Bayesian approaches (e.g. BUGS, Stan), frequentist and likelihood-based approaches (e.g. the package lme4) and machine learning methods. These software packages and programs afford the user greater control and flexibility in tailoring complex hierarchical models. However, this level of control and flexibility places a higher degree of responsibility on the user to evaluate the robustness of their statistical inference. To determine how often biologists are running model diagnostics on hierarchical models, we reviewed 50 papers published in 2021 in the journal Nature Ecology & Evolution, and we found that the majority of published papers did not report any validation of their hierarchical models, making it difficult for the reader to assess the robustness of their inference. This lack of reporting likely stems from a lack of standardized guidance for best practices and standard methods. Here, we provide a guide to understanding and validating complex models using data simulations. To determine how often biologists use data simulation techniques, we also reviewed 50 papers published in 2021 in the journal Methods in Ecology and Evolution. We found that 78% of the papers that proposed a new estimation technique, package or model used simulations or generated data in some capacity (18 of 23 papers), but very few of those papers (5 of 23 papers) included either a demonstration that the code could recover realistic estimates for a dataset with known parameters or a demonstration of the statistical properties of the approach. To distil the variety of simulation techniques and their uses, we provide a taxonomy of simulation studies based on the intended inference. We also encourage authors to include a basic validation study whenever novel statistical models are used, which, in general, is easy to implement.
Simulating data helps a researcher gain a deeper understanding of the models and their assumptions and establish the reliability of their estimation approaches. Wider adoption of data simulations by biologists can improve statistical inference, reliability and open science practices.
UR - http://www.scopus.com/inward/record.url?scp=85142241114&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85142241114&partnerID=8YFLogxK
U2 - 10.1111/2041-210X.14030
DO - 10.1111/2041-210X.14030
M3 - Article
AN - SCOPUS:85142241114
SN - 2041-210X
VL - 14
SP - 203
EP - 217
JO - Methods in Ecology and Evolution
JF - Methods in Ecology and Evolution
IS - 1
ER -