Collaborative Research: Combining Heterogeneous Data Sources to Identify Genetic Modifiers of Diseases

Project: Research project

Project Details


The study of human genetics is one of the most important approaches to understanding the causes of human disease and finding treatments for it. Genome-wide association studies (GWAS) have been used to identify regions of the genome associated with common diseases and disorders. However, the genetic variants identified through these approaches usually explain only a small fraction of the known heritability of those diseases, and GWAS have a poor record of finding disease-causing variants. This project will develop tools to combine GWAS with other sources of data, such as family-based genetic studies that identify important rare variants and transcriptomic or proteomic studies that capture gene expression signatures in disease, to find genetic modifiers that would be entirely missed using GWAS alone.

This project will identify genes involved in disease progression by combining information across different experimental types. The fundamental building block of the family of statistical models that will be employed is a hierarchical three-group mixture of distributions. Each gene is modeled probabilistically as belonging to one of three groups: a null group that is unassociated with disease progression, a deleterious group that is associated with negative disease outcomes, or a beneficial group that is associated with positive disease outcomes. This three-group formalism has two key features. First, by apportioning prior probability of group assignments with a Dirichlet distribution, the resultant posterior group probabilities automatically account for the multiplicity inherent in analyzing many genes simultaneously. Second, by building models for experimental outcomes conditionally on the group labels, any number of data modalities may be combined in a single coherent probability model, allowing information sharing across experiment types. These two features result in parsimonious inference with few false positives, while simultaneously enhancing power to detect signals.

The model disease for applying the combined analysis approach will be Parkinson's Disease (PD). Genomic sequences from PD patients and controls will be jointly analyzed along with transcriptomic data from public sources and targeted single nucleotide polymorphism (SNP) array data. In addition, a powerful imaging approach called robotic microscopy (RM) will be used to functionally evaluate the predictions of the statistical model, thereby providing experimental feedback to the model. Using RM and human neurons derived from PD patient induced pluripotent stem cells (i-neurons), levels of genes predicted to be beneficial or deleterious will be modulated in the PD i-neurons, and mitigation or exacerbation of disease phenotypes will be quantified to validate or invalidate predictions of the statistical model.
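As a rough illustration of the three-group idea described above, the following minimal sketch computes posterior group probabilities for a single gene from scores in several data modalities. It is not the project's model. It assumes unit-variance normal score distributions with a hypothetical `shift` parameter separating the groups, and it plugs in the mean of a Dirichlet prior rather than integrating over it. All function and variable names are illustrative.

```python
import math

GROUPS = ("null", "deleterious", "beneficial")


def normal_logpdf(x, mean, sd=1.0):
    """Log density of a normal distribution at x."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd * math.sqrt(2.0 * math.pi))


def dirichlet_mean(alpha):
    """Mean of a Dirichlet(alpha) distribution, used here as a plug-in prior."""
    total = sum(alpha)
    return [a / total for a in alpha]


def posterior_group_probs(scores, prior, shift=2.0):
    """Posterior probability of each group for one gene.

    scores: one z-like score per data modality (assumed conditionally
            independent given the group label, so log-likelihoods add).
    prior:  prior probabilities for (null, deleterious, beneficial).
    shift:  assumed mean score for the deleterious (+) and beneficial (-) groups.
    """
    means = {"null": 0.0, "deleterious": shift, "beneficial": -shift}
    log_post = []
    for group, p in zip(GROUPS, prior):
        log_lik = sum(normal_logpdf(s, means[group]) for s in scores)
        log_post.append(math.log(p) + log_lik)
    # Normalize in log space for numerical stability.
    top = max(log_post)
    unnorm = [math.exp(v - top) for v in log_post]
    total = sum(unnorm)
    return {g: u / total for g, u in zip(GROUPS, unnorm)}


# A prior that places most mass on the null group (assumed values).
prior = dirichlet_mean([18.0, 1.0, 1.0])

# Two hypothetical genes, each scored in two modalities.
gene_a = posterior_group_probs([2.5, 2.1], prior)   # consistent positive signal
gene_b = posterior_group_probs([0.3, -0.2], prior)  # no clear signal
```

Because the prior concentrates mass on the null group, a gene needs consistent evidence across modalities before its posterior moves toward the deleterious or beneficial group, which is the intuition behind the multiplicity control described in the abstract.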
The strategy of combining genomic, transcriptomic, phenotypic, and potentially other sources of information using the three-groups framework can be applied to any heritable disease with multiple data types available. The analytical approach in this project will help identify which genes are likely to play a role in pathogenesis, resulting in therapeutic targets and potentially individualized 'precision medicine'. This could lead directly to treatments for PD, and in addition could provide a useful set of tools for other researchers to pursue therapies for other heritable diseases.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Effective start/end date: 8/1/18 to 7/31/23


  • National Science Foundation: $250,000.00

