Implementing the Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL)

  • Nekrutenko, Anton (coPI)
  • Taylor, James Peter (PI)
  • Leek, Jeffrey T. (CoI)
  • Waldron, Levi David (CoI)
  • Carey, Vincent James (CoI)
  • Morgan, Martin (CoI)
  • Schatz, Michael (CoI)
  • Goecks, Jeremy (CoI)

Project: Research project


Project Summary: The NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL) will power the next generation of computational genomic research. We will develop the AnVIL environment using the leading national-scale cyberinfrastructure as the foundation supporting the most widely used analysis environments and frameworks vetted by NHGRI researchers. Our user-centered solution for data access, analysis, and visualization will enable investigators across all levels of expertise to fully utilize genomic datasets using environments they are already familiar with, leveraging well-engineered and optimized scientific computing infrastructure for greater efficiency and lower costs.
Aim 1: Engineer the AnVIL Data and Compute Platform. We will leverage the TACC Science Cloud and the Agave Science-as-a-Service platform to deploy a cloud-based environment supporting the data storage, access, and compute needs of the NHGRI research community.
Aim 2: Develop APIs for Data and Compute Access. To maximize the domain-wide impact of AnVIL, we will draw on community efforts and our own collective experience supporting diverse genomic analyses to define access standards and to design and implement AnVIL APIs.
Aim 3: Build an AnVIL metaportal integrating widely used analysis platforms. We will create a single metaportal residing within TACC's Science Cloud providing a unified view of users' data and activities, provenance and billing, and access to several of the most widely used workbenches for genomic research. These workbenches include Bioconductor, Galaxy, the Genome Modeling System, Jupyter, and RStudio. The metaportal will also provide access to the most popular genomic visualization tools.
Aim 4: Develop novel data aggregation, indexing, and query schemes to increase analysis efficiency and reduce cost. We will build approaches, including indexing and pre-computation of key statistics, to make better use of existing (e.g., TCGA, GTEx) and future large datasets, with the goal of increasing data utility and decreasing the cost of posing scientific queries against massive datasets.
Aim 5: Develop training and outreach infrastructure and materials. We will build support for training directly in the AnVIL platform, including tight coupling to MOOC-style courses, self-directed training materials, and support materials for conducting online and in-person training workshops.
Aim 6: Engage in effective project governance and assessment. We will establish a leadership and management structure involving key stakeholders from NHGRI, including program staff and the NHGRI-appointed Data Steering Committee and External Advisory Committee.
The key innovation of this work is our leveraging of existing hardware, software, and human resources to create a practical and pragmatic solution to the challenge of building the AnVIL.
Effective start/end date: 9/21/18 to 6/30/23


  • National Institutes of Health: $2,402,384.00


Application programming interfaces (API)
Space platforms
Natural sciences computing
Technical presentations
Data storage equipment