Dynamically scalable accessible analysis for next generation sequence data

Project: Research project

Description

DESCRIPTION (provided by applicant): Project Summary: Wide availability of "next-generation" sequencing (NGS) instruments has enabled any investigator, for a modest cost, to produce enormous amounts of DNA sequence data. However, working with these raw sequences presents significant problems for individual investigators, small labs, and core facilities. For an experimental group with no computational expertise, simply running a data analysis program is a barrier, let alone building a compute and data storage infrastructure capable of dealing with NGS data. Fortunately, a computational model - "cloud computing" - has recently emerged and is ideally suited to the analysis of large-scale sequence data. In this model, computation and storage exist as virtual resources, which can be dynamically allocated and released as needed. Importantly, cloud resources can provide storage and computation at far less cost than dedicated resources for certain use cases. However, formidable challenges need to be addressed to make these resources available to individual investigators. Specifically, although cloud computing provides a way to acquire computational resources on demand, the resources provided are either virtual machines on the Internet or specific programming libraries, neither of which experimentalists can use directly. Thus, a viable analysis solution needs to be accessible and deployable without informatics expertise; it must efficiently and automatically use dynamically scalable resources, while taking into account time and cost; and it must include appropriate analysis tools and easily support the addition of new tools as they emerge. We have previously developed a software system - Galaxy (http://galaxyproject.org) - that provides a robust framework for addressing these needs. Here we propose to significantly extend this framework to allow any experimentalist to perform large-scale NGS analyses utilizing the power of cloud computing infrastructure.
In particular, we will modify the existing Galaxy framework to run entirely within the cloud. We will adapt the way Galaxy schedules and executes jobs to make effective use of cloud-style resources. We will provide a mechanism for individual users to create and deploy custom Galaxy instances on a cloud through an entirely web-based interface. Finally, we will test our approach by applying the developed facilities to existing human re-sequencing data in order to uncover, on a very large scale, hidden patterns of mutations that cause human genetic disease.

PUBLIC HEALTH RELEVANCE: Project Narrative: Increasingly available and inexpensive high-throughput DNA sequencing holds great promise for biomedical research, but informatics challenges block the full realization of the potential of this transformative technology. In particular, progress is limited by the informatics and engineering expertise of biomedical researchers and by the availability of sufficient computational infrastructure to analyze these enormous datasets. This project will address these problems by bringing together Galaxy, a system for making complex computational analysis accessible and reproducible, with "cloud computing", an infrastructure model where computing resources are purchased on demand as needed. The combination will make it possible for investigators with no informatics expertise to perform data-intensive analysis using cloud resources.
Status: Finished
Effective start/end date: 9/25/09 - 7/31/12

Funding

  • National Institutes of Health: $734,945.00
  • National Institutes of Health: $780,798.00

Fingerprint

Cloud computing
Availability
Costs
DNA sequences
Computer programming
DNA
Throughput
Internet
Data storage equipment