Sequencing is an experimental wet-lab technique for obtaining information about the genome of an organism. The computational problem of genome assembly is to reconstruct the full sequence of the genome from sequencing data. Genome assembly faces several major challenges, including scalability, accuracy, and adaptability to new technology. To tackle these challenges, this project will develop new algorithms for assembly and explore their theoretical foundations. The developed algorithms and theory will make assembly tools more scalable and accurate and will enable assembly of data from emerging technologies. They will enable previously impractical assembly projects and allow biologists to perform assembly without needing expensive hardware. Genome assembly is becoming an increasingly important step in tackling some of our major societal challenges. For example, the study of plant genomes gives insights into sources of renewable energy and helps identify the genome characteristics that confer parasite resistance. In epidemiology, the variability between the genomes of different pathogen strains or species can help pinpoint the geographic origin of a disease. Improving the quality of the human genome assembly will also help achieve the goals of the BRAIN Initiative Grand Challenge, by improving our ability to detect variations driving genetic disorders such as Alzheimer's, schizophrenia, autism, and epilepsy. This project will make strides toward tackling these major societal challenges by improving our ability to assemble the corresponding genomes. Because genome assembly is growing in prominence, it will be important to educate the public, students, and other researchers about it. A series of expository articles will be published in a general-audience journal to educate the public about the role of genome assembly in societal issues, such as medical treatment and privacy.
Bioinformatics education will be strengthened through K-12 and industry outreach, development and broad dissemination of teaching modules, and development of a new graduate course at Penn State. This project will increase diversity through recruitment of underrepresented groups and engagement of undergraduate students. All educational and research outcomes will be made freely available to the public, and the software will be developed as open source.
Genome assembly faces several major challenges, including scalability, accuracy, and adaptability to new technology. A deeper understanding of the theoretical principles underlying assembly has the potential to help address all of these challenges. The main research goal of this proposal is to develop the theory, algorithms, and software to tackle them. The proposal will develop new algorithms and tools for assembly and explore their theoretical foundations. The subgoals are to develop a scalable assembler, to develop a modular assembly framework, to create predictive models for guiding experimental design, to characterize the relationship between string graphs and de Bruijn graphs, and to characterize the structure of sequencing overlap graphs. The developed tools will be applied to biological data. Techniques from the theory of hash functions, I/O and parallel optimization, graph theory, and statistics will be used. The wide range of phenotypic diversity observed across the phylogenetic spectrum is largely attributed to the differences between species' genomes. In the study of many species, assembling a reference genome offers tremendous biological insight and is a crucial step toward understanding their genetic, functional, and evolutionary aspects. The methods and theory developed as part of this proposal will make assembly tools more scalable and accurate and will enable assembly of data from emerging technologies. They will enable previously impractical assembly projects and allow biologists to perform assembly without needing expensive hardware.
Effective start/end date: 2/1/15 → 1/31/22
- National Science Foundation: $549,010.00