The proposed research harnesses parallelism to accelerate the pervasive bioinformatics workflow of detecting genetic variations. This workflow determines the genetic variants present in an individual, given DNA sequencing data. The variant detection workflow is an integral part of current genomic data analysis, and several studies have linked genetic variants to diseases. Typical instances of this workflow currently take several hours to multiple days to complete with state-of-the-art software, and current algorithms and software cannot benefit from even modest levels of hardware parallelism. Most prior approaches to parallelization and performance tuning of genomic data analysis pipelines have targeted computation, I/O, or network data transfer bottlenecks in isolation, and are consequently limited in the overall performance improvement they can achieve. This project targets end-to-end acceleration methodologies and uses emerging heterogeneous supercomputers to reduce workflow time-to-completion.
The project focuses on holistic methodologies to accelerate multiple components within the genetic variant detection workflow. It explores lightweight data reorganizations at multiple granularities to enhance locality, and it investigates co-tuning of compute, communication, and I/O tasks, locality-aware load balancing, and coordinated resource partitioning to exploit high-performance computing platforms. A key goal of the proposed research is to design domain-specific optimizations targeting the massive parallelism and scalability potential of current heterogeneous supercomputers, so that the developed techniques can be easily transferred and applied to dedicated academic clusters and commercial computational environments. Outreach efforts use recruiting workshops to attract undergraduate students to interdisciplinary graduate programs. Curriculum development activities emphasize cross-layer parallelism.