Join us to hear from Janssen. Some of the most exciting biological datasets in recent years have originated from large-scale biobanking efforts. These high-dimensional datasets include electronic medical records, imaging, and genomic profiles from hundreds of thousands of individuals, significantly increasing the power to understand the risk factors and genetic basis of disease. However, working with petabyte-scale datasets efficiently and scalably is non-trivial: generating, storing, and analyzing them requires careful planning. In this seminar, we outline some of the common architecture considerations for working with biobank data on-prem and in the cloud. We illustrate these with examples of real-world workflows using whole genome sequencing data from approximately 150,000 individuals in the UK Biobank. Specifically, we describe a genotype filtration technique for variant selection, which is straightforward with a modest number of samples but computationally challenging at the scale of biobanks, and a bioinformatic approach for interrogating regions of the genome where duplications hinder the mapping of short-read sequencing data. We take advantage of the San Diego Supercomputer Center’s Expanse cluster to prototype and execute analyses at scale, leading to insights that support drug development and enable discoveries that improve human health.
Speakers:
Brice A. J. Sarver, Senior Scientist, Computational Genomics, Janssen R&D
Hussein Hijazi, Scientist, Computational Genomics, Janssen R&D