Unsupervised Discovery of Ancestry Informative Markers and Genetic Admixture Proportions in Biobank-Scale Data Sets

Author: Seyoon Ko, Postdoctoral Scholar and IDRE Fellow, Public Health – Biostatistics

This project develops a method for identifying ancestry-informative markers and improves the widely appreciated ADMIXTURE software by applying a modern clustering and feature-ranking method to scalable admixture proportion estimation. The scale of data in genetics has grown quickly in recent years, and the data sets people now work with are as large as 500,000 samples x 1,000,000 SNPs. At this scale, scalable algorithms and careful management of computational resources are essential to complete the analyses in a reasonable time. Estimating the admixture proportions of ancestries has been a difficult task in computational genetics at this scale. Our group developed a standard method for this type of estimation in 2009, and that work has received over 5,000 citations on Google Scholar. However, with the availability of massive data, the method now needs to work at an unprecedented scale. This project aims to select the most informative biomarkers with a method that performs feature ranking and clustering simultaneously, and then to use the result to estimate ancestry proportions with an improved implementation of ADMIXTURE in the Julia programming language.
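For readers unfamiliar with the estimation problem, the following is a minimal sketch of the binomial log-likelihood that admixture proportion estimation maximizes: each genotype count g_ij (0, 1, or 2 copies of an allele) is modeled as a binomial draw with success probability sum_k q_ik f_kj, where q_ik is the fraction of individual i's ancestry from population k and f_kj is the allele frequency at SNP j in population k. The function name, the list-of-lists layout, and the use of None for missing genotypes are assumptions made for illustration; this is plain Python, not the project's Julia implementation, and it makes no attempt at the scale discussed above.

```python
import math


def admixture_loglik(G, Q, F):
    """Binomial log-likelihood of genotypes under an admixture model.

    G[i][j]: allele count in {0, 1, 2}, or None if missing (assumed convention).
    Q[i][k]: ancestry fraction of individual i from population k (rows sum to 1).
    F[k][j]: allele frequency at SNP j in population k, strictly between 0 and 1.
    The constant binomial coefficient is omitted because it does not depend on Q or F.
    """
    n_pops = len(F)
    total = 0.0
    for i, row in enumerate(G):
        for j, g in enumerate(row):
            if g is None:  # skip missing genotypes
                continue
            # Allele frequency for this individual: ancestry-weighted mixture.
            p = sum(Q[i][k] * F[k][j] for k in range(n_pops))
            total += g * math.log(p) + (2 - g) * math.log(1.0 - p)
    return total


# Toy data: 2 individuals, 2 SNPs, 2 ancestral populations.
G = [[2, 0], [1, None]]
F = [[0.9, 0.1], [0.2, 0.8]]
Q_good = [[1.0, 0.0], [0.5, 0.5]]
Q_bad = [[0.0, 1.0], [0.5, 0.5]]

# The ancestry assignment that matches the genotypes scores higher.
print(admixture_loglik(G, Q_good, F) > admixture_loglik(G, Q_bad, F))
```

Every term in the double sum touches one genotype, so a single evaluation costs on the order of samples x SNPs x populations operations, which is why memory layout and parallelism matter at the data sizes mentioned above.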
The first objective was to implement the algorithm efficiently on a memory-efficient data format (SnpArrays.jl) using Advanced Vector Extensions (AVX) or graphics processing units (GPUs). On smaller-scale data, the result is already an order of magnitude faster than the original C++ version, thanks to advanced computational techniques such as recursive tiling based on a cache-oblivious algorithm. The next step is to make it work on larger-scale data in a reasonable time in distributed computing environments: Hoffman2 for simulated data, or virtual clusters on a cloud with special security measures for restricted data.
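The recursive tiling idea can be illustrated with a small sketch of a matrix product computed by repeatedly halving the largest dimension until the block is small. The function names and the base-case cutoff `base` are hypothetical; a strictly cache-oblivious formulation recurses without any tuned cutoff, and the cutoff here is only a practical coarsening. This Python version shows the recursion structure only: it will not reproduce the cache behavior or the speedups of the project's Julia code.

```python
def _matmul_rec(A, B, C, i0, i1, j0, j1, k0, k1, base):
    """Accumulate A[i0:i1, k0:k1] times B[k0:k1, j0:j1] into C[i0:i1, j0:j1]."""
    di, dj, dk = i1 - i0, j1 - j0, k1 - k0
    if max(di, dj, dk) <= base:
        # Base case: the block is small, so use plain triple loops.
        for i in range(i0, i1):
            row_a, row_c = A[i], C[i]
            for k in range(k0, k1):
                a, row_b = row_a[k], B[k]
                for j in range(j0, j1):
                    row_c[j] += a * row_b[j]
        return
    # Recursive case: split the largest dimension in half.
    if di >= dj and di >= dk:
        mid = i0 + di // 2
        _matmul_rec(A, B, C, i0, mid, j0, j1, k0, k1, base)
        _matmul_rec(A, B, C, mid, i1, j0, j1, k0, k1, base)
    elif dj >= dk:
        mid = j0 + dj // 2
        _matmul_rec(A, B, C, i0, i1, j0, mid, k0, k1, base)
        _matmul_rec(A, B, C, i0, i1, mid, j1, k0, k1, base)
    else:
        mid = k0 + dk // 2
        _matmul_rec(A, B, C, i0, i1, j0, j1, k0, mid, base)
        _matmul_rec(A, B, C, i0, i1, j0, j1, mid, k1, base)


def matmul_tiled(A, B, base=16):
    """Return the product of A (n x m) and B (m x p) using recursive tiling."""
    if base < 1:
        raise ValueError("base must be at least 1")
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    _matmul_rec(A, B, C, 0, n, 0, p, 0, m, base)
    return C


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
print(matmul_tiled(A, B, base=1))
```

Because the recursion halves whichever dimension is largest, the sub-blocks shrink toward roughly square tiles at every level, which is the property that lets this style of algorithm reuse data in cache without knowing the cache size in advance.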
This project is expected to be one of the earliest projects on Hoffman2 to use distributed computing with interprocess communication in Julia. It will set a precedent for this type of application.