Ancestry Prediction

Background:

The mixture deconvolution algorithm requires allele frequency data. While we’ve found that global allele frequencies generally perform quite well, using allele frequencies matched as closely as possible to the contributor’s population group can increase accuracy.

Because forensic samples are almost always from individuals of unknown ancestry, we’ve developed a method that uses the genotypes inferred by mixture deconvolution to predict the ancestry of each contributor with principal component analysis (PCA).

PCA uses the genotypes from individuals of known ancestry, in this case the 2,504 1000 Genomes samples, to create a statistical framework for predicting the ancestry of the unknown sample. PCA transforms high-dimensional data (here, the genomic data) into a lower-dimensional space in the form of principal components, or PCs. PC1 explains the largest share of the variance in the data, PC2 explains the second largest, and so forth. With genetic data, biogeographical ancestry drives the top PCs because it accounts for most of the variation in the data.
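To make the projection step concrete, here is a minimal sketch in Python with NumPy. It is not MixDeR's code: the function names `fit_pca` and `project`, the simulated toy genotypes, and the 0/1/2 allele-count coding are all assumptions made for illustration, and missing genotypes are not handled. It fits the PCs on a reference panel of known samples and then places an unknown sample in the same PC space.

```python
import numpy as np

def fit_pca(genotypes, n_components=3):
    """Fit PCA on a (samples x SNPs) matrix of 0/1/2 allele counts."""
    X = np.asarray(genotypes, dtype=float)
    mean = X.mean(axis=0)
    # SVD of the mean-centered matrix; rows of Vt are the PC axes,
    # ordered from largest to smallest singular value.
    _, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)  # fraction of variance per PC
    return mean, Vt[:n_components], explained[:n_components]

def project(sample, mean, components):
    """Place a sample in the PC space defined by the reference panel."""
    return (np.asarray(sample, dtype=float) - mean) @ components.T

# Toy reference panel: two simulated groups with different allele frequencies.
rng = np.random.default_rng(0)
n_snps = 50
freq_a = rng.uniform(0.05, 0.95, size=n_snps)
freq_b = rng.uniform(0.05, 0.95, size=n_snps)
group_a = rng.binomial(2, freq_a, size=(20, n_snps))
group_b = rng.binomial(2, freq_b, size=(20, n_snps))
reference = np.vstack([group_a, group_b])

mean, components, explained = fit_pca(reference)

# Simulated "unknown" sample drawn from group A's allele frequencies.
unknown = rng.binomial(2, freq_a)
unknown_pcs = project(unknown, mean, components)
print(unknown_pcs, explained)
```

The unknown sample is centered with the reference panel's mean rather than its own, so it lands in the same coordinate system as the reference samples and can be plotted alongside them.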

Plotting the PCs against each other (and coloring samples by population) makes the separation and clustering of populations visible, oftentimes down to the subpopulation level. The position of the unknown sample in the plot, together with some distance calculations, allows the user to assign a population to the sample, assuming it falls within a known cluster.

Generally, ancestry can be visualized by plotting PC1 vs. PC2. However, we found that adding the 3rd PC in a 3-dimensional space provides the best separation of the populations. MixDeR creates 3-D ancestry plots and saves them as a .html file. These interactive plots are more insightful than the standard 2-D PCA plots.

MixDeR calculates the Euclidean distance from the centroid of each superpopulation to the unknown sample and ranks the superpopulations by distance, with the smallest distance on top. Note that the top-ranked superpopulation does NOT always match the actual ancestry of the unknown; it is simply the closest population distance-wise. The user must consider the actual value of that distance and examine the accompanying PCA plot to determine the potential accuracy of the prediction. If the unknown sample falls outside of the top-ranked population’s cluster, it likely does not closely match that population. While this can reflect the quality of the sample and/or PCA, it can also occur because the unknown’s ancestry is not represented in the 1000 Genomes dataset. For example, Ashkenazi Jews are not included in the 1000 Genomes dataset and, given their genetic divergence from other European populations, would fall outside of the EUR superpopulation (at least that’s what we’ve found!). All of these factors must be considered when evaluating the ancestry prediction results.
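The centroid ranking described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not MixDeR's implementation: the helper names `centroids` and `rank_by_distance`, the labels `POP_A` and `POP_B`, and the hand-picked PC coordinates are all hypothetical.

```python
import math
from collections import defaultdict

def centroids(pcs, labels):
    """Mean PC coordinates of the reference samples in each population."""
    groups = defaultdict(list)
    for point, label in zip(pcs, labels):
        groups[label].append(point)
    return {
        label: tuple(sum(axis) / len(points) for axis in zip(*points))
        for label, points in groups.items()
    }

def rank_by_distance(unknown, population_centroids):
    """Populations sorted by Euclidean distance to the unknown, closest first."""
    distances = {
        label: math.dist(unknown, centroid)
        for label, centroid in population_centroids.items()
    }
    return sorted(distances.items(), key=lambda item: item[1])

# Hypothetical (PC1, PC2, PC3) coordinates for four reference samples.
reference_pcs = [
    (0.0, 0.0, 0.0), (2.0, 0.0, 0.0),      # POP_A, centroid (1, 0, 0)
    (10.0, 10.0, 0.0), (12.0, 10.0, 0.0),  # POP_B, centroid (11, 10, 0)
]
reference_labels = ["POP_A", "POP_A", "POP_B", "POP_B"]

unknown_pcs = (1.0, 3.0, 4.0)
ranking = rank_by_distance(unknown_pcs, centroids(reference_pcs, reference_labels))
for label, distance in ranking:
    print(f"{label}\t{distance:.3f}")
```

In this toy case POP_A ranks first at a distance of 5.0, yet both POP_A reference samples lie within 1.0 of their own centroid. The unknown is therefore closest to POP_A without falling inside its cluster, which is exactly the situation where the ranking alone should not be read as a prediction.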

MixDeR does not make its own ancestry determination, but provides the data and plots for the user to make the determination themselves. Analyzing samples of known ancestry is important for understanding the strengths and weaknesses of the method, and the developers strongly encourage users to perform validation studies before making any ancestry predictions.

Running the Ancestry Prediction tool

When the Shiny app is first launched, the ancestry prediction tool appears first. This tool can be skipped by checking the Skip Ancestry Prediction Step box.

The user must select whether they want to use the superpopulation groups (AFR, AMR, EAS, EUR, and SAS) and/or the subpopulation groups (e.g. Toscani in Italy, Esan in Nigeria, etc.). If both are selected, PCA plots (colored by either superpopulation or subpopulation) are created and centroid calculations are performed for both.

The ancestry prediction tool will first perform mixture deconvolution using the 1000 Genomes global allele frequency data and perform genotype filtering using the specified settings (allele 1/2 probability thresholds, # of SNP bins, the static/dynamic AT, minimum # of SNPs).

A conditioned and/or unconditioned deconvolution can be performed. A sample manifest and a folder containing the mixture Sample Reports (as well as the reference Sample Reports, if performing a conditioned deconvolution) are required. Please see the section entitled “Running Mixture Deconvolution” for further information and guidance on these settings.

The user can select whether to use only the 94 ancestry SNPs or all autosomal SNPs for PCA (the deconvolution step is performed using all SNPs). While the 94 ancestry SNPs do a satisfactory job with good quality samples and are quite fast, using all SNPs provides better clustering but is significantly slower (several minutes per contributor). It is important that the user weighs the benefits and drawbacks of each set of SNPs to determine the best option for their data and system.