9 Aug, 2019

What is imputation?

Learn about what imputation is, how it works and what we use it for.

The human genome has about 6.2 billion letters of DNA. Whole genome sequencing can read nearly all 6.2 billion bases, but it is still expensive - about $600 for consumers, and $300-400 at large research centres. In contrast, genotyping arrays test only a small number of sites (between 500,000 and 2 million for most arrays, or 0.2% of the genome) but are much less expensive. This is the kind of technology used by nearly all major direct-to-consumer genetic testing companies, including 23andMe, AncestryDNA, and MyHeritage.

Because genotyping arrays test so few sites - on average a few letters in every thousand - and focus only on the positions where humans tend to differ from one another, they miss a lot of information. A statistical technique called imputation can be used to ‘fill in the blanks’ for the letters in between. Imputation provides a ‘free upgrade’ to genotyping data, but it is far from perfect. It is unreliable for rare genetic variants, so it will not transform a genotyping test into a whole genome sequence, but it is still a valuable tool.

Our imputation process at Sano accepts files from 23andMe, AncestryDNA, MyHeritage, or other providers (usually between half a million and 1 million SNPs, the single-letter sites tested by an array) and applies an algorithm called ‘EAGLE2’, which adds a further 30 million SNPs.

How does imputation work?

There are a number of different imputation algorithms, and all of them require two main things:

  • Data from a large number of whole genome sequenced individuals (often called ‘reference panels’)
  • Genotype array data from an individual to be imputed

The imputation algorithm compares the SNPs in the genotype file to the large set of whole genome sequenced individuals and searches for matching segments. These shared segments are called ‘haplotypes’, and the sequenced letters within a matching haplotype can be used to ‘fill in the blanks’ between the genotyped SNPs.

Figure adapted from Li et al., 2009 (https://csg.sph.umich.edu/abecasis/publications/pdf/Annual.Review.Genomics.Hum.Genet.vol.10-pp.387.pdf)
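The matching idea can be sketched in a few lines of Python. This is a deliberately simplified toy, not how EAGLE2 or any production tool works: real algorithms use statistical models that can stitch together pieces of several reference haplotypes, whereas this sketch copies from the single closest one. The panel, the observed genotypes and the function name `impute` are all made up for illustration.

```python
from typing import List, Optional

# Hypothetical toy reference panel: each row is one sequenced haplotype,
# each column is a site (0 = reference allele, 1 = alternate allele).
REFERENCE_PANEL: List[List[int]] = [
    [0, 1, 0, 0, 1, 1, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1, 0, 1],
]


def impute(observed: List[Optional[int]], panel: List[List[int]]) -> List[int]:
    """Fill untyped sites (None) by copying the closest reference haplotype."""

    def mismatches(reference: List[int]) -> int:
        # Count disagreements at the sites the array actually typed.
        return sum(
            1 for o, r in zip(observed, reference) if o is not None and o != r
        )

    best = min(panel, key=mismatches)
    # Keep typed alleles as they are; borrow the rest from the best match.
    return [r if o is None else o for o, r in zip(observed, best)]


# The "array" typed sites 0, 3 and 6 only; everything else is a blank.
observed = [1, None, None, 1, None, None, 0]
print(impute(observed, REFERENCE_PANEL))
```

Note that the sketch can only ever return alleles that already exist somewhere in the panel, which is the same reason real imputation struggles with rare variants.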

This process is not perfect. The individual whose genotype is being imputed may have rare genetic variants that are missing from the reference pool, which will not be picked up. As a result, imputation is more reliable for detecting common variation, and the accuracy for rare variation improves as the size of the pool of whole genome sequenced individuals increases. As genotyping tests and imputation can have a high error rate for rare genetic variants, caution should be taken when interpreting rare genetic variants from this kind of test.

Likewise, imputation does not perform as well for people from ethnicities that are underrepresented in the reference panels. While projects like the 1000 Genomes Project have made efforts to sample a wide range of world populations, the vast majority of research projects are done on people of European ancestry. Population-specific reference panels, as well as population-specific genotype arrays, are needed to improve imputation quality and prevent bias.

What do we use imputation for?

Identifying and characterizing the genes that impact human traits and genetic conditions, including depression, type 2 diabetes, and breast cancer, is one of the central objectives of human genetics studies. Genome-wide association studies are one of the major tools used by human genetics researchers, and they often require tens of thousands or even millions of participants. Using imputation, researchers can evaluate and compare data from different providers or genotype chip versions in a more standardised format. For example, many studies use data from the UK Biobank, which uses a chip made by the company Affymetrix, as well as data from 23andMe, which has used several different chips made by the company Illumina over the past decade.

An emerging technique called Polygenic Risk Scores (PRS) or Genomic Risk Scores (GRS) can also calculate risk for common genetic conditions, including coronary artery disease, breast cancer, and depression, on an individual level. We have covered polygenic risk scores in a few of our podcasts and blog posts, including with Lasse Folkersen (which also discusses imputation), Joe Pickrell of Gencove, and Cathryn Lewis from King's College London.
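At its simplest, a polygenic risk score is a weighted sum: each SNP's allele count (its ‘dosage’, between 0 and 2, which can be fractional when the SNP was imputed) is multiplied by an effect size taken from a genome-wide association study, and the products are added up. The sketch below illustrates only that arithmetic; the SNP names, weights and dosages are invented, and real scores involve many more SNPs and further calibration steps.

```python
from typing import Dict


def polygenic_score(dosages: Dict[str, float], weights: Dict[str, float]) -> float:
    """Sum effect size x allele dosage over SNPs that have a weight."""
    return sum(
        weights[snp] * dosage for snp, dosage in dosages.items() if snp in weights
    )


# Hypothetical effect sizes from a published association study.
weights = {"snp_a": 0.5, "snp_b": -0.25, "snp_c": 0.125}

# Hypothetical dosages for one person; "snp_d" has no weight and is ignored.
dosages = {"snp_a": 2.0, "snp_b": 1.0, "snp_c": 0.0, "snp_d": 1.0}

print(polygenic_score(dosages, weights))
```

Imputation matters here because a score can only use SNPs that are present in the data: filling in the sites an array did not test lets more of the weighted SNPs contribute.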

What does the future hold?

Genotyping and imputation are not a full replacement for whole genome sequencing, and as the price of whole genome sequencing continues to drop, we will see it adopted more widely. However, more sequenced genomes will also improve the quality of reference panels, meaning the quality of imputation for rare variants will continue to improve.

Some studies have suggested that a large fraction of disease risk is hiding in rare genetic variants that have not been easily detected, but the jury is still out.

As a result, many will opt to continue using genotyping chips and imputation for years to come for genetic genealogy and large-scale studies of common diseases and traits. Others, including the UK National Health Service, have made a big bet on whole genome sequencing. Finally, others may take the more pragmatic ‘middle ground’ of exome sequencing (which covers about 2% of the genome in great detail) coupled with genotyping and imputation, which cover the remaining 98% ‘just enough’.
