Podcast recap: Jonathan Marchini on scaling statistical genetics from HapMap to millions of exomes

The Genetics Podcast featuring Jonathan Marchini

On the latest episode of The Genetics Podcast, Patrick sat down with Jonathan Marchini, Head of Statistical Genetics & Machine Learning at the Regeneron Genetics Center (RGC). Before joining Regeneron, Jonathan spent two decades at the University of Oxford, helping shape the tools and collaborations behind HapMap, the 1000 Genomes Project, UK Biobank, and more. This conversation traces his path from teaching math in rural Tanzania to building large-scale methods that power today’s drug discovery.

An unconventional path into genomics

Jonathan’s route to genetics was not linear. After completing a degree in mathematics and statistics, he spent three years teaching A-level math in rural Tanzania, an experience he describes as life-changing. He then earned a PhD in statistics at Oxford, where he worked on brain imaging. With the Human Genome Project complete and technology accelerating, he pivoted into genomics.

HapMap and the birth of imputation

Early genetics consortia answered basic questions the field did not yet know how to ask at scale. HapMap provided one of the first high-resolution maps of common human genetic variation and enabled fine-scale estimates of recombination rates, later extended by projects like 1000 Genomes. Practical insights emerged, such as the ability to predict untyped variants from local linkage disequilibrium (LD) patterns in a reference panel. That foundational idea, imputation, helped unlock GWAS chips that became powerful, affordable, and globally scalable.
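To make the idea concrete, here is a toy sketch of LD-based imputation. It is purely illustrative rather than the production approach (real imputation tools use probabilistic haplotype models over large reference panels); the panel, sites, and matching rule below are invented for the example.

```python
import numpy as np

# Toy reference panel: 6 haplotypes x 4 biallelic sites (0 = ref, 1 = alt).
# Site 2 is the "untyped" site we want to impute.
reference = np.array([
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
    [1, 0, 0, 0],
])

typed_sites = [0, 1, 3]   # sites present on the genotyping array
untyped_site = 2          # site to fill in from local LD

# Observed study haplotype at the typed sites only.
observed = np.array([0, 1, 0])

# Score each reference haplotype by how well it matches at the typed sites,
# then average the untyped allele over the best-matching haplotypes.
matches = (reference[:, typed_sites] == observed).sum(axis=1)
best = reference[matches == matches.max()]
dosage = best[:, untyped_site].mean()

print(f"Imputed alternate-allele dosage at the untyped site: {dosage:.2f}")
```

Because nearby variants travel together on haplotypes, the alleles a sample carries at typed sites narrow down which reference haplotypes it resembles, and those haplotypes supply the missing genotype.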

As these tools spread, another challenge became clear. Breakthroughs only matter if researchers can actually apply them. Jonathan emphasizes that methods must be paired with accessible, well-engineered software. Great papers are not enough without robust, usable code, and even today the field still underinvests in product-grade engineering, which slows the translation from idea to insight.

Scaling to millions

At Regeneron, the challenge is scale. Millions of exomes and tens of thousands of phenotypes across health systems and biobanks cannot be handled efficiently with traditional linear mixed models. Jonathan’s team developed more efficient approaches, such as methods for polygenic conditioning that run in low memory, reuse computation across phenotypes, and support burden tests whose variant annotations can be swapped quickly. They also created REME, a summary-statistics meta-analysis engine that uses LD reference panels to avoid re-running full analyses when annotations change. Together, these innovations make it possible to run industrial-scale association studies at a much faster pace.
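For readers unfamiliar with burden testing, the sketch below shows the core idea on simulated data. It is not the RGC’s tooling; the cohort, the pLoF annotation mask, and the effect size are all made up, and the test here is a plain linear regression rather than the mixed-model machinery used in practice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy data: 500 individuals x 8 rare variants in one gene (0/1/2 allele counts),
# plus a per-variant annotation flag (e.g. predicted loss-of-function).
n, m = 500, 8
genotypes = rng.binomial(2, 0.01, size=(n, m))
is_plof = np.array([1, 1, 0, 1, 0, 0, 1, 0], dtype=bool)

# Burden "mask": collapse qualifying variants into one per-person score.
# Swapping the annotation changes only this step, not the test itself.
burden = genotypes[:, is_plof].sum(axis=1)

# Simulated quantitative phenotype with a small burden effect.
phenotype = 0.5 * burden + rng.normal(size=n)

# Simple burden test: regress the phenotype on the collapsed score.
result = stats.linregress(burden, phenotype)
print(f"beta = {result.slope:.3f}, p = {result.pvalue:.2e}")
```

Collapsing rare variants this way is what makes annotations so important: redefining which variants qualify changes only the mask, which is why fast-swappable annotations matter at scale.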

RGC’s million-exome analysis showcased rare coding variation at unprecedented resolution. Roughly 5,000 genes had at least one human knockout, while around 4,000 genes appeared depleted of predicted loss-of-function variants, suggesting essentiality. About half of all predicted loss-of-function variants were singletons, only recoverable through sequencing rather than imputation. Since then, RGC has continued to scale and is now approaching three million exomes with collaborations that pair sequencing capacity with deep phenotypes from healthcare systems and biobanks.

Exome plus imputation versus whole genome sequencing

For common and low-frequency coding signals, Jonathan argues that exome plus array plus imputation delivers close to the same yield as whole genomes but at a fraction of the cost. That efficiency allows sequencing of three to four times more samples, which usually wins on statistical power. Whole genome sequencing adds very rare non-coding variants, but these are harder to interpret, and large-scale studies have so far produced relatively few compelling associations in that space. For rare disease diagnostics, WGS remains compelling, but for target discovery, exome-first still looks like the best investment.
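The power argument behind "three to four times more samples usually wins" can be seen with a back-of-the-envelope calculation. In the sketch below, the per-sample signal strength and the significance threshold are assumed values chosen for illustration; the point is that the non-centrality of a one-degree-of-freedom association test grows roughly linearly with sample size, so tripling the cohort can move a variant from essentially undetectable to better-than-even odds of reaching genome-wide significance.

```python
from scipy import stats

def power(ncp, alpha=5e-8, df=1):
    # Power of a 1-df chi-square association test at the given threshold.
    crit = stats.chi2.ppf(1 - alpha, df)
    return 1 - stats.ncx2.cdf(crit, df, ncp)

ncp_per_sample = 1e-4  # assumed per-sample contribution for a weak signal
for n in (100_000, 300_000):
    print(f"N = {n:>7}: power ~ {power(ncp_per_sample * n):.3f}")
```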

Polygenic risk scores

Polygenic risk scores (PRS) can clearly stratify risk, but broad clinical adoption will be gradual. Jonathan sees immediate value in clinical trial design, selecting higher-risk individuals to reduce trial size and duration, and examining whether treatment effects differ across PRS strata. He expects effects to be graded across the PRS spectrum rather than confined to the highest-risk groups, and believes the biggest wins will come from disease-specific applications rather than broad population-wide deployment.
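A quick simulation illustrates why PRS enrichment helps trial design. The baseline risk and PRS effect below are invented for the example; the takeaway is that recruiting from the top of the PRS distribution raises the event rate, which is what shrinks the required sample size or follow-up time.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy cohort: standardized PRS and a disease whose log-odds rise with PRS.
n = 200_000
prs = rng.normal(size=n)
base_rate, beta = 0.05, 0.5            # assumed baseline risk and PRS effect
logit = np.log(base_rate / (1 - base_rate)) + beta * prs
risk = 1 / (1 + np.exp(-logit))
disease = rng.random(n) < risk

overall = disease.mean()
top20 = disease[prs > np.quantile(prs, 0.8)].mean()
print(f"event rate overall: {overall:.3f}, top-20% PRS: {top20:.3f}")
# A higher event rate in the enriched group means fewer participants (or less
# follow-up) are needed to observe the same number of events.
```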

AI and machine learning

Deep learning often shines where the phenotype is complex and the mapping is nonlinear, for example extracting rich traits from MRI, CT, or retinal images. Jonathan’s team uses deep learning routinely for imaging and is experimenting with transfer learning to enhance protein variant annotation using large proteomics datasets. But for many complex traits, surprisingly simple linear models still perform just as well. The pragmatic takeaway, he suggests, is to start with the research question and data, then choose the simplest model that works, and reserve deep learning for problems that truly require it.

Looking forward, Jonathan is excited about longitudinal phenotyping and proteomics. Repeated imaging, richer clinical trajectories, and long-term follow-up will be essential for developing biomarkers that can make drug trials faster and more precise. At the same time, proteomics at population scale, such as the assays now rolling out in UK Biobank, could greatly improve variant interpretation and help connect statistical associations to biological mechanisms.

One of the clearest examples of this approach is Regeneron’s work on GPR75, where predicted loss-of-function variants are associated with roughly two units lower BMI and about 50% lower odds of obesity. That finding has already seeded a drug program. For Jonathan, it is a reminder that coding variation with clear biology remains a powerful engine for target discovery, and that well-engineered methods plus very large cohorts can turn statistical signals into therapeutic hypotheses.

Listen to the full episode below.

Get in touch