
Podcast recap: Andrea Ganna on applying polygenic scores and EHRs in healthcare

Written by Sano Marketing Team | Mar 24, 2026 1:48:16 PM

On the latest episode of The Genetics Podcast, Patrick welcomed Dr. Andrea Ganna, an Associate Professor at the Institute for Molecular Medicine Finland (FIMM), part of HiLIFE at the University of Helsinki; an Associated Faculty member at the ELLIS Institute Finland; and a Research Associate at Massachusetts General Hospital and Harvard Medical School. They discussed a shift that many in biomedical R&D are now grappling with: after years of work in large-scale genetics, polygenic scores, and biobanks, Andrea's focus is expanding toward electronic health records (EHRs) and foundation models built on health system data.

The conversation covered where polygenic scores are useful, where they fall short in clinical trials, and why EHR data may become a more important layer for predicting disease trajectories, treatment patterns, and healthcare utilization. For teams working in genomics, drug development, and real-world data, the episode offered a practical view of how these data types may fit together.

Shifting from polygenic scores toward EHR-based models

Andrea explained that genetics remains uniquely valuable because it captures information that is largely independent of environment. In his view, this makes genetics one of the few truly complementary data layers in medicine.

On the other hand, he argued that the clinical value of polygenic scores depends heavily on the question being asked. For disease prediction and screening, polygenic scores can work well. For treatment response and disease progression, the picture is less convincing.

As Andrea put it, “when we think instead about genetic use in a clinical setting, which is about treatment response and progression, there is where we see that maybe the value is not as large as we might have hoped.”

That limitation is one reason EHR data has become more central to his research. EHRs capture the actual sequence of care over time, including diagnoses, medications, comorbidities, and utilization patterns. That makes them especially relevant for questions about what happens after diagnosis, how patients move through treatment, and how healthcare systems respond.

What polygenic scores can and cannot do in clinical trials

Andrea made a valuable distinction between two different uses of polygenic scores in trials.

The first is prognostic enrichment, where a score helps identify people at higher risk of a trial outcome. The second is predictive enrichment, where a score identifies people more likely to benefit from a treatment.

Andrea’s view was that prognostic enrichment is often feasible when strong polygenic predictors already exist for the endpoint. Predictive enrichment is much harder.
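
To make the distinction concrete, here is a minimal sketch in Python on simulated data (the column names pgs, treated, and event are hypothetical, and the numbers are invented rather than taken from any study). Prognostic enrichment amounts to thresholding on the score before enrolment to raise the event rate, whereas predictive enrichment has to show that the treatment effect itself varies with the score, typically via an interaction test.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical planning dataset: one row per eligible participant, with a
# polygenic score ("pgs"), a treatment flag, and the trial outcome ("event").
rng = np.random.default_rng(0)
n = 5_000
df = pd.DataFrame({
    "pgs": rng.normal(size=n),
    "treated": rng.integers(0, 2, size=n),
})
# Simulated outcome: risk rises with the PGS, and the treatment effect
# deliberately does NOT depend on the PGS.
linpred = -2.0 + 0.6 * df["pgs"] - 0.4 * df["treated"]
df["event"] = rng.binomial(1, 1 / (1 + np.exp(-linpred)))

# Prognostic enrichment: recruit the top PGS quintile to raise the event rate
# (and hence power) without assuming anything about who responds to treatment.
enriched = df[df["pgs"] > df["pgs"].quantile(0.8)]
print("event rate, all vs enriched:",
      round(df["event"].mean(), 3), round(enriched["event"].mean(), 3))

# Predictive enrichment: needs evidence that the treatment effect itself varies
# with the score, i.e. a meaningful pgs:treated interaction.
fit = smf.logit("event ~ pgs * treated", data=df).fit(disp=0)
print(fit.params[["treated", "pgs:treated"]])
```

In this simulation the interaction term is null by construction, which is exactly the situation Andrea cautioned about: a score can be strongly prognostic for an endpoint without saying anything about who benefits from the drug.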

The main problem is that many trial endpoints are highly specific and do not map cleanly to the kinds of phenotypes available in large population biobanks. Trials often focus on biomarker decline, composite endpoints, or disease progression measures that have not been studied at the scale of genome-wide association studies (GWAS).

This is an important point for clinical development teams. It suggests that the bottleneck is not just model performance, but also phenotype availability and endpoint alignment between discovery datasets and real trial design.

Why EHR foundation models matter

Andrea described foundation models for healthcare as a conceptual shift away from predicting one binary outcome at a time and toward modeling a patient’s future sequence of events.

Rather than building a separate model for each endpoint, these systems aim to learn from the longitudinal structure of medical data itself. That includes both what happens next and when it happens.

This creates potential for modeling treatment trajectories, medication persistence, resource planning, and downstream healthcare use. Andrea argued that this may be more valuable for health systems than simply estimating whether someone will develop a disease. In this regard, the strongest near-term use cases may sit closer to operational prediction and care pathway modeling than to broad disease screening.
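
As a rough illustration of that framing (a sketch only, with made-up event codes, not the models discussed in the episode), a longitudinal record can be treated as an ordered sequence of coded events, and the learning task becomes predicting the next code, and the time gap to it, from the history so far:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Event:
    code: str   # a diagnosis or medication code
    day: int    # days since first recorded contact

# One hypothetical patient timeline; the codes are illustrative only.
timeline = [
    Event("type 2 diabetes", 0),
    Event("metformin", 14),
    Event("hypertension", 120),
    Event("chronic kidney disease", 800),
]

def next_event_examples(events: List[Event]) -> List[Tuple[List[str], str, int]]:
    """Turn one timeline into (history, next code, gap in days) examples,
    the basic supervision signal a sequence model learns from."""
    examples = []
    for i in range(1, len(events)):
        history = [e.code for e in events[:i]]
        gap = events[i].day - events[i - 1].day
        examples.append((history, events[i].code, gap))
    return examples

for history, nxt, gap in next_event_examples(timeline):
    print(history, "->", nxt, f"(+{gap} days)")
```

A foundation model trained on millions of such sequences is, in effect, answering that question for every future event at once, rather than fitting a separate classifier per endpoint.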

The role of causal knowledge in medical AI

One of the most important themes in the conversation was that observational data alone can teach the wrong lesson.

Andrea gave the example of statins. In observational data, statin use is associated with higher cardiovascular risk because statins are prescribed to high-risk patients. A model trained naively on that data can infer the wrong causal direction.
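
A toy simulation of this confounding by indication makes the point concrete (the numbers below are invented for illustration): when prescribing tracks underlying risk, the naive association between statin use and events comes out positive even though the simulated drug effect is protective, and only conditioning on the indication recovers the correct sign.

```python
import numpy as np
import statsmodels.api as sm

# Toy simulation of confounding by indication (illustrative numbers only):
# statins are prescribed mostly to higher-risk patients, and the simulated
# drug effect is protective.
rng = np.random.default_rng(1)
n = 50_000
baseline_risk = rng.normal(size=n)                        # underlying CV risk
p_statin = 1 / (1 + np.exp(-(2.0 * baseline_risk - 1)))   # prescribing by indication
statin = rng.binomial(1, p_statin)
linpred = -2.5 + 1.0 * baseline_risk - 0.5 * statin       # true effect: protective
event = rng.binomial(1, 1 / (1 + np.exp(-linpred)))

# Naive model (ignores the indication): statin use appears to increase risk.
naive = sm.Logit(event, sm.add_constant(statin)).fit(disp=0)
# Adjusted model (conditions on the confounder): the protective sign returns.
X = sm.add_constant(np.column_stack([statin, baseline_risk]))
adjusted = sm.Logit(event, X).fit(disp=0)
print("naive statin coefficient:   ", round(naive.params[1], 2))
print("adjusted statin coefficient:", round(adjusted.params[1], 2))
```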

This is where he sees an opportunity to combine observational data with external biomedical knowledge. That could include trial evidence, causal inference methods, Mendelian randomization, and language-model-based representations of medical knowledge.

This matters for anyone building AI on real-world data. Performance on next-event prediction is not the same as learning the right clinical intervention logic. Models may need explicit causal structure or external knowledge if they are going to support treatment decisions rather than only describe patterns.

What is holding healthcare foundation models back

Despite the excitement around medical AI, Andrea was clear that the limiting factors are not just algorithms.

He pointed first to legislation and data access. Large-scale health data reuse remains legally and operationally difficult, even in systems that are relatively advanced. In addition, the work of cleaning and harmonizing data is essential but often underfunded. Secure computing environments add another layer of complexity, especially when teams want to train or adapt modern AI models in restricted settings.

Even once a model is trained, there is another challenge: how to move it safely out of a secure environment without exposing sensitive information.

This is an important reality check. In healthcare AI, the core bottlenecks are often governance, infrastructure, and deployment, not only model architecture.

A useful lesson for proteomics and biomarker discovery

Andrea described recent work showing that removing the genetic component from protein measurements can actually improve disease prediction for many protein-disease pairs.

That is a striking result because it suggests many protein signals are not causal drivers of disease. Instead, they may reflect environment, confounding, or downstream disease processes.

This means a strong predictive signal does not necessarily imply mechanistic relevance. In the context of clinical research, it may still be useful for stratification or risk prediction, but it should be interpreted carefully in drug discovery.
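
To show the shape of that kind of analysis (a toy sketch with invented numbers, not the actual study Andrea described), one can residualize a protein level on a genetic score for that protein and compare how well the raw and residualized values discriminate cases from controls. When the disease-relevant signal is mostly non-genetic, the residual predicts at least as well, and in this simulation better.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

# Toy illustration (invented numbers): a protein level has a genetic part and an
# environmental part, but only the environmental part tracks disease liability.
rng = np.random.default_rng(2)
n = 20_000
genetic = rng.normal(size=n)            # e.g. a genetic score for this protein
environment = rng.normal(size=n)
protein = 0.8 * genetic + 0.6 * environment + rng.normal(scale=0.3, size=n)
disease = (environment + rng.normal(scale=0.5, size=n) > 1.0).astype(int)

# "Removing the genetic component": regress the protein on its genetic score
# and keep the residual, which captures the non-genetic variation.
ols = sm.OLS(protein, sm.add_constant(genetic)).fit()
residual = protein - ols.predict(sm.add_constant(genetic))

print("AUC, raw protein:      ", round(roc_auc_score(disease, protein), 3))
print("AUC, genetics removed: ", round(roc_auc_score(disease, residual), 3))
```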

Key takeaways for drug development

Several ideas from this episode are especially relevant for clinical development teams.

First, genetics can still add value in trials, but often more as a complementary tool than as a standalone enrichment strategy. Andrea described how genetics can support trial emulation by acting as a quality control layer, helping researchers detect imbalance or hidden confounding.

Second, real-world trial emulation remains promising, but sample size and comparator selection are major constraints. Even in very large biobanks, matching procedures can dramatically reduce usable cohorts.

Third, the most useful future models may combine genetics, EHR data, and causal methods rather than treating any one of these as sufficient on its own.

For sponsors thinking about decentralized trials, external controls, or post-approval evidence generation, that combination could become increasingly important.

Listen to the full episode below.