Sano x BenevolentAI: Combining machine learning with medical data

Over the past year, Sano and BenevolentAI have collaborated to demonstrate how an integrated, decentralized platform (combining patient recruitment, at-home genetic testing, electronic consent, and medical record retrieval) can generate a linked genetic and clinical dataset at scale for a genetically complex indication such as ulcerative colitis (UC). The project was partly funded by a grant from the UK government's innovation agency.

The project relies on a fully decentralized study design, which removes the requirement for site visits and lets people take part regardless of how far they live from a clinical center.

What follows outlines the study design, outcomes from the first phase, and implications for scaling this approach across additional disease areas.

Key Takeaways

Collaboration Goal: Sano Genetics and BenevolentAI are linking genetic and medical data to accelerate drug discovery for ulcerative colitis (UC).
Decentralized Approach: The study uses a "real-world" design where patients participate from home, removing the need for travel to clinical sites.
Phase 1 Success: The project enrolled 619 participants and retrieved medical records for 399 of them, despite the fragmentation of U.S. medical records.
Data Integration: Researchers built a linked dataset of 272 participants who have both a completed genetic test and retrieved medical records.

What is ulcerative colitis?

UC is one of the most common types of inflammatory bowel disease (IBD), the other most common type being Crohn’s disease. UC affects the large intestine, causing ulcers to develop along the colon's lining and inflammation of the colon and rectum, with patients experiencing abdominal pain, rectal bleeding, and other clinically significant symptoms.

IBD has a large genetic component, with over 200 genetic risk factors identified to date. Some of the proteins encoded by these genes are involved in forming the epithelial lining of the colon, but most are part of the immune system, and much is still unknown about the pathogenesis of the disease. This level of genetic complexity, combined with limited understanding of disease progression, makes UC a strong candidate for machine learning approaches that can identify patterns across large, multi-modal datasets.

Why did Sano Genetics and BenevolentAI collaborate on this project?

In 2017, 6.8 million cases of IBD were recorded globally, according to the Global Burden of Disease analysis. The same analysis expects this burden to increase as IBD prevalence rises in newly industrialized countries.

A significant proportion of patients do not respond adequately to available treatments, and the mechanisms driving heterogeneity in disease course and treatment response remain poorly characterised.

A linked dataset integrating genetic, clinical, and patient-reported information provides the foundation needed to identify novel drug targets, characterise drivers of disease progression, and develop biomarkers predictive of treatment response.


What are the goals and current status of the research collaboration?

This research collaboration brings together BenevolentAI’s expertise in machine learning and IBD with Sano’s expertise in collecting and linking patient genetic and medical data.

The goal of the project is to generate a linked genetic and medical record database designed for use in machine-learning applications to support more targeted drug and biomarker discovery in UC. This aligns with BenevolentAI’s commitment to developing an oral, small-molecule treatment with disease-modifying efficacy and improved safety for patients with UC who do not respond to current standard-of-care options.

Total Enrollment: 619 participants
Medical Records Retrieved: 399 participants
Genetic Test Completion: 68% of those with retrieved records
Final Linked Dataset: 272 participants

The first phase of the study was conducted in the United States, where medical records are fragmented and often challenging to access. This fragmentation is one of the core obstacles to building datasets that are structured and complete enough for machine learning applications. Despite this, we retrieved electronic and paper medical records for 399 of the 619 enrolled participants, or about 64%.
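The percentages above follow directly from the reported counts. As a minimal sketch of that arithmetic, the snippet below recomputes the conversion rate at each stage from the three published figures (619, 399, and 272). The variable and function names are illustrative only and do not reflect how Sano tracks conversion internally.

```python
# Counts reported for the first phase of the study.
ENROLLED = 619
RECORDS_RETRIEVED = 399
LINKED = 272  # participants with both a genetic test and retrieved records


def rate(numerator: int, denominator: int) -> float:
    """Return numerator / denominator expressed as a percentage."""
    return 100.0 * numerator / denominator


# Share of enrolled participants whose medical records were retrieved.
retrieval_rate = rate(RECORDS_RETRIEVED, ENROLLED)
# Share of participants with retrieved records who also completed genetic testing.
genetic_completion_rate = rate(LINKED, RECORDS_RETRIEVED)
# Share of all enrolled participants who ended up in the linked dataset.
overall_linkage_rate = rate(LINKED, ENROLLED)

print(f"Records retrieved:       {retrieval_rate:.1f}% of enrolled")
print(f"Genetic test completed:  {genetic_completion_rate:.1f}% of those with records")
print(f"In final linked dataset: {overall_linkage_rate:.1f}% of enrolled")
```

The last figure is not stated in the text but is implied by it: fewer than half of enrolled participants reach the linked dataset, so each stage's conversion is worth tracking separately.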

How the study works from the participant’s perspective

The study was fully decentralized and run on Sano's online platform, through which all participants provided:

electronic consent
detailed information about their symptoms
medical history
permission for medical record linkage

In parallel with the medical record retrieval process, participants completed at-home saliva sample collection, with genotyping conducted centrally and residual DNA stored for future sequencing.

This testing step was embedded within the same platform workflow used for consent, prescreening, and engagement — ensuring continuity across the participant journey and maintaining Sano's visibility into conversion at each stage.
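Conceptually, the linked dataset keeps only participants who appear in both the genetic results and the retrieved medical records, which is an inner join on a participant identifier. The sketch below illustrates that idea under stated assumptions: the identifiers, file names, and record labels are invented placeholders, and `link_datasets` is a hypothetical helper, not part of Sano's platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkedParticipant:
    participant_id: str
    genotype_ref: str  # placeholder reference to genotyping output
    record_ref: str    # placeholder reference to retrieved medical records


def link_datasets(genetic: dict[str, str], records: dict[str, str]) -> list[LinkedParticipant]:
    """Keep only participants present in both sources (an inner join on ID)."""
    shared_ids = sorted(genetic.keys() & records.keys())
    return [LinkedParticipant(pid, genetic[pid], records[pid]) for pid in shared_ids]


# Invented placeholder data: P002 has records but no genetic result,
# so it is excluded from the linked dataset.
genetic = {"P001": "genotype-P001", "P003": "genotype-P003"}
records = {"P001": "records-P001", "P002": "records-P002", "P003": "records-P003"}

linked = link_datasets(genetic, records)
print([p.participant_id for p in linked])
```

Joining on a shared identifier rather than on names or dates of birth is a design assumption here; it keeps the linkage step independent of any personally identifying fields.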

Following the initial study phase, all enrolled participants remain active within the Sano platform. This creates a longitudinal cohort that can be updated over time through questionnaires and ongoing engagement, alerted to future research opportunities for which they are eligible, and recontacted compliantly as the programme expands or new studies are initiated. From a sponsor perspective, this transforms a single-study dataset into a durable patient asset — one that increases in analytical value as the cohort grows and follow-up data accrues.

Using machine learning for target discovery and patient stratification

In complex diseases like UC, where hundreds of genetic risk factors interact with clinical variables, identifying meaningful patterns requires analysis at a scale and speed beyond manual approaches. Machine learning is particularly suited to this challenge because it can detect subtle correlations across large, heterogeneous datasets that would otherwise remain hidden. BenevolentAI uses machine learning algorithms to integrate large volumes of scientific literature with patient-level data, such as genetic data and medical records. The aim is to identify potential new drug targets and opportunities for precision medicine approaches, such as patient stratification, across a wide variety of diseases.

By exploring large, well-annotated patient-level datasets in which clinical details are linked to genetic information, researchers can better understand the efficacy of different treatment regimens and identify patients who are likely to respond, or not respond, to a given treatment for a given disease.

BenevolentAI has worked closely with Sano to guide the collection, formatting and ingestion of study data so that it can be used efficiently in machine learning applications. As the database of enrolled patients grows, the models are expected to improve, since additional data can refine prediction accuracy and reveal new patterns. BenevolentAI plans to use this linked, genotyped cohort to investigate treatment response heterogeneity and define genetically or clinically distinct patient subgroups with differential outcomes.

Next steps

The successful completion of the first phase is a significant milestone for Sano Genetics and BenevolentAI. By uniting genetic and clinical data through a patient-focused, siteless study design, the project has produced a linked dataset structured for machine learning applications in precision medicine and genomics research. The framework underpinning this approach can be applied to a range of diseases beyond UC.

The decentralised model also expands geographic and demographic reach, reducing the access constraints that site-dependent studies impose. This has direct implications for dataset representativeness, which matters both for the validity of findings and for the generalisability of any identified biomarkers.

The first phase establishes a replicable model for building linked genetic and clinical datasets at scale in a decentralised setting. The next phase will expand participant reach, incorporate additional biomarker testing, and further develop the analytical infrastructure needed to support ongoing precision medicine research in UC. For sponsors with genetically stratified assets in IBD or adjacent indications, the operational framework and patient cohort built through this collaboration represent a foundation that can be extended rather than rebuilt.
