Using Machine Learning to Predict Rare Diseases

Biobanks – databases with genetic and health information – offer researchers the ability to explore illnesses and study the contributions of genetics and environment to disease trajectory. These investigations have enabled us to draw conclusions about questions ranging from the relationship between diet and disease to the link between household size and COVID severity, yielding valuable insights to guide researchers, clinicians, and patients alike.

But biobanks are only as useful as the quantity and quality of the data in them. Incomplete information is often an issue in patient datasets, explains Stanford PhD student Lu Yang. “We might know the patient has been treated for type II diabetes, for example,” Yang says, “but if they have never been treated in the hospital in an in-patient setting, the term ‘type II diabetes’ may be missing from their data.” This missing information is a significant barrier for researchers who are conducting disease studies and looking for patterns that could lead to new breakthroughs.

To address this problem, Yang collaborated with recent Stanford postdoctoral scholar Sheng Wang and Russ Altman (Stanford HAI associate director and professor of bioengineering, genetics, medicine, biomedical data science, and, by courtesy, computer science) to predict a comprehensive set of diagnosis codes, also called phenotype codes, for all the patients in the UK Biobank. This biobank holds the data of half a million participants from the UK, including patients with rare diseases. The result is POPDx, a machine learning framework for disease recognition that, according to Yang, “produces probabilities that a person might have certain diseases or phenotype codes.”

In fact, POPDx outperforms existing models in predicting common and rare diseases, including diseases that aren’t present in the training data. This is a significant finding, according to Altman. “While most machine learning approaches that use deep neural networks require a ton of training, we were very pleased that our approach using prior knowledge like text and taxonomy allowed us to recognize some diseases in our test set, even though we had never seen them before in training. This is important because while there is substantial data in medicine, it is not at the same scale as large IT companies, and so it is critical that we develop methods that can work on sparse data, and work well enough to help patients with uncommon diseases.”

Real Data From Real Patients

When embarking on this research, Yang considered the prior work of second-author Wang on the classification of cells. In that research, Wang used the Cell Ontology to predict a single correct cell type for all of the cells in the test set. Yang wanted to take a similar approach for POPDx, but for diseases. “I thought it would be cool to similarly leverage the relationships of diseases in the Human Disease Ontology to address disease recognition.” While Wang’s research was a one-vs.-all classification problem where only one cell type was predicted, Yang needed multiple labels. “Each patient can have multiple diseases, so we addressed it as a multi-label, multi-classification type of problem,” she says.
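The difference between the two problem types can be sketched in a few lines of Python. This is only an illustration, not the POPDx code: the scores are made-up numbers, and the 0.5 threshold is an assumption chosen for the example. A single-label classifier normalizes scores so that exactly one label wins, while a multi-label classifier scores each label independently, so a patient can be assigned several diseases at once.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1 (single-label case)."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(score):
    """Turn one raw score into an independent probability (multi-label case)."""
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical raw model scores for three labels (illustrative values only).
scores = [2.0, 1.5, -3.0]

# Single-label: probabilities compete, and exactly one label is chosen.
single_label = softmax(scores)
predicted_index = max(range(len(scores)), key=lambda i: single_label[i])

# Multi-label: each label is judged on its own, so several can be chosen.
multi_label = [sigmoid(s) for s in scores]
predicted_set = [i for i, p in enumerate(multi_label) if p >= 0.5]  # assumed cutoff

print(predicted_index, predicted_set)
```

With these example scores, the single-label view picks only the first label, while the multi-label view keeps both the first and second.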

Another key difference in Yang’s work is the breadth of information she used. The POPDx model looks at a wealth of patient data, from demographic information and patient questionnaires to medical exams and EHR data. It even extracts information from physical data and lab tests. “Before this, most of the existing models needed well-curated datasets, which means they might not be able to look into the abundance of features that we are able to look into with our work,” she says. The large scale of Yang’s work directly translated to the wide range of disease codes the model could predict. “Usually research will be specific to a certain domain, like heart disease, so they’ll only look at that relevant information or codes. But for our study we tried to come up with a complete profile of the UK Biobank participants.”

Predicting Diseases Despite Small Datasets

The POPDx model works by looking for relationships between the patient’s data and disease information, using natural language processing and the Human Disease Ontology to make probabilistic decisions. “The biggest challenge for the model comes from diseases that we don’t see in the training or have little data for. As we know, most ML models rely on large datasets, but some of these diseases don’t have data,” says Yang.
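To make the idea concrete, here is a deliberately simplified sketch of how text and an ontology could let a model score a disease it has never seen in training. Everything in it is an assumption for illustration: the disease names, the three-number "embeddings," the parent links, the averaging rule, and the patient vector are all hypothetical, and the real POPDx architecture is not described in this article.

```python
import math

# Hypothetical text embeddings for disease names (made-up numbers).
text_embedding = {
    "diabetes mellitus": [0.9, 0.1, 0.0],
    "type 2 diabetes mellitus": [0.8, 0.3, 0.1],
    "rare monogenic diabetes": [0.7, 0.2, 0.4],  # imagine: no training patients
}

# Hypothetical ontology edges, child -> parent.
parent = {
    "type 2 diabetes mellitus": "diabetes mellitus",
    "rare monogenic diabetes": "diabetes mellitus",
}

def label_embedding(disease):
    """Blend a disease's own text embedding with its parent's (assumed rule)."""
    own = text_embedding[disease]
    if disease not in parent:
        return own
    inherited = text_embedding[parent[disease]]
    return [(a + b) / 2 for a, b in zip(own, inherited)]

def probability(patient_vector, disease):
    """Score patient against disease with a dot product, squashed to (0, 1)."""
    embedding = label_embedding(disease)
    score = sum(p * e for p, e in zip(patient_vector, embedding))
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical output of a patient encoder (illustrative values only).
patient_vector = [1.2, -0.4, 0.9]

probabilities = {d: probability(patient_vector, d) for d in text_embedding}
for disease, p in probabilities.items():
    print(f"{disease}: {p:.3f}")
```

The point of the sketch is that the rare disease still receives a probability, because its representation comes from its name and its place in the ontology rather than from labeled training patients.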

POPDx’s solid performance with limited or even no data is a key strength, reducing the need for huge datasets. Yang was able to improve the AUPRC (area under the precision-recall curve, a measure of how well the model ranks true cases above non-cases) for unseen and rare diseases by 218% and 151%, respectively. According to Yang, this means that if a clinical team needs to identify patients with a low-prevalence disease, “our model on average will increase the possibility of finding these positive cases. Before, they would have to go through a huge number of patients in the Biobank, but now they can screen a much lower number in order to find possible cases.” POPDx’s ability to recognize rare diseases provides a better starting point for clinicians and researchers looking to study those diseases.
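For readers who want to see what such a metric measures, the sketch below computes average precision, a common way to estimate AUPRC, and then a relative improvement between two models. The labels and scores are invented for illustration, the function names are hypothetical, and the study's exact evaluation procedure may differ. The sketch also assumes the reported percentages are relative increases, in which case a 218% improvement means the new value is roughly 3.18 times the old one.

```python
def average_precision(labels, scores):
    """Estimate AUPRC: average the precision at the rank of each true case."""
    positives = sum(labels)
    if positives == 0:
        raise ValueError("average precision needs at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` patients
    return total / positives

def relative_improvement_percent(old, new):
    """Percentage increase of `new` over `old`."""
    return (new - old) / old * 100.0

# Invented example: 8 patients, 2 of whom truly have a rare disease.
labels = [0, 1, 0, 0, 1, 0, 0, 0]
baseline_scores = [0.9, 0.3, 0.8, 0.7, 0.2, 0.6, 0.5, 0.4]  # true cases ranked last
improved_scores = [0.2, 0.9, 0.8, 0.3, 0.7, 0.1, 0.4, 0.5]  # true cases ranked near top

ap_baseline = average_precision(labels, baseline_scores)
ap_improved = average_precision(labels, improved_scores)
print(ap_baseline, ap_improved)
print(relative_improvement_percent(ap_baseline, ap_improved))
```

A higher value means true cases sit closer to the top of the ranked list, which is why a clinical team would need to screen fewer patients to find them.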

One challenge Yang noted was the demographic skew of the UK Biobank, which is 56% female and majority white, and has an average age of 71. But the lack of diversity in the biobank is related less to data than to broad healthcare access. “The problem is that if someone doesn’t have access to healthcare, we don’t have their data,” says Yang. The researchers addressed this concern by introducing background information about the hierarchy of diseases and the relationships between them, which gave the model a boost when dealing with unfamiliar diseases. Yang believes this strategy may also have added some randomness to the model and mitigated bias. Yang’s hope is that there will be more infrastructure in the future to enable integration of data across multiple biobanks, allowing for more diverse datasets.

The Future of Disease Prediction

As she looks to the future, Yang is interested in a time-series analysis of the patient data, which would look not only at the probability of having a disease but also when in their life a patient might have it. Another possible avenue is the integration of phenotype and genotype data in the model, which would give researchers an even more comprehensive perspective on diseases than they have now. Whatever the next step, Yang is committed to building inclusive models that work for everyone. “Whether a patient or a researcher, access to data is critical,” Yang says.

Stanford HAI’s mission is to advance AI research, education, policy and practice to improve the human condition.

