Identifying Genetic Factors in Disease with Big Data

FaST-LMM speeds genome analysis

It’s long been known that many serious diseases (including heart disease, asthma, and many forms of cancer) run in families. Until fairly recently, however, medical researchers have had no easy way of identifying the particular genes that are associated with a given malady. Now genome-wide association studies, which take advantage of our ability to sequence a person’s DNA, have enabled medical researchers to statistically correlate specific genes with particular diseases.

Sounds great, right? Well, it is, except for one significant problem: to study the genetics of a particular condition, say heart disease, researchers need a large sample of people who have the disorder, which means that some of these people are likely to be related to one another, even if only distantly. As a result, some of the apparent associations between specific genes and heart disease are false positives, the result of two people sharing a common ancestor rather than a common propensity for clogged coronaries. In other words, the sample is not truly random, and you must statistically correct for the “confounding” caused by the relatedness of your subjects.

This is not an insurmountable statistical problem: so-called linear mixed models (LMMs) can eliminate the confounding. Using them, however, is a computational problem, because it takes an inordinately large amount of computer runtime and memory to run LMMs that account for the relatedness among thousands of people in your sample. In fact, the runtime and memory footprint required by these models scale as the cube and the square, respectively, of the number of individuals in the dataset. So, when you’re dealing with a 10,000-person sample, the cost of the computer time and memory can quickly become prohibitive. And it is precisely these large datasets that offer the most promise for finding the connections between genetics and disease.
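To get a feel for what cubic and quadratic scaling mean in practice, here is a rough back-of-the-envelope sketch. It assumes the model stores one dense similarity value for every pair of individuals as an 8-byte float, and it counts operations only up to a constant factor; the function name `dense_lmm_cost` is made up for this illustration and does not come from any LMM package.

```python
def dense_lmm_cost(n_individuals, bytes_per_value=8):
    """Rough cost of a dense n x n pairwise matrix (illustrative assumption).

    Returns (memory_bytes, relative_ops): memory grows with the square of
    the number of individuals, operations with the cube (constant factors
    are ignored).
    """
    memory_bytes = n_individuals ** 2 * bytes_per_value
    relative_ops = n_individuals ** 3
    return memory_bytes, relative_ops


# Compare three hypothetical cohort sizes.
for n in (1_000, 10_000, 100_000):
    memory_bytes, relative_ops = dense_lmm_cost(n)
    print(f"{n:>7} individuals: {memory_bytes / 1e9:8.3f} GB, "
          f"~{relative_ops:.1e} operations")
```

Under these assumptions, every tenfold increase in sample size multiplies the memory by 100 and the operation count by 1,000, which is why cohorts in the tens of thousands become so expensive so quickly.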

Enter Factored Spectrally Transformed Linear Mixed Model (FaST-LMM), an algorithm for genome-wide association studies that scales linearly in the number of individuals in both runtime and memory use (see FaST linear mixed models for genome-wide association studies). Developed by Microsoft Research, FaST-LMM can analyze data for 120,000 individuals in just a few hours, whereas the current algorithms fail to run at all with even 20,000 individuals. This means that the large datasets that are indispensable to genome-wide association studies are now computationally manageable from a memory and runtime perspective.
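The “spectrally transformed” part of the name hints at the general trick: rotate the data by the eigenvectors of the similarity matrix so that the correlated problem becomes an uncorrelated, weighted one. The sketch below illustrates that idea only. It is not the FaST-LMM implementation, it uses simulated data, and the function names and the variance ratio `delta` are assumptions made for this example. It also does not reproduce the linear scaling described above, because it performs a full dense eigendecomposition.

```python
import numpy as np


def gls_direct(X, y, K, delta):
    """Generalized least squares with covariance K + delta*I (dense solve)."""
    V = K + delta * np.eye(K.shape[0])
    Vinv_X = np.linalg.solve(V, X)
    Vinv_y = np.linalg.solve(V, y)
    return np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_y)


def gls_rotated(X, y, K, delta):
    """Same estimate after rotating by the eigenvectors of K.

    With K = U diag(S) U^T, the rotated covariance is diag(S + delta),
    so the fit reduces to weighted least squares with per-row weights.
    """
    S, U = np.linalg.eigh(K)          # spectral decomposition of K
    X_rot = U.T @ X                   # rotate covariates
    y_rot = U.T @ y                   # rotate phenotype
    weights = 1.0 / (S + delta)       # inverse of the diagonal covariance
    lhs = X_rot.T @ (weights[:, None] * X_rot)
    rhs = X_rot.T @ (weights * y_rot)
    return np.linalg.solve(lhs, rhs)


# Simulated toy data (all values are made up for illustration).
rng = np.random.default_rng(0)
n_individuals, n_snps = 50, 200
G = rng.standard_normal((n_individuals, n_snps))
K = G @ G.T / n_snps                  # toy similarity matrix
X = np.column_stack([np.ones(n_individuals),
                     rng.standard_normal(n_individuals)])
y = rng.standard_normal(n_individuals)

beta_direct = gls_direct(X, y, K, delta=1.0)
beta_rotated = gls_rotated(X, y, K, delta=1.0)
print(np.allclose(beta_direct, beta_rotated))
```

The appeal of the rotated form is that the decomposition can be computed once and then reused, leaving only cheap weighted fits for each gene variant tested.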

With FaST-LMM, researchers will have the ability to analyze hundreds of thousands of individuals to look for relationships between our DNA and our traits, identifying not only what diseases we may get, but also which drugs will work well for a specific patient and which ones won’t. In short, it puts us one step closer to the day when physicians can provide each of us with a personalized assessment of our risk of developing certain diseases and can devise prevention and treatment protocols that are attuned to our unique hereditary makeup.

David Heckerman, Distinguished Scientist, Microsoft Research Connections; Jennifer Listgarten, Researcher, Microsoft Research Connections
