My group’s research is in the broad area of bioinformatics. We develop computational methods that decipher how properties of organisms, such as how they look and function, are encoded in the DNA. Our research has specifically focused on the parts of DNA that control the activities of genes – the so-called regulatory DNA. We use techniques from a variety of disciplines including machine learning, statistics and physics to model regulatory DNA and predict its biological function. I have collaborated extensively with biologists to understand the role of regulatory DNA in several important life processes, including embryonic development, social behavior, and cancer.
Our recent work has focused on understanding how differences in regulatory DNA lead to differences among individuals in their biological properties, including predisposition to diseases and response to treatments. We have published novel methods that combine ideas from statistical mechanics and machine learning to predict the functional consequences of mutations in DNA, and to use such predictions to explain why certain individuals are resistant to cancer drugs. We have also developed specialized algorithms, based on probabilistic models, to chart the gene networks that underlie individual differences in survival rates among cancer patients or those that underlie cancer progression. As co-Director of an NIH Center of Excellence for Big Data to Knowledge (knoweng.org, 2014-19), I led a team of over 30 researchers and programmers to build a suite of computational tools, available as a cloud-based web platform, for analyzing and visualizing multiple types of genomics data and extracting useful biological insights from them. As a co-PI and a Thrust lead of the NSF AI Institute for Chemical Synthesis (https://moleculemaker.org/), I recently led the development of machine learning approaches to optimize biosynthesis or chemical synthesis strategies.
Focus areas of our future work will include (1) multi-omics – where we develop rigorous analytical approaches to combine multiple types of molecular data, e.g., genomics, transcriptomics, epigenomics, and metabolomics, to infer a coherent picture of the underlying cellular biology, and (2) spatial omics – where we analyze transcriptomics and other omics data at sub-cellular resolution to understand the dynamic processes shaping the spatial distribution of molecules. Research into these topics will aim to understand the changes accompanying a biological process such as disease progression or behavioral responses, and how the DNA encodes the program for such changes. We will use methods of machine learning and deep learning as well as probabilistic models and biophysical models, separately and in combination, to tackle these challenging problems.