I work in the field of computational biology. I apply advanced machine learning and statistical modeling techniques to massive amounts of biomedical data. In my Ph.D. work I focused on building and refining models of molecular networks and on applying these models to address biological and medical problems.
Using the Molecular Network to Analyze Common Hereditary Disorders
Common complex hereditary disorders are believed to be both multifactorial and heterogeneous. Traditional methods of genetic analysis work successfully on "simple" Mendelian disorders, in which variation at a single genomic locus almost deterministically influences whether the individual will get the disease, but they fail when applied to complex (multifactorial and heterogeneous) diseases. We propose an extension of the classical linkage formalism based on an intuitive assumption: the loci that contribute to the risk of a given common disorder are functionally related and, therefore, the corresponding genes and proteins form a tight cluster in a molecular network.
There are multiple sources of information about molecular interactions, with the scientific literature and large-scale yeast two-hybrid assays being among the most utilized today. These two sources describe the same biological phenomena but have distinct biases and error patterns. For example, biologists tend to publish studies on areas of the network that have been popular at some time (e.g., neighborhoods of genes implicated in certain human diseases), while the high-throughput yeast two-hybrid screen is not equally easy to apply across all proteins (e.g., the yeast two-hybrid method is known not to work for membrane proteins). Hence, it is reasonable to hypothesize that describing and analyzing the two data sources in the context of a unified probabilistic model will lead to better coverage and potentially higher-quality networks. We built a composite model with separate probabilistic components describing the discovery and publication of statements about molecular interactions in the literature, the process of generating results from high-throughput experiments, and a model of protein-protein interactions. The protein-protein interaction component connects the whole model and is based on features of the proteins. Fitting the joint model to the available data allows us to estimate the error rates associated with the text-mined and yeast two-hybrid datasets and to predict novel protein-protein interactions.
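As a rough illustration of how fitting a joint model can recover source-specific error rates, the following Python sketch runs expectation-maximization over hidden interaction states. It is a deliberately simplified stand-in, not the composite model described above: the feature-based interaction probabilities are assumed to be given as plain numbers, each source is reduced to one true-positive and one false-positive rate, and the data are synthetic. All names (`fit_error_rates`, `likelihood`, `noisy`) and all numeric rates are hypothetical.

```python
import random


def likelihood(obs, p_positive):
    """P(obs | hidden state); None means the pair was not tested by this source."""
    if obs is None:
        return 1.0  # an untested pair carries no evidence
    return p_positive if obs else 1.0 - p_positive


def fit_error_rates(priors, sources, n_iter=50):
    """EM over hidden interaction states (illustrative assumption, not the original model).

    priors  -- assumed feature-based P(interaction) for each protein pair, in (0, 1)
    sources -- dict: source name -> list of observations (True / False / None)
    Returns (rates, posteriors), where rates[name] = (tpr, fpr).
    """
    rates = {name: (0.7, 0.2) for name in sources}  # arbitrary starting point
    posteriors = list(priors)
    for _ in range(n_iter):
        # E-step: posterior probability that each pair truly interacts.
        for i, prior in enumerate(priors):
            w_true, w_false = prior, 1.0 - prior
            for name, obs in sources.items():
                tpr, fpr = rates[name]
                w_true *= likelihood(obs[i], tpr)
                w_false *= likelihood(obs[i], fpr)
            posteriors[i] = w_true / (w_true + w_false)
        # M-step: re-estimate each source's rates from the soft assignments.
        for name, obs in sources.items():
            pos_true = pos_false = tot_true = tot_false = 0.0
            for r, o in zip(posteriors, obs):
                if o is None:
                    continue
                tot_true += r
                tot_false += 1.0 - r
                if o:
                    pos_true += r
                    pos_false += 1.0 - r
            if tot_true == 0.0 or tot_false == 0.0:
                continue  # source has no usable observations; keep old rates
            # Clamp away from 0 and 1 so later likelihoods never vanish.
            tpr = min(max(pos_true / tot_true, 1e-6), 1.0 - 1e-6)
            fpr = min(max(pos_false / tot_false, 1e-6), 1.0 - 1e-6)
            rates[name] = (tpr, fpr)
    return rates, posteriors


# Synthetic demonstration with made-up rates and missingness.
rng = random.Random(0)
n_pairs = 2000
priors = [rng.uniform(0.05, 0.95) for _ in range(n_pairs)]
truth = [rng.random() < p for p in priors]


def noisy(tpr, fpr, p_missing):
    """Simulate one source: drop some pairs, report the rest with errors."""
    return [
        None if rng.random() < p_missing
        else (rng.random() < (tpr if t else fpr))
        for t in truth
    ]


sources = {
    "literature": noisy(0.8, 0.05, 0.3),
    "y2h": noisy(0.6, 0.10, 0.4),
}
rates, posteriors = fit_error_rates(priors, sources)
for name, (tpr, fpr) in rates.items():
    print(f"{name}: estimated TPR={tpr:.2f}, FPR={fpr:.2f}")
```

Pairs marked `None` stand in for cases a source could not cover, such as membrane proteins in a yeast two-hybrid screen; they contribute no evidence for or against an interaction. Pairs with a high posterior but no positive observation would be the candidates for novel interactions in this toy setting.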
Biomedical Text-mining
The biomedical literature is a rich source of unstructured data about molecular interactions. Both the tremendous growth of the scientific literature and the need to look at biological systems at a global, system-wide level call for automated approaches to extracting information locked in text form. Our group has developed the GeneWays system, which addresses this need.
Any text-mining system is bound to make mistakes. We have developed a statistical approach to automate the identification of facts incorrectly extracted by the rule-based GeneWays system. This ability can direct future improvement of the system and, more importantly, can substantially improve its precision.
A more serious problem is that not all published statements are correct. That such errors exist is indicated both by the fact that some papers are eventually retracted and by observed inconsistencies between statements from different articles. In particular, in our automatically extracted data, we observe sets of statements about a particular interaction containing both positive (claiming that the two entities interact) and negative (claiming that the two entities do not interact) assertions. We developed global statistical models describing the database of all extracted facts, which allowed us to resolve such inconsistencies and also to uncover curious trends in the collaborative dynamics of a scientific community. For example, we observed with high statistical confidence that already published statements do influence the interpretation of current experimental results.
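To make the idea of resolving conflicting statements concrete, here is a minimal Python sketch that turns counts of positive and negative assertions about one interaction into a posterior probability that the interaction is real. It is an illustrative simplification rather than the global model described above: it treats statements as independent given the truth, which the observation about earlier publications influencing later ones suggests is not strictly the case. The function name `posterior_interaction` and the default values for `prior`, `tpr`, and `fpr` are assumptions chosen for the example.

```python
import math


def posterior_interaction(n_pos, n_neg, prior=0.5, tpr=0.8, fpr=0.1):
    """Posterior probability that an interaction is real.

    n_pos, n_neg -- numbers of positive and negative published statements
    prior        -- assumed prior probability of a real interaction
    tpr          -- assumed P(statement is positive | interaction is real)
    fpr          -- assumed P(statement is positive | interaction is not real)
    Statements are treated as independent given the truth (a simplification).
    """
    log_true = (math.log(prior)
                + n_pos * math.log(tpr)
                + n_neg * math.log(1.0 - tpr))
    log_false = (math.log(1.0 - prior)
                 + n_pos * math.log(fpr)
                 + n_neg * math.log(1.0 - fpr))
    # Normalize in log space to stay stable for large statement counts.
    top = max(log_true, log_false)
    w_true = math.exp(log_true - top)
    w_false = math.exp(log_false - top)
    return w_true / (w_true + w_false)


# Three articles assert the interaction, one denies it.
print(posterior_interaction(n_pos=3, n_neg=1))
# No positive statements, three negative ones.
print(posterior_interaction(n_pos=0, n_neg=3))
```

Under these assumed rates a single dissenting article does not overturn several supporting ones, while a run of negative statements pulls the posterior well below the prior.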