Research in our lab


Biomedical Text Mining

Our biomedical text mining group works on biological and biomedical text. Much of our work is on information extraction, including named entity recognition and event extraction. Our approach is strongly ontology-based, and we are committed to using community ontologies (such as the Gene Ontology and other OBO ontologies) to the extent possible. We collaborate extensively with the National Center for Biomedical Ontology.

NLP Validation

We will develop methods for automatically mining the full-text literature to validate computational predictions of functional sites in proteins. Our overall approach integrates predictions of protein functional sites derived from structural modeling with information extraction from text, so that statements in the literature that support or refute a prediction can be identified.

OpenDMAP

We are building a set of tools for information extraction based on the paradigm of semantic parsing. Our focus is biomedical information extraction (BioNLP), but the tools are not limited to that use. Our system is called "OpenDMAP". DMAP stands for Direct Memory Access Parsing and draws on the idea that an ontology can be used to constrain the search for information. OpenDMAP uses the semantics of the ontology and enables the definition of semantic grammars for information extraction from text. OpenDMAP is available on SourceForge.
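The idea of an ontology constraining what a grammar rule may match can be sketched in a few lines of Python. This is only a toy illustration, not OpenDMAP's pattern syntax or API: the ontology contents, the bracketed slot notation, and the names `compile_pattern` and `extract` are all assumptions made for the example.

```python
import re

# Toy ontology (hypothetical): each concept maps to the surface terms that express it.
ONTOLOGY = {
    "protein": ["Bax", "p53", "cytochrome c"],
    "cellular_component": ["nucleus", "mitochondria", "cytoplasm"],
}

# Toy semantic grammar rule: each [slot] names an ontology concept.
PATTERN = "[protein] translocates to the [cellular_component]"


def compile_pattern(pattern, ontology):
    """Turn a slot pattern into a regex whose slots accept only ontology terms."""
    parts = []
    slots = []
    for token in re.split(r"(\[\w+\])", pattern):
        if token.startswith("[") and token.endswith("]"):
            concept = token[1:-1]
            # Try longer terms first so multiword terms are not cut short.
            terms = sorted(ontology[concept], key=len, reverse=True)
            parts.append("(" + "|".join(re.escape(t) for t in terms) + ")")
            slots.append(concept)
        else:
            # Literal text between slots must match exactly.
            parts.append(re.escape(token))
    return re.compile("".join(parts)), slots


def extract(text, pattern=PATTERN, ontology=ONTOLOGY):
    """Return one {concept: matched term} dict per match of the rule in the text."""
    regex, slots = compile_pattern(pattern, ontology)
    return [dict(zip(slots, match.groups())) for match in regex.finditer(text)]
```

Because each slot is filled only by terms the ontology lists for that concept, a sentence such as "Bax translocates to the membrane." yields no match here, while the same sentence with "mitochondria" does. A real system would also need tokenization, term normalization, and nested or recursive rules, none of which this sketch attempts.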

UIMA BioNLP tools

We are building a suite of Java tools in the Unstructured Information Management Architecture (Apache UIMA) for natural language processing, tailored to the requirements of processing biological and biomedical texts. The BioNLP tools are available on SourceForge.

Hanalyzer

The Hanalyzer is an open-source data integration system designed to help biologists explain results observed in genome-scale experiments and to generate new hypotheses. It combines information extraction techniques, semantic data integration, and reasoning, and it facilitates network visualization. The Hanalyzer is available on SourceForge.

Knowtator

Philip Ogren has developed Knowtator, a plug-in for the Protégé ontology development environment that supports mark-up (also known as labeling or annotation) of text with ontology concepts. It lets domain experts create annotated corpora for natural language processing. Knowtator is available on SourceForge.

Mutation Finder

Greg Caporaso developed Mutation Finder, an open-source, high-performance information extraction system for extracting mentions of point mutations from free text. Mutation Finder is available on SourceForge.
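As a rough illustration of what extracting point mutation mentions involves, the Python sketch below matches two common notations, such as A123T and Ala123Thr. It is not Mutation Finder's actual rule set; the two patterns and the function name `find_point_mutations` are assumptions made for this example.

```python
import re

# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE_LETTER = "".join(sorted(THREE_TO_ONE.values()))
THREE_LETTER = "|".join(THREE_TO_ONE)

# Hypothetical patterns: wild-type residue, sequence position, mutant residue.
ONE_LETTER_RE = re.compile(rf"\b([{ONE_LETTER}])([1-9]\d*)([{ONE_LETTER}])\b")
THREE_LETTER_RE = re.compile(rf"\b({THREE_LETTER})([1-9]\d*)({THREE_LETTER})\b")


def find_point_mutations(text):
    """Return (wild_type, position, mutant) tuples in one-letter codes, sorted by position."""
    found = set()
    for wild_type, position, mutant in ONE_LETTER_RE.findall(text):
        found.add((wild_type, int(position), mutant))
    for wild_type, position, mutant in THREE_LETTER_RE.findall(text):
        found.add((THREE_TO_ONE[wild_type], int(position), THREE_TO_ONE[mutant]))
    return sorted(found, key=lambda m: (m[1], m[0], m[2]))
```

Patterns this simple will also match strings that are not mutations (the one-letter pattern accepts "H2A", for example), so a practical system needs further rules or filtering to keep precision high.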