BioNLP Resources

I have tried to compile a list of many of the freely available resources that researchers applying NLP/text mining to biomedical literature have used directly and cited, or that may be of use to them. This started as a collection of the resources used by participants in the BioCreAtIvE evaluation, but I have tried to extend it as I become familiar with new work and new resources. I have put in descriptions taken largely from the source webpages, but I have also included some reviews based on my own experiences or what I have heard from users other than the developers of the resource. In general though, the opinions are entirely my own. This is probably rife with errors and omissions, so please contact me to fix anything. -Alex Morgan


Text Processing Tools

Brill Tagger A part-of-speech tagger by Eric Brill which uses transformation-based learning. This tagger has been used by many groups, with various entity types (e.g. gene names) being tagged as different parts of speech.
AbGene A gene name tagger developed by John Wilbur and Lorrie Tanabe. I got this tagger working with very little effort. It uses a somewhat nonstandard input and output data format, but it is easy to munge most sources into this format.
YAGI: Yet Another Gene Identifier YAGI is the little brother to ABNER. It is a command-line annotation tool that uses conditional random fields (CRFs) trained on the BioCreative Task 1a dataset to identify gene names in biomedical text. Unlike ABNER, it will annotate only gene names (gene products, such as protein and RNA, are also labeled as genes). Using exact boundary matching, it achieves relatively high precision (~75 on unseen test data), and decent recall (~65). This is roughly the state of the art right now.
ABNER: A Biomedical Named Entity Recognizer ABNER is a machine learning system that uses a statistically trained linear-chain conditional random field (CRF) with a variety of orthographic and contextual features. The current version uses no syntactic information or lexicons. Performance is state of the art: it achieves an overall F1 score (using strict matching) of 89.3 on its training data, and 69.9 on unseen evaluation data (72.2 for the coveted "protein" entity). It is liable to perform as well for any mammalian molecular biology text. Burr Settles (Mark Craven's graduate student) has written this paper: "Biomedical Named Entity Recognition Using Conditional Random Fields and Rich Feature Sets." To appear in Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA). Geneva, Switzerland. 2004.
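The exact-boundary (strict) matching used to score YAGI and ABNER above can be illustrated with a small Python sketch. This is not code from either tool: the function name and the representation of entities as (start, end) character offsets are assumptions made for illustration. A predicted entity counts as correct only if both of its boundaries agree with a gold-standard entity.

```python
def exact_match_scores(gold, predicted):
    """Precision, recall and F1 under exact boundary matching.

    gold and predicted are iterables of (start, end) spans (hypothetical
    representation). A prediction is a true positive only if an identical
    span appears in the gold standard.
    """
    gold_set = set(gold)
    pred_set = set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up spans: the second prediction is off by one character,
# so it earns no credit under strict matching.
gold = [(0, 4), (10, 18), (25, 30)]
predicted = [(0, 4), (10, 17), (25, 30), (40, 44)]
precision, recall, f1 = exact_match_scores(gold, predicted)
```

By the same formula, the YAGI figures quoted above (precision ~75, recall ~65) work out to an F1 of roughly 70.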
YamCha YamCha is a generic, customizable, and open source text chunker oriented toward a lot of NLP tasks, such as POS tagging, Named Entity Recognition, base NP chunking, and Text Chunking. YamCha uses a state-of-the-art machine learning algorithm called Support Vector Machines (SVMs), first introduced by Vapnik in 1995. YamCha is exactly the same system which performed the best in the CoNLL2000 Shared Task, Chunking and BaseNP Chunking task. There is some good documentation here, and it seems to have been used by a variety of different researchers. It is available pre-compiled for Windows and Linux.
Mallet A Machine Learning for Language Toolkit by Andrew McCallum. This toolkit seems pretty well documented, and much of the interest in using it comes from its inclusion of a tagger which uses Conditional Random Fields (CRFs).
TnT, Trigrams 'n Tags Trigram based tagger: The tagger is an implementation of the Viterbi algorithm for second order Markov models. The main paradigm used for smoothing is linear interpolation; the respective weights are determined by deleted interpolation. Unknown words are handled by a suffix trie and successive abstraction. Kevin B. Cohen had great things to say about using this tool: "We were impressed by its availability on a variety of platforms, its intuitive interface, and the stability of its distribution, which installed easily and never crashed."
Lucene Jakarta Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. I've used Lucene a little bit, and I've found it to be very straightforward. There is lots of customization available, and it can do clever things with pre-defined meta data (it takes in XML).
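The linear interpolation smoothing mentioned in the TnT description can be sketched roughly as follows. This is an illustration in Python, not TnT's code: the function names are hypothetical, and the interpolation weights are simply passed in by hand, whereas TnT estimates them by deleted interpolation. The smoothed trigram probability is a weighted sum of the unigram, bigram and trigram relative frequencies.

```python
from collections import Counter


def train_counts(tag_sequences):
    """Count tag unigrams, bigrams and trigrams in a list of tag sequences."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for tags in tag_sequences:
        for i, tag in enumerate(tags):
            uni[tag] += 1
            if i >= 1:
                bi[(tags[i - 1], tag)] += 1
            if i >= 2:
                tri[(tags[i - 2], tags[i - 1], tag)] += 1
    return uni, bi, tri


def interpolated_prob(t1, t2, t3, uni, bi, tri, lambdas):
    """P(t3 | t1, t2) as a weighted mix of three relative frequencies.

    lambdas = (l1, l2, l3) are assumed to sum to 1; here they are
    hand-picked, not estimated from data.
    """
    l1, l2, l3 = lambdas
    total = sum(uni.values())
    p_uni = uni[t3] / total if total else 0.0
    p_bi = bi[(t2, t3)] / uni[t2] if uni[t2] else 0.0
    p_tri = tri[(t1, t2, t3)] / bi[(t1, t2)] if bi[(t1, t2)] else 0.0
    return l1 * p_uni + l2 * p_bi + l3 * p_tri


# Tiny made-up training set of tag sequences.
uni, bi, tri = train_counts([["DT", "NN", "VB"], ["DT", "NN", "NN"]])
p_seen = interpolated_prob("DT", "NN", "VB", uni, bi, tri, (0.2, 0.3, 0.5))
# The context (VB, NN) never occurs, yet the estimate stays above zero
# because the lower-order terms still contribute.
p_unseen = interpolated_prob("VB", "NN", "VB", uni, bi, tri, (0.2, 0.3, 0.5))
```

The point of the mixture is visible in the last line: a trigram that was never observed still receives probability mass from its bigram and unigram components.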
Grok Grok is a library of natural language processing components, including support for parsing with categorial grammars and various preprocessing tasks such as part-of-speech tagging, sentence detection, and tokenization. -- Grok is an open source natural language processing library written in Java. It is part of the OpenNLP project, and provides implementations for most of the interfaces defined in the opennlp.common package.
OpenNLP OpenNLP provides the organizational structure for coordinating several different projects which approach some aspect of Natural Language Processing. OpenNLP also defines a set of Java interfaces and implements some basic infrastructure for NLP components. There is a fairly active discussion group at the SourceForge site as well as a mailing list.
Alias-i's LingPipe LingPipe is a suite of Java tools designed to perform linguistic analysis on natural language data. While fast and robust enough to be used in a customer-facing commercial system, LingPipe's flexibility and included source make it appropriate for research use. Version 1.0 tools include a statistical named-entity detector, a heuristic sentence boundary detector, and a heuristic within-document coreference resolution engine. Named entity extraction models are included for English news and English genomics domains, and can be trained for other languages and genres. We have been able to use it with few problems and have tagged hundreds of thousands of MEDLINE abstracts with it. Alias-i reported very competitive performance in BioCreAtIvE 2004's Task 1A with very little modification of the base system, requiring only a couple of days to set things up and train on the training set. The javadoc is comprehensive. Many of the Java classes can be executed on the command line to do processing of XML files for things like tokenization and sentence tagging. The distribution comes with pre-trained named entity taggers for Task 1A and for the tagset in the GENIA corpus, for which they also have decent results on their webpage. There is a GUI provided which does processing and NE tagging using the GENIA tagset, which is very quick.
BioPython The Biopython Project is an international association of developers of freely available Python tools for computational molecular biology. There is a lot of stuff in here of use for those interested in bioinformatics/genomics, but one of the nicest aspects is an interface/parser for most of the major biological databases, the MEDLINE interface being the most relevant to text mining.
Python NLTK (Natural Language Toolkit) NLTK, the Natural Language Toolkit, is a suite of Python libraries and programs for symbolic and statistical natural language processing. NLTK includes graphical demonstrations and sample data. It is accompanied by extensive documentation, including tutorials that explain the underlying concepts behind the language processing tasks supported by the toolkit. NLTK is ideally suited to students who are learning NLP (natural language processing) or conducting research in NLP or closely related areas, including empirical linguistics, cognitive science, artificial intelligence, information retrieval, and machine learning. NLTK has been used successfully as a teaching tool, as an individual study tool, and as a platform for prototyping and building research systems. A good place to start for those learning about NLP for the first time, this has been used in many academic situations. It is extremely well documented, with tutorials which not only explain the tool, but also give an overview of the subject (e.g. document clustering). I was able to go from downloading it for the first time to creating and training a 2004 Task 1A system (bigram gene name tagger) in about an hour. There are some performance issues in speed and memory usage inherent in interpreted Python.
TinySVM TinySVM is an implementation of Support Vector Machines (SVMs) for the problem of pattern recognition. Support Vector Machines are a new generation of learning algorithms based on recent advances in statistical learning theory, and have been applied to a large number of real-world applications, such as text categorization and hand-written character recognition. TinySVM has been used by a variety of groups, generally to do document categorization.
SVMlight SVMlight is an implementation of Support Vector Machines (SVMs) in C. We have used SVMlight with good success. It is nice in that it implements sparse vectors well (useful for large-vocabulary text classification). It can also handle large numbers of training examples. It is actually part of a larger family of software using SVMs for learning more complex (multivariate) output. Joachims is working on projects including PCFGs, sequence alignment and HMMs as part of SVMstruct.
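To make the sparse-vector point concrete, here is a rough Python sketch of turning tokenized documents into lines of SVMlight's sparse input format, where each line is a target value followed by feature:value pairs in increasing feature order and only non-zero features are listed. The helper names and the use of raw term counts as feature values are my own assumptions for illustration, not part of SVMlight.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct token to an integer feature id, starting at 1."""
    vocab = {}
    for tokens in documents:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab


def to_svmlight_line(label, tokens, vocab):
    """Format one document as '<target> <feature>:<value> ...'.

    Feature values are raw term counts (an assumption; tf-idf or binary
    values are equally possible). Tokens missing from vocab are skipped.
    """
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    features = " ".join(f"{idx}:{counts[idx]}" for idx in sorted(counts))
    return f"{label:+d} {features}"


# Two made-up tokenized documents with class labels +1 and -1.
docs = [["gene", "expression", "gene"], ["stock", "market"]]
vocab = build_vocabulary(docs)
line_pos = to_svmlight_line(1, docs[0], vocab)
line_neg = to_svmlight_line(-1, docs[1], vocab)
```

However large the vocabulary grows, each line only mentions the handful of words that actually occur in that document, which is what keeps text classification with tens of thousands of features manageable.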
R Project for Statistical Computing R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories. There are some important differences, but much code written for S runs unaltered under R. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. R is available as Free Software under the terms of the Free Software Foundation's GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS. Although R doesn't do text processing by itself, it has been used for a variety of statistical tasks using the variety of packages available, including document classification, document clustering (e.g. Craig Struble used the AGNES package to do hierarchical clustering of MEDLINE abstracts using MeSH and term features), and non-parametric density estimation for feature weighting. I found using R to be an enjoyable experience. The only significant problem is that it is so powerful, and was designed by such sophisticated statisticians, that it can sometimes be hard to use. The R mailing lists though are extremely active and those who post are very knowledgeable and helpful.
TIGERSearch The TIGERSearch software lets you explore linguistically annotated texts. For example, a lexicographer or terminologist can use TIGERSearch to find out about lexical properties of a word like the collocations the word is used in. A linguist could employ TIGERSearch to obtain sample sentences for the syntactic phenomena he is interested in. Jasmin Saric has this to say: "It comes with several filters for different input formats. It proved to be helpful for grammar development in our experiments."
CASS by Steve Abney This is very well documented and has been widely used both in research and in education (student projects in NLP).
MedPost MedPost: a part-of-speech tagger for bioMedical text. As described in this paper by Smith, Rindflesch and Wilbur, it has 97% performance on GENIA POS tags. My subscription to Bioinformatics lapsed, but Aaron Cohen of OHSU tells me it is an HMM-based stochastic tagger. Being HMM based, it uses the Viterbi algorithm to find the most likely tag sequence. (Thanks Aaron!)
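For readers unfamiliar with how an HMM tagger picks the most likely tag sequence, here is a minimal Python sketch of the Viterbi algorithm for a first-order HMM. It is a generic textbook version, not MedPost's implementation; the function signature, the two-tag model and all of the probabilities are invented for illustration. It assumes a non-empty observation sequence.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely state sequence for a first-order HMM (log space)."""

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[i][s]: log probability of the best path ending in state s at step i.
    best = [{s: log(start_p[s]) + log(emit_p[s].get(observations[0], 0.0))
             for s in states}]
    back = [{}]  # back[i][s]: predecessor of s on that best path
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p] + log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            scores[s] = score + log(emit_p[s].get(obs, 0.0))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy model with invented probabilities (emissions cover only part of
# each state's distribution; the rest is imagined to go to other words).
states = ["NN", "VB"]
start_p = {"NN": 0.7, "VB": 0.3}
trans_p = {"NN": {"NN": 0.4, "VB": 0.6}, "VB": {"NN": 0.7, "VB": 0.3}}
emit_p = {"NN": {"proteins": 0.6, "bind": 0.1},
          "VB": {"proteins": 0.05, "bind": 0.5}}
tags = viterbi(["proteins", "bind"], states, start_p, trans_p, emit_p)
```

Working in log space avoids numerical underflow on long sentences, since the products of many small probabilities become sums.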

Lexical Resources

HUGO This is nomenclature for many of the known human genes and includes synonym variants.
NCBI Organism Taxonomy A source of the scientific and common names of all sorts of organisms.
Grady Ward's Moby Several dictionaries of English words and names. This has been used to filter out common English words when tagging biological entities in text.
Swiss-Prot A curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.). This has been used by many groups as a source of gene names and synonyms.
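The filtering use mentioned for the Moby word lists amounts to something like the following Python sketch. The function name and the tiny word list are hypothetical stand-ins; a real run would load one of the Moby dictionaries instead.

```python
def filter_common_words(candidates, common_words):
    """Drop candidate entity names that are ordinary English words.

    Matching is case-insensitive, which is a design choice: it removes
    more false positives but can also discard genuine gene symbols that
    happen to be spelled like English words.
    """
    common = {word.lower() for word in common_words}
    return [name for name in candidates if name.lower() not in common]


# Hypothetical tagger output and a stand-in for a Moby word list.
candidates = ["was", "BRCA1", "for", "p53"]
common_words = ["was", "for", "the"]
kept = filter_common_words(candidates, common_words)
```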
KEGG KEGG is a pathway database, but has been used to extract the names of compounds.
UMLS The purpose of NLM's Unified Medical Language System (UMLS) is to facilitate the development of computer systems that behave as if they "understand" the meaning of the language of biomedicine and health. To that end, NLM produces and distributes the UMLS Knowledge Sources (databases) and associated software tools (programs) for use by system developers in building or enhancing electronic information systems that create, process, retrieve, integrate, and/or aggregate biomedical and health data and information, as well as in informatics research. By design, the UMLS Knowledge Sources are multi-purpose. They are not optimized for particular applications, but can be applied in systems that perform a range of functions involving one or more types of information, e.g., patient records, scientific literature, guidelines, public health data. The associated UMLS software tools assist developers in customizing or using the UMLS Knowledge Sources for particular purposes. The lexical tools work more effectively in combination with the UMLS Knowledge Sources, but can also be used independently.
IUPAC & IUBMB: Biochemical & Organic Nomenclature INTERNATIONAL UNION OF BIOCHEMISTRY AND MOLECULAR BIOLOGY - Recommendations on Biochemical & Organic Nomenclature, Symbols & Terminology etc. Lots of information about the ways all sorts of biologically relevant compounds (e.g. various enzyme types) are named by biochemists. Includes lots of notes and references on the EC enzyme nomenclature (lots of synonyms).
Gene Synonym Lists This page includes some synonym lists for genes and proteins from several model organisms (A. thaliana, C. elegans, D. melanogaster, H. sapiens, M. musculus, S. cerevisiae) and a mapping for orthologous genes among those organisms. Unfortunately these are just synonymous unique identifiers, not common gene/protein names. These lists were developed by Jasmin Saric from the BioInformatics Unit of the Max Delbrück Center in support of some experiments trying to extract yeast regulatory networks from text as described in this paper.

Corpora

GENIA Corpus GENIA Corpus Version 3.0 consists of 2000 abstracts. The base abstracts are selected from the search results with keywords (MeSH terms) Human, Blood Cells, and Transcription Factors. The corpus is tagged with elements from the GENIA ontology (protein, chemical, etc), part of speech, and co-reference in XML. The GENIA corpus is something of a de facto standard corpus in biomedical text mining. Numerous groups have used this corpus to train and evaluate their system performance. It has one of the richest ontologies that has been used for direct text annotation, combined with lots of data.
BioCreAtIvE 2004 Corpus This is the corpus used for the BioCreAtIvE evaluation. The evaluation had two parts, gene identification (and normalization) and automatic functional annotation using the Gene Ontology annotations.
Nigel Collier's Biol Corpus This collection of abstracts is tagged for mentions of a variety of biologically relevant entities (primarily protein and source organism).
BioMed Central BioMed Central has so far published 4645 articles of peer-reviewed biomedical research, all of which are covered by our open access license agreement which allows free distribution and re-use of the full text article, including the highly structured XML version. As a result, BioMed Central's research article corpus is ideally suited for use by data mining researchers. BioMed Central is promoting datamining research by providing this nice portal to the full text of their journal articles in a consistent XML format. The collection continues to grow as more articles are reviewed and integrated. Now someone has to linguistically tag/annotate some of these. As a side note, BioMed Central is seeking research submissions in the area of datamining applied to the biomedical domain.
Yapex Reference and Evaluation Datasets The two collections consist of MEDLINE abstracts obtained in different ways: 1) A document set was obtained by posing the query 'protein binding [Mesh term] AND interaction AND molecular' with the parameters 'abstract', 'english', 'human', and 'publication date 1996-2001' to MEDLINE. From this set 99 abstracts were drawn randomly to form the reference (training) collection. Another non-overlapping set of 48 abstracts was drawn to form a part of the test collection. 2) The remaining 53 abstracts of the test collection were randomly chosen from the GENIA corpus. The protein names of all the abstracts above were annotated by domain experts connected to the Yapex project. Reference collection: 99 abstracts containing 1745 protein names (yapex_ref_collection.xml). Test collection: 101 abstracts containing 1966 protein names (yapex_test_collection.xml). The Yapex system which this was used to develop and evaluate is described here.
MEDLINE abstracts with proteins tagged 1 and proteins tagged 2 Burr Settles (U Wisconsin) pointed me to these data sets from Ray Mooney's group at UT-Austin. The names of proteins seem to be tagged, whereas genes/DNAs/RNAs are not. I am not sure how the second set relates to interactions. I am also left somewhat confused by the tag guidelines that tag things like this: "Activation of the cdc2 protein kinase at different stages..."
PathBinder (from Daniel Berleant at Iowa State) PathBinder is a collection of sentences extracted from MEDLINE. Every sentence contains 2 or more different biomolecules. A dictionary of 40,000 biomolecules (80,000 names) was used to scan against all MEDLINE abstracts. The sentences are organized in a 2-level indexed structure.
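The kind of dictionary scan described for PathBinder can be sketched in Python as below. This is only my guess at the general idea, not PathBinder's actual method: the function name, the naive regular-expression sentence splitter and the tiny name dictionary are all assumptions. Because several names can refer to the same biomolecule (80,000 names for 40,000 molecules), the sketch counts distinct molecule identifiers rather than distinct names.

```python
import re


def sentences_with_cooccurrence(text, name_to_id, min_entities=2):
    """Return (sentence, ids) pairs for sentences that mention at least
    min_entities different biomolecules from the dictionary."""
    # Naive sentence splitting on end punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    hits = []
    for sentence in sentences:
        lowered = sentence.lower()
        found = {
            entity_id
            for name, entity_id in name_to_id.items()
            if re.search(r"\b" + re.escape(name.lower()) + r"\b", lowered)
        }
        if len(found) >= min_entities:
            hits.append((sentence, sorted(found)))
    return hits


# Hypothetical dictionary: two names map to the same molecule.
name_to_id = {"p53": "TP53", "tp53": "TP53", "mdm2": "MDM2"}
text = ("p53 (TP53) is a tumor suppressor. "
        "MDM2 binds p53 and promotes its degradation. "
        "Cells were cultured overnight.")
hits = sentences_with_cooccurrence(text, name_to_id)
```

Checking every name against every sentence like this would be far too slow for a dictionary of 80,000 names over all of MEDLINE; a real system would need an indexed or trie-based matcher.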

Annotation Tools

TIMS (Tag Information Management System) Tag Information Management System (TIMS) Workbench allows the user to perform/view tagging on a particular document. Since tag information is stored separately from original documents and managed using external database software, various different types of tags for the same document can be added. The system also keeps track of the Audit Trail or History, i.e., the date and time, the user or system that performed the tagging etc. TIMS has the facility to export a document from TIMS, at which time all the tag information will be converted to XML format and embedded within the document for portability. Similarly, it is possible to import either an untagged or a tagged XML document into TIMS. TIMS also has a facility to search logically for tags, which we call Interval Operations. They are XML-specific text/data mining operations which can be performed over TIMS documents. The operations can be done over open documents or in batch mode, selecting the files from a list. A very sophisticated tagging tool created for the GENIA project, this Java-based application uses standard Java interfaces to a database server of the user's choice (I used Postgres) to hold the tag information. The price of this sophistication and customizability is that it can be rather complex and unwieldy. It took me quite a while to get it working, although admittedly I didn't know very much about databases or Java at the time.
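The idea of keeping tags separate from the document and embedding them as XML only on export can be illustrated with a short Python sketch. This is not how TIMS itself is implemented (it is a Java application backed by a database); the function name and the (start, end, tag) tuple layout are assumptions, and the sketch only handles non-overlapping annotations.

```python
from xml.sax.saxutils import escape


def embed_standoff_tags(text, annotations):
    """Convert standoff annotations into inline XML tags.

    annotations is a list of (start, end, tag) tuples giving character
    offsets into text (hypothetical layout). Overlapping spans are
    rejected to keep the output well-formed.
    """
    pieces = []
    cursor = 0
    for start, end, tag in sorted(annotations):
        if start < cursor:
            raise ValueError("overlapping annotations are not handled in this sketch")
        pieces.append(escape(text[cursor:start]))
        pieces.append(f"<{tag}>{escape(text[start:end])}</{tag}>")
        cursor = end
    pieces.append(escape(text[cursor:]))
    return "".join(pieces)


# The original text is left untouched; offsets are stored on the side.
text = "IL-2 activates NF-kappa B in T cells."
nfkb_start = text.index("NF-kappa B")
annotations = [
    (0, 4, "protein"),
    (nfkb_start, nfkb_start + len("NF-kappa B"), "protein"),
]
tagged = embed_standoff_tags(text, annotations)
```

Because the offsets live outside the text, a second set of annotations (say, part-of-speech tags) can be stored for the same document without disturbing the first, which is the advantage the description above points to.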
Alembic Workbench The Alembic Workbench project has as its goal the creation of a natural language engineering environment for the development of tagged corpora. To enhance this process, the workbench incorporates a suite of tools for the analysis of a corpus, along with the Alembic system to enable the automatic acquisition of domain-specific tagging heuristics. This is MITRE's very crufty annotation tool written in tcl/tk, and was basically the first of its type. I have used it to do some protein/gene name tagging simply by defining a preference file, but the documentation is somewhat out of date. It can be difficult to install, but usage is very straightforward when actually in operation. This tool hasn't been actively supported in several years, and has been replaced by a tool written in Java named Callisto (see below).
WordFreak WordFreak is a Java-based linguistic annotation tool designed to support human and automatic annotation of linguistic data, as well as to employ active learning for human correction of automatically annotated data. This is being used for UPenn's ambitious project Mining the Bibliome: Information Extraction from the Biomedical Literature to parse and annotate biomedical journal abstracts.
Callisto The Callisto annotation tool was developed to support linguistic annotation of textual sources for any Unicode-supported language. Information Extraction (IE) systems are increasingly easy to adapt to varying domains, and by using machine learning techniques, this process is becoming largely automatic. However, adaptive/adaptable systems require training and test data against which to measure and improve their performance. Hand annotation can be an arduous task, but a well designed user interface can greatly ease the burden. This is the function of Callisto. Callisto has been built with a modular design, and utilizes standoff-annotation, allowing for unique tag-set definitions and domain dependent interfaces. Standoff-annotation support, provided by jATLAS, allows for nearly any annotation task to be represented. The modular design of Callisto allows it to be extended with user interface components specific to a domain. Default tag editing capabilities are provided through a highlighted text display, and tag attribute tables. As domain specific extension components are developed, they may be integrated into the core of Callisto, to become part of the standard suite of available components. Callisto is being used by several groups external to the developers with quite a bit of success (including annotating MEDLINE abstracts). Development continues on Callisto, improving the interface and increasing functionality for sophisticated annotation (fancy relations tagging tasks, tags from a giant ontology, etc.).

Last modified: Sat Nov 20 12:02:46 EST 2004