Entity identification
What entity identification is
Entity identification, also known as named entity recognition, is the task of locating the names of things in text. For instance, given a text like "John Denver will be playing at the Denver Buffalo Company in Denver, CO," an entity identification system should locate "John Denver," "Denver Buffalo Company," and "Denver, CO." Typically the task is defined as not just locating the names but also assigning each of them to one of a given set of classes. So, for the example sentence, you would want your system to realize that John Denver is a person, that Denver Buffalo Company is a company (a restaurant, actually), and that Denver, CO is a place.
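To make the two halves of the task concrete (locating the names, then assigning each one a class), here is a minimal sketch in Python. It is only a toy: it looks names up in a tiny hand-built lexicon and prefers longer names when matches overlap. The lexicon, the class labels, and the function name identify_entities are all assumptions made up for this illustration; real systems rely on rules or statistical models rather than an exhaustive list.

```python
from typing import Dict, List, NamedTuple


class Entity(NamedTuple):
    start: int   # character offset where the name begins
    end: int     # character offset just past the end of the name
    text: str    # the name as it appears in the text
    label: str   # the class assigned to the name


# Hypothetical toy lexicon, built only for the example sentence.
LEXICON: Dict[str, str] = {
    "John Denver": "PERSON",
    "Denver Buffalo Company": "COMPANY",
    "Denver, CO": "PLACE",
}


def identify_entities(text: str, lexicon: Dict[str, str] = LEXICON) -> List[Entity]:
    """Locate lexicon names in text and label them, longest names first."""
    entities: List[Entity] = []
    taken = [False] * len(text)  # characters already covered by a match
    # Try longer names first so a long name wins over a shorter overlapping one.
    for name in sorted(lexicon, key=len, reverse=True):
        start = text.find(name)
        while start != -1:
            end = start + len(name)
            if not any(taken[start:end]):
                entities.append(Entity(start, end, name, lexicon[name]))
                for i in range(start, end):
                    taken[i] = True
            start = text.find(name, start + 1)
    return sorted(entities)  # NamedTuples sort by start offset first


if __name__ == "__main__":
    sentence = ("John Denver will be playing at the "
                "Denver Buffalo Company in Denver, CO.")
    for entity in identify_entities(sentence):
        print(entity.label, "->", entity.text)
```

Notice that the string "Denver" occurs in all three names with three different classes, which is one reason a plain word list is not enough and why the task is interesting.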
The classic definitions of the entity identification task came out of a series of competitions called the Message Understanding Conferences. Their definitions of "name" and "entity" were somewhat broader than you might expect, including things like monetary amounts and times.
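As a rough illustration of what it means to treat monetary amounts and times as "entities," the sketch below finds them with regular expressions. The patterns, the labels MONEY and TIME, and the function name find_numeric_entities are assumptions chosen for this example, not the official definitions or tag names from the Message Understanding Conferences, and the patterns cover only a few simple formats.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; they handle a handful of common formats.
PATTERNS = [
    # "$5", "$1,200", "$3.5 million"
    ("MONEY", re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?(?:\s(?:million|billion))?")),
    # "9:30", "9:30 AM", "10:15 p.m."
    ("TIME", re.compile(r"\b\d{1,2}:\d{2}(?:\s?(?:a\.m\.|p\.m\.|AM|PM))?")),
]


def find_numeric_entities(text: str) -> List[Tuple[str, str]]:
    """Return (matched text, label) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    # Sort by offset so results follow the order of the text.
    return [(matched, label) for _, matched, label in sorted(found)]


if __name__ == "__main__":
    print(find_numeric_entities("The company paid $3.5 million at 9:30 AM."))
```

Expressions like these are far more regular than personal or company names, so pattern matching goes a long way for them in a way that it does not for names.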
Entity identification in the molecular biology domain
Over the past few years there has been a lot of interest in entity identification in the bioinformatics and molecular biology communities, and this is the area in which I've been working the most over the past couple of years. In these communities, "entity" generally means "gene or protein," and many systems don't attempt to recognize more than this single (combined) class. On one level, that makes the problem somewhat simpler. Overall, though, the problem actually seems to be much harder in this domain--see my paper (in submission to ISMB) for details on a number of reasons why this might be the case. To see a demo of a system that identifies genes and proteins in text, click here. For a demo of another system, click here.
Entity identification in molecular biology texts was recently the subject of a competition known as BioCreative--see the website for details. We hope that this competition will push progress in the field.
Where to go to read more about entity identification
For general information on entity identification, a good place to start is Chapter 5 of Jackson and Moulinier's Natural language processing for online applications. Then check out some of the papers on entity identification listed at CiteSeer. There are many good papers on entity identification in the molecular biology domain--here are some to start with, either because they're heavily cited, or because they're just plain good, or for both reasons:
- Fukuda, K.; T. Tsunoda; A. Tamura; and T. Takagi (1998) Toward information extraction: identifying protein names from biological papers. PSB 1998, pp. 705-716. The single most-often-cited paper in molbio entity identification, and the foundational paper on rule-based approaches.
- Collier, Nigel; Chikashi Nobata; and Jun-ichi Tsujii (2000) Extracting the names of genes and gene products with a hidden Markov model. Proceedings of COLING 2000, pp. 201-207.
- Tanabe and Wilbur (2002) Tagging gene and protein names in biomedical text. Bioinformatics 18(8):1124-1132.
- Krauthammer, Michael; Andrey Rzhetsky; Pavel Morozov; and Carol Friedman (2000) Using BLAST for identifying gene and protein names in journal articles. Gene 259(1-2):245-252.
- Hatzivassiloglou et al. (2001) Disambiguating proteins, genes, and RNA in text: a machine learning approach. Bioinformatics 17(Suppl. 1):S97-S106.
- Cohen, K. Bretonnel; Andrew E. Dolbey; George K. Acquaah-Mensah; and Lawrence Hunter (2002) Contrast and variability in gene names. Proceedings of the Workshop on Natural Language Processing in the Biomedical Domain, pp. 14-20.