Entity identification

What entity identification is

Entity identification, also known as named entity recognition, is the task of locating the names of things in text. For instance, given a text like "John Denver will be playing at the Denver Buffalo Company in Denver, CO," an entity identification system should locate "John Denver," "Denver Buffalo Company," and "Denver, CO." Typically the task is defined as not just locating the names, but also assigning each one to one of a given set of classes. So, for the example sentence, you would want your system to realize that John Denver is a person, that Denver Buffalo Company is a company (a restaurant, actually), and that Denver, CO is a place.
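To make the two halves of the task (locating a name, then assigning it a class) concrete, here is a minimal sketch that runs a longest-match dictionary lookup over the example sentence. Everything in it is an assumption made for illustration: the toy name list GAZETTEER, the class labels, the Entity record, and the find_entities function are all hypothetical, and real systems generally rely on much more than a fixed list of names.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    start: int   # character offset where the name begins
    end: int     # character offset just past the end of the name
    text: str    # the name as it appears in the text
    label: str   # the class assigned to the name


# Hypothetical toy name list; the labels are illustrative only.
GAZETTEER = {
    "John Denver": "PERSON",
    "Denver Buffalo Company": "ORGANIZATION",
    "Denver, CO": "LOCATION",
    "Denver": "LOCATION",
}

SENTENCE = "John Denver will be playing at the Denver Buffalo Company in Denver, CO."


def find_entities(text, gazetteer=GAZETTEER):
    """Scan left to right, preferring the longest known name at each position."""
    names = sorted(gazetteer, key=len, reverse=True)  # longest names first
    entities = []
    i = 0
    while i < len(text):
        for name in names:
            if text.startswith(name, i):
                entities.append(Entity(i, i + len(name), name, gazetteer[name]))
                i += len(name)  # skip past the match so names never overlap
                break
        else:
            i += 1  # no known name starts here; move on one character
    return entities


for entity in find_entities(SENTENCE):
    print(entity.start, entity.end, entity.label, entity.text)
```

The sentence also shows why plain lookup is not enough. The string "Denver" occurs three times, once inside each name, and it is only the longest-match rule plus the left-to-right scan that keeps the sketch from labeling the "Denver" in "John Denver" as a place.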

The classic definitions of the entity identification task came out of a series of competitions called the Message Understanding Conferences. Their definitions of "name" and "entity" were somewhat broader than you might expect, including things like monetary amounts and times.

Entity identification in the molecular biology domain

Over the past few years there has been a lot of interest in entity identification in the bioinformatics and molecular biology communities, and this is the area in which I've been working the most. In these communities, "entity" generally means "gene or protein," and many systems don't attempt to recognize more than this single (combined) class. On one level, that makes the problem somewhat simpler. Overall, though, the problem actually seems to be much harder. See my paper (in submission to ISMB) for details on a number of reasons why this might be the case. To see a demo of a system that identifies genes and proteins in text, click here. For a demo of another system, click here.

Entity identification in molecular biology texts was recently the subject of a competition known as BioCreative--see the website for details. We hope that this competition will push progress in the field.

Where to go to read more about entity identification

For general information on entity identification, a good place to start is Chapter 5 of Jackson and Moulinier's Natural language processing for online applications. Then check out some of the papers on entity identification listed at CiteSeer. There are many good papers on entity identification in the molecular biology domain--here are some to start with, either because they're heavily cited, or because they're just plain good, or for both reasons:
