What information extraction is

Information extraction is the task of finding specific kinds of information in texts. Unlike natural language understanding, which tries to actually understand a text, information extraction only tries to recognize assertions about very specific kinds of facts. For example, given an input text like "John Denver will be performing at the Denver Buffalo Company in Denver, CO on Oct. 29, 1929", you would expect the output of a natural language understanding system to allow you to answer questions like "Will John Denver be alive on Oct. 29, 1929?" and "Will John Denver's body be located in America on Oct. 29, 1929?". In contrast, you would expect the output of an information extraction system to be much less ambitious: perhaps it would extract the fact that there is a performance event, that the performer is named John Denver, that the date of the performance is Oct. 29, 1929, and that the location is the Denver Buffalo Company. (Recognizing that John Denver is a "performer" and that the Denver Buffalo Company is a place where a performance can take place is an instance of the entity identification task.)
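The kind of output described above can be pictured as a filled-in template. The following is a minimal sketch in Python, assuming a single hand-written pattern for this one sentence shape; the names PerformanceEvent, PATTERN, and extract_performance are hypothetical and chosen only for illustration. Real systems use far more robust methods, and the capitalized-word heuristic here is a crude stand-in for entity identification.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class PerformanceEvent:
    """Hypothetical template with one slot per extracted fact."""
    performer: str
    location: str
    date: str


# Assumed pattern: "<Name> will be performing at [the] <Place> ... on <Mon. D, YYYY>".
# Runs of capitalized words stand in for real entity identification.
PATTERN = re.compile(
    r"(?P<performer>[A-Z][a-z]+(?: [A-Z][a-z]+)*) will be performing at "
    r"(?:the )?(?P<location>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"
    r".*? on (?P<date>[A-Z][a-z]{2}\. \d{1,2}, \d{4})"
)


def extract_performance(text: str) -> Optional[PerformanceEvent]:
    """Return the filled template, or None if the text does not fit the pattern."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return PerformanceEvent(
        performer=match.group("performer"),
        location=match.group("location"),
        date=match.group("date"),
    )


sentence = (
    "John Denver will be performing at the Denver Buffalo Company "
    "in Denver, CO on Oct. 29, 1929"
)
event = extract_performance(sentence)
print(event)
# PerformanceEvent(performer='John Denver', location='Denver Buffalo Company', date='Oct. 29, 1929')
```

Note that the sketch fills the slots without being able to answer any question about John Denver himself, which is exactly the gap between information extraction and natural language understanding.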

The information extraction task was first, or at least most explicitly, defined in the context of a series of competitions called the Message Understanding Conferences (MUC). Examples of classic MUC information extraction tasks include finding statements about terrorist attacks, where the types of information extracted included the place, the victims, and the terrorist organization responsible, and finding statements about corporate acquisitions.

Information extraction in the molecular biology domain

In recent years there has been considerable interest in the molecular biology and bioinformatics community in applying information extraction to molecular biology texts. The typical input envisioned by workers in this area is abstracts of journal articles. Typical targets include protein-protein interactions and the subcellular localization of proteins. The major challenge has been the entity identification problem. Although entity identification systems that perform with a high degree of accuracy are available for typical MUC-like tasks, similar levels of performance have remained elusive in the molecular biology domain. This hampers progress in information extraction considerably.

