What information extraction is
Information extraction is the task of finding specific kinds of information in texts. Unlike natural language understanding, which tries to actually understand texts, information extraction only tries to recognize assertions about very specific kinds of facts. For example, given an input text like "John Denver will be performing at the Denver Buffalo Company in Denver, CO on Oct. 29, 1929", you would expect the output of a natural language understanding system to allow you to answer questions like "Will John Denver be alive on Oct. 29, 1929?" and "Will John Denver's body be located in America on Oct. 29, 1929?" In contrast, you would expect the output of an information extraction system to be much less ambitious: perhaps it would extract the fact that there is a performance event, that the performer is named John Denver, that the date of the performance is Oct. 29, 1929, and that the location is the Denver Buffalo Company. (Recognizing that John Denver is a "performer" and that the Denver Buffalo Company is a place where a performance can take place is an instance of the entity identification task.)
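To make the contrast concrete, here is a minimal sketch of what the output of an information extraction system for the example sentence might look like. It is only an illustration: the PerformanceEvent template, the extract_performance function, and the single hand-written pattern are hypothetical names and choices made up for this example, and a real system would use far more robust entity identification than one regular expression.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class PerformanceEvent:
    """Hypothetical template with the slots named in the text above."""
    performer: str
    venue: str
    date: str


# One hand-written pattern for sentences shaped like the example.
# This is an assumption for illustration, not a general-purpose grammar.
PERFORMANCE_PATTERN = re.compile(
    r"(?P<performer>[A-Z][a-z]+(?: [A-Z][a-z]+)*) will be performing at "
    r"(?:the )?(?P<venue>[A-Z]\w*(?: [A-Z]\w*)*) in "
    r"[A-Z][a-z]+(?: [A-Z][a-z]+)*, [A-Z]{2} on "
    r"(?P<date>[A-Z][a-z]{2}\. \d{1,2}, \d{4})"
)


def extract_performance(text: str) -> Optional[PerformanceEvent]:
    """Fill the template if the pattern matches; otherwise return None."""
    match = PERFORMANCE_PATTERN.search(text)
    if match is None:
        return None
    return PerformanceEvent(
        performer=match.group("performer"),
        venue=match.group("venue"),
        date=match.group("date"),
    )


sentence = ("John Denver will be performing at the Denver Buffalo Company "
            "in Denver, CO on Oct. 29, 1929")
event = extract_performance(sentence)
print(event)
```

Note that the template says nothing about whether John Denver is alive or where his body is. It records only the few facts that the task defines in advance, which is exactly what makes information extraction less ambitious than natural language understanding.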
The information extraction task was first, or at least most explicitly, defined in the context of a series of competitions called the Message Understanding Conferences (MUC). Examples of classic MUC information extraction tasks include finding statements about terrorist attacks, where the types of information extracted included the place, the victims, and the terrorist organization responsible, and finding statements about corporate acquisitions.
Information extraction in the molecular biology domain
In recent years there has been considerable interest in the molecular biology and bioinformatics communities in applying information extraction to molecular biology texts. The typical input envisioned by workers in this area is abstracts of journal articles. Typical targets include protein-protein interactions and the subcellular localization of proteins. The major challenge has been the entity identification problem. Although entity identification systems that perform with high accuracy are available for typical MUC-like tasks, similar levels of performance have remained elusive in the molecular biology domain, and this considerably hampers progress in information extraction.
Where to go to read more about information extraction
Things to read on general information extraction:
Things to read on information extraction in the molecular biology domain:
- Craven, Mark and Johan Kumlien (1999): Constructing biological knowledge bases by extracting information from text sources. Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology (ISMB-99), pp. 77-86. AAAI Press. Applies Bayesian classification to answer the question of whether or not a sentence makes an assertion about two entities.
- Ono, Toshihide; Haretsugu Hishigaki; Akira Tanigami; and Toshihisa Takagi (2001): Automated extraction of information on protein-protein interactions from the biological literature. Bioinformatics 17(2):155-161. A classic cascaded FST system, in the spirit of FASTUS. If you like this, see Leroy et al. (2003) below.
- Rindflesch et al.: Mining molecular binding terminology from biomedical text.
- Rindflesch et al.: EDGAR: extraction of drugs, genes and relations from the biomedical literature.
- Marcotte et al.: Mining literature for protein-protein interactions. Bioinformatics 17(4):359-363.
- Friedman, Carol; Pauline Kra; Hong Yu; Michael Krauthammer; and Andrey Rzhetsky (2001): GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics 17(Suppl. 1):S74-S82.
- Thomas et al.: Automatic extraction of protein interactions from scientific abstracts.
- Ding, J.; D. Berleant; D. Nettleton; and E. Wurtele (2002): Mining Medline: abstracts, sentences, or phrases? PSB 2002. An empirical answer to the question of what the appropriate input size is for an information extraction system.