Corpora for biomedical natural language processing
A project of the Biomedical Text Mining Group
at the Center for Computational Pharmacology
Corpus design for biomedical natural language processing

This page provides a link to, and supplemental material for, our ACL-ISMB BioLINK 2005 paper. The paper discusses data on the design and usage rates of six corpora constructed to support research in biomedical natural language processing. Based on that data, we make a number of recommendations for future corpus construction work.

  • Cohen, K. Bretonnel; Lynne Fox; Philip V. Ogren; and Lawrence Hunter (2005) Corpus design for biomedical natural language processing. Proceedings of the ACL-ISMB workshop on linking biological literature, ontologies and databases: mining biological semantics, pp. 38-45.
  • How we did the word counts: Biomedical corpora are currently distributed in a wide diversity of formats, so we had to write separate word-counting code for each corpus. The linked page describes what we counted for each one.

