Corpora for biomedical natural language processing
A project of the Biomedical Text Mining Group
at the Center for Computational Pharmacology


Counting words in six biomedical corpora

The six corpora discussed in the paper are distributed in six different formats, so producing a word count for each one required a separate parser. The final version of the paper didn't have room for the details of how we arrived at the reported word counts, so the following sections explain what we counted in each corpus and, where relevant, give the code that we used.

Word count for the PDG corpus

The PDG corpus is distributed as a single HTML file. We removed all HTML formatting from the file, leaving a file in which comments, annotations, and text are all in the same format but on separate lines. We then deleted all comments and annotations by hand, leaving just the text, and used the unix wc command to count whitespace-tokenized words. This gives a count of 10,291 words. See pdg_corpus.txt for the file with all HTML, comments, and annotations removed.
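
As an illustration, here is a minimal Python sketch of the mechanical part of that process: stripping the HTML markup and counting whitespace-delimited tokens (the same count that wc -w gives). The hand-editing of comment and annotation lines is not reproduced here, and the simple tag-stripping regex is just a stand-in for however the markup was actually removed.

import re
import sys

def strip_html(text):
    """Remove HTML tags with a simple pattern, leaving the text content in place."""
    return re.sub(r"<[^>]+>", "", text)

def count_words(text):
    """Whitespace tokenization, the same count that `wc -w` produces."""
    return len(text.split())

if __name__ == "__main__":
    # Usage: python count_pdg.py <html file>
    with open(sys.argv[1], encoding="utf-8") as f:
        print(count_words(strip_html(f.read())))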

Word count for the Wisconsin corpus

Size for the U. Wisconsin corpus is based on the first line of all data files in the MIPS/all (1,080,265 words), OMIM/all (291,397 words), and YPD/all (158,069 words) directories. See the counting script for exactly what was counted as a word.
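
The following Python sketch shows how such a count could be produced. It assumes that the first line of each data file holds the text to be counted and that a word is a whitespace-delimited token; the actual counting script's definition of a word may differ.

import os

def count_first_lines(directory):
    """Sum whitespace-delimited tokens over the first line of each file."""
    total = 0
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, encoding="utf-8", errors="replace") as f:
            total += len(f.readline().split())
    return total

if __name__ == "__main__":
    for directory in ("MIPS/all", "OMIM/all", "YPD/all"):
        print(directory, count_first_lines(directory))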

Word count for the GENIA corpus

Size for the GENIA corpus is based on the file GENIAcorpus3.02.pos.txt. This file has each token on a separate line, so the main issue for this corpus is avoiding counting tokens that aren't words. The script shows exactly what was ignored and what was counted.
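
Here is a rough Python sketch of that kind of filtering. It assumes that each non-blank line of GENIAcorpus3.02.pos.txt begins with the token itself, and it skips tokens that consist entirely of punctuation; the actual script's exclusion rules may be more involved.

import string
import sys

PUNCTUATION = set(string.punctuation)

def is_word(token):
    """Count a token as a word unless it consists entirely of punctuation."""
    return not all(ch in PUNCTUATION for ch in token)

def count_words(path):
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if not fields:          # skip blank separator lines
                continue
            if is_word(fields[0]):  # first field is assumed to be the token
                total += 1
    return total

if __name__ == "__main__":
    print(count_words(sys.argv[1]))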

Word count for the Yapex corpus

Size for the Yapex corpus is based on the <ArticleTitle> and <AbstractText> elements in the yapex_ref_collection.txt (23,049 words) and yapex_test_collection.txt (22,094 words) files. We extracted the plain text from the XML, whitespace-tokenized it, and counted the resulting tokens.
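
A Python sketch of the extraction and count might look like the following. It assumes that the <ArticleTitle> and <AbstractText> contents can be pulled out of the collection files with a simple pattern; words are whitespace-delimited tokens of the extracted text.

import re
import sys

ELEMENT = re.compile(r"<(ArticleTitle|AbstractText)[^>]*>(.*?)</\1>", re.DOTALL)

def count_words(path):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    total = 0
    for _tag, content in ELEMENT.findall(text):
        plain = re.sub(r"<[^>]+>", "", content)  # drop any nested markup
        total += len(plain.split())
    return total

if __name__ == "__main__":
    # Usage: python count_yapex.py yapex_ref_collection.txt yapex_test_collection.txt
    for path in sys.argv[1:]:
        print(path, count_words(path))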

Word count for the GENETAG corpus

Size for the GENETAG corpus is based on the TAGGED_GENE_CORPUS files in the train (170,832 words), test (56,761 words), and round1 (114,981 words) directories. Round2 data is described in the paper but is not included in the current distribution. Tanabe et al. give much higher counts (204,195 for train, 68,043 for test, and 137,586 for round1), but I suspect that those figures were computed after tokenizing punctuation and are therefore counts of tokens, not words.
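
To illustrate the difference, the following toy Python example (on a made-up sentence, not corpus data) shows how splitting punctuation off into separate tokens inflates the count relative to whitespace-delimited words.

import re

sentence = "BRCA1 (breast cancer 1) interacts with BARD1, RAD51, and p53."

words = sentence.split()                       # whitespace tokenization: 10 words
tokens = re.findall(r"\w+|[^\w\s]", sentence)  # punctuation split off: 15 tokens

print(len(words), words)
print(len(tokens), tokens)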
