Corpora for biomedical natural language processing
A project of the Biomedical Text Mining Group
at the Center for Computational Pharmacology
Lab: RC-1 S. Room L18-6400A
Phone: 303-916-2417
E-mail: Kevin.Cohen@gmail.com

Home Obtaining corpora Publications Empirical data on corpus usage Corpus design Survey data

Obtaining corpora and text collections for biomedical natural language processing

This page links to publicly available corpora and text collections for biomedical natural language processing. If you know of any that I've missed, please let me know. If you're looking for tools for biomedical natural language processing, see Alex Morgan's page. You can find other items of interest at the BioNLP web site.

AIMed ftp://ftp.cs.utexas.edu/pub/mooney/bio-data/
Bio1 http://research.nii.ac.jp/~collier/resources/bio1.1.xml
BioCreative 2006 http://biocreative.sourceforge.net/biocreative_2_dataset.html
PennBioIE http://bioie.ldc.upenn.edu/
BioInfer http://www.it.utu.fi/BioInfer/?q=download
BioText http://biotext.berkeley.edu/data.html
Brown-GENIA Treebank http://www.cog.brown.edu/Research/nlp/resources.html#genia
DepGenia http://www.ifi.unizh.ch/cl/kalju/download/depgenia/
EDGAR ftp://ftp.ncbi.nlm.nih.gov/pub/tanabe/EDGAR_GS.txt
FlySlip hedge classification data http://www.cl.cam.ac.uk/~bwm23/hedgeclassif.html
FlySlip NER abstracts http://www.wiki.cl.cam.ac.uk/rowiki/NaturalLanguage/FlySlip/Flyslip-resources?action=AttachFile&do=get&target=abstr-ner-corpus.iob.gz
FlySlip NER full text http://www.wiki.cl.cam.ac.uk/rowiki/NaturalLanguage/FlySlip/Flyslip-resources?action=AttachFile&do=get&target=full-paper-ner.iob.gz
GENIA http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/
GREC (Gene Regulation Event Corpus) http://www.nactem.ac.uk/GREC
FetchProt http://www.sics.se/humle/projects/fetchprot/
IEPA http://class.ee.iastate.edu/berleant/s/IEPA.htm
iProLink http://pir.georgetown.edu/iprolink/
Medstract http://www.medstract.org/gold-standards.html (link broken as of 5/16/2008; looking for a new one)
MedTag ftp://ftp.ncbi.nlm.nih.gov/pub/lsmith/MedTag/medtag.tar.gz
OHSUMED http://ir.ohsu.edu/ohsumed/
PICorpus http://bionlp.sourceforge.net/PICorpus/index.shtml
Protein Design Group (see also PICorpus, a later version of this) http://www.pdg.cnb.uam.es/medline_interactions/
TREC Genomics 2004, 2005, and 2006 http://ir.ohsu.edu/genomics/data.html
Wisconsin http://www.biostat.wisc.edu/~craven/ie/
WSD http://wsd.nlm.nih.gov/
Yapex http://www.sics.se/humle/projects/prothalt/




This document last modified 08/09/10 13:08.