Friday, October 21, 2005
2. Identifying information (Optional)
1. What is your name? (Optional)
Total Respondents   23
(skipped this question)   10
2. What is your email address? (Optional)
Total Respondents   23
(skipped this question)   10
3. Are you a... (Optional)
    Response Percent  Response Total
Researcher
    69%               20
Student
    20.7%             6
Industry/contractor
    6.9%              2
Other (please specify)
    3.4%              1
Total Respondents   29
(skipped this question)   4
3. Corpora
4. Identify the corpora that you have used to build or test a biomedical language processing system, whether or not you have published a paper on that system. (Check all that apply.)
    Response Percent  Response Total
I have never used a corpus to build or test a biomedical language processing system
    8%                2
BioIE (University of Pennsylvania)
    20%               5
GENIA corpus (any version)
    76%               19
GENETAG or the BioCreative 2004 Task 1A corpus
    32%               8
MEDSTRACT
    12%               3
Protein Information Resource corpus (Georgetown)
    8%                2
Protein Design Group's protein-protein interaction corpus (described in Blaschke et al. 1999)
    8%                2
Wisconsin corpus (described in Craven's publications)
    4%                1
Yapex
    24%               6
Other (please list any that you've used)
    24%               6
Total Respondents   25
(skipped this question)   8
5. If you have used a corpus, identify what tasks you have used these corpora for. (Optional)
    Entity identification  Entity normalization  Information extraction  Question-answering  Summarization  Respondent Total
BioIE (University of Pennsylvania)
    50% (1)                50% (1)               100% (2)                0% (0)              0% (0)         2
GENIA corpus (any version)
    95% (19)               25% (5)               45% (9)                 5% (1)              5% (1)         20
GENETAG or the BioCreative 2004 Task 1A corpus
    100% (6)               33% (2)               17% (1)                 0% (0)              0% (0)         6
MEDSTRACT
    67% (2)                67% (2)               33% (1)                 0% (0)              0% (0)         3
Protein Information Resource corpus (Georgetown)
    0% (0)                 0% (0)                100% (1)                0% (0)              0% (0)         1
Protein Design Group's protein-protein interaction corpus (described in Blaschke et al. 1999)
    0% (0)                 0% (0)                100% (1)                0% (0)              0% (0)         1
Wisconsin corpus (described in Craven's publications)
    0% (0)                 50% (1)               50% (1)                 0% (0)              0% (0)         2
Yapex
    86% (6)                43% (3)               29% (2)                 14% (1)             0% (0)         7
Total Respondents   21
(skipped this question)   12
6. Identify any corpora that you have ceased using due to problems of distribution format, annotation format, or other issues not related to the content or quality of the data. (Check all that apply.)
    Response Percent  Response Total
I have never ceased using a corpus due to those kinds of problems
    72%               18
BioIE (University of Pennsylvania)
    0%                0
GENIA corpus (any version)
    20%               5
GENETAG or the BioCreative 2004 Task 1A corpus
    4%                1
MEDSTRACT
    0%                0
Protein Information Resource corpus (Georgetown)
    0%                0
Protein Design Group's protein-protein interaction corpus (described in Blaschke et al. 1999)
    4%                1
Wisconsin corpus (described in Craven's publications)
    0%                0
Yapex
    0%                0
Other (please list any that you've used)
    4%                1
Total Respondents   25
(skipped this question)   8
7. If you have ceased using a corpus, please tell us why. (Optional)
Total Respondents   5
(skipped this question)   28
4. General
8. Rate the importance of each of these corpus features to your work in biomedical language processing.
    Very important  Important  Not important  No opinion  Response Average
Format
    18% (4)         50% (11)   32% (7)        0% (0)      2.14
Size
    27% (6)         59% (13)   5% (1)         9% (2)      1.95
Curation
    36% (8)         50% (11)   5% (1)         9% (2)      1.86
Age
    5% (1)          18% (4)    50% (11)       27% (6)     3.00
Genre
    18% (4)         55% (12)   23% (5)        5% (1)      2.14
Subject matter
    36% (8)         27% (6)    18% (4)        18% (4)     2.18
Having the right annotations for my task
    77% (17)        14% (3)    9% (2)         0% (0)      1.32
Cost
    41% (9)         27% (6)    23% (5)        9% (2)      2.00
Access to interannotator agreement data
    27% (6)         41% (9)    14% (3)        18% (4)     2.23
Access to annotation guidelines
    36% (8)         50% (11)   9% (2)         5% (1)      1.82
Total Respondents   22
(skipped this question)   11
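
Note on "Response Average": the export does not define this column. The published figures are consistent with a count-weighted mean of the column positions, numbered 1 for the leftmost column and counting "No opinion" as 4. That reading is an inference from the numbers, not something the survey states. A minimal Python sketch under that assumption (the function name response_average is hypothetical):

```python
def response_average(counts):
    """Count-weighted mean of 1-based column positions.

    counts lists the respondents per column, left to right.
    Assumption: columns are scored 1, 2, 3, ... in display order,
    so "No opinion" in a four-column question is scored 4.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("no responses to average")
    weighted = sum(pos * n for pos, n in enumerate(counts, start=1))
    return round(weighted / total, 2)


# Counts copied from question 8, left to right:
# Very important, Important, Not important, No opinion
format_counts = [4, 11, 7, 0]
age_counts = [1, 4, 11, 6]

print(response_average(format_counts))  # matches the reported 2.14
print(response_average(age_counts))     # matches the reported 3.00
```

If this reading is right, lower averages indicate higher rated importance, and "No opinion" answers push a row's average upward just as "Not important" answers do, which is worth keeping in mind when comparing rows.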
9. What would be your preferred format for a corpus?
    Response Percent  Response Total
SGML
    0%                0
Embedded XML
    50%               11
Brill
    4.5%              1
Standoff annotation (raw text and annotations in separate files)
    40.9%             9
Other
    4.5%              1
Total Respondents   22
(skipped this question)   11
10. Do you have other comments about these basic issues in corpus design? (Optional)
Total Respondents   5
(skipped this question)   28
5. Curation
11. How important is it to your present or future work that a corpus's annotation be manually curated?
    Strongly agree  Agree     No opinion  Disagree  Strongly disagree  Response Average
It is very important for a corpus to be manually curated--uncorrected automatic annotation is not sufficient
    59% (13)        23% (5)   5% (1)      14% (3)   0% (0)             1.73
Automatic annotation alone is sufficient, but manual curation and/or manual correction of automatic annotation is better
    23% (5)         41% (9)   5% (1)      27% (6)   5% (1)             2.50
Uncurated automatic annotation is sufficient, and manual curation is not necessary
    0% (0)          9% (2)    9% (2)      32% (7)   50% (11)           4.23
Total Respondents   22
(skipped this question)   11
12. Do you have any comments about the role of manual curation versus automatic annotation in corpus development? (Optional)
Total Respondents   5
(skipped this question)   28
6. Materials
13. Rate the importance of each of these text types.
    Very important  Important  Not important  No opinion  Response Average
Abstracts of journal articles
    48% (10)        52% (11)   0% (0)         0% (0)      1.52
Sentences from abstracts
    24% (5)         48% (10)   24% (5)        5% (1)      2.10
Full journal articles
    76% (16)        24% (5)    0% (0)         0% (0)      1.24
Material from textbooks
    19% (4)         38% (8)    19% (4)        24% (5)     2.48
GeneRIFs
    19% (4)         29% (6)    10% (2)        43% (9)     2.76
Definitions from ontologies and controlled vocabularies
    52% (11)        24% (5)    10% (2)        14% (3)     1.86
Total Respondents   21
(skipped this question)   12
14. Are there some other types of texts that you think should be included in a biomedical corpus? (Optional)
Total Respondents   5
(skipped this question)   28
7. Linguistic and structural features
15. Rate the importance of annotation of each of these structural features in a corpus.
    Very important  Important  Not important  No opinion  Response Average
Tokens
    45% (9)         20% (4)    30% (6)        5% (1)      1.95
Sentence boundaries
    45% (9)         25% (5)    25% (5)        5% (1)      1.90
For full text, document sections (e.g. Introduction, Methods, etc.)
    50% (10)        35% (7)    10% (2)        5% (1)      1.70
Total Respondents   20
(skipped this question)   13
16. Rate the importance of annotation of each of these kinds of linguistic information.
    Very important  Important  Not important  No opinion  Response Average
Part of speech
    45% (9)         30% (6)    20% (4)        5% (1)      1.85
Lemmas or stems
    30% (6)         30% (6)    20% (4)        20% (4)     2.30
Shallow syntactic parse
    35% (7)         55% (11)   5% (1)         5% (1)      1.80
Full syntactic parse
    35% (7)         30% (6)    15% (3)        20% (4)     2.20
Total Respondents   20
(skipped this question)   13
17. Rate the importance of these kinds of annotations for your work.
    Very important  Important  Not important  No opinion  Response Average
Named entities
    95% (19)        5% (1)     0% (0)         0% (0)      1.05
Relations between entities
    85% (17)        15% (3)    0% (0)         0% (0)      1.15
Total Respondents   20
(skipped this question)   13
18. Do you have any further comments about types of structural and linguistic annotation? (Optional)
Total Respondents   4
(skipped this question)   29
8. Semantic features
19. Rate the importance of these types of annotations to your work.
    Very important  Important  Not important  No opinion  Response Average
Coreference
    40% (8)         40% (8)    15% (3)        5% (1)      1.85
Acronym/abbreviation definitions
    35% (7)         50% (10)   10% (2)        5% (1)      1.85
Total Respondents   20
(skipped this question)   13
20. Rate the importance of annotation with the following kinds of named entities to your work.
    Very important  Important  Not important  No opinion  Response Average
Genes and gene products
    85% (17)        10% (2)    0% (0)         5% (1)      1.25
Cell lines
    20% (4)         55% (11)   10% (2)        15% (3)     2.20
Diseases
    55% (11)        25% (5)    10% (2)        10% (2)     1.75
Drug names
    50% (10)        25% (5)    15% (3)        10% (2)     1.85
Chemical names
    40% (8)         40% (8)    10% (2)        10% (2)     1.90
Populations
    10% (2)         30% (6)    35% (7)        25% (5)     2.75
Anatomy
    30% (6)         40% (8)    25% (5)        5% (1)      2.05
Species
    45% (9)         40% (8)    5% (1)         10% (2)     1.80
Strains
    10% (2)         45% (9)    30% (6)        15% (3)     2.50
Dosages
    5% (1)          35% (7)    40% (8)        20% (4)     2.75
Gene Ontology concepts
    50% (10)        45% (9)    0% (0)         5% (1)      1.60
UMLS concepts
    25% (5)         45% (9)    20% (4)        10% (2)     2.15
Total Respondents   20
(skipped this question)   13
21. Are there some other classes of entities that would be important to your current or future work, or do you have any further comments about semantic annotation? (Optional)
Total Respondents   3
(skipped this question)   30
9. Task types
22. Which of these do you see as the three highest priorities for future data set construction efforts? (We include here tasks for which text collections, rather than corpora, are used.)
    Highest priority  2nd highest priority  3rd highest priority  Response Average
Question answering
    10% (1)           40% (4)               50% (5)               2.40
Information extraction
    53% (10)          37% (7)               11% (2)               1.58
Entity identification (named entity recognition)
    14% (1)           29% (2)               57% (4)               2.43
Summarization
    10% (1)           40% (4)               50% (5)               2.40
Information retrieval
    33% (2)           17% (1)               50% (3)               2.17
Entity normalization
    62% (5)           25% (2)               12% (1)               1.50
Total Respondents   20
(skipped this question)   13
23. Do you have any further comments about biomedical corpora for specific task types? (Optional)
Total Respondents   0
(skipped this question)   33
10. Final comments
24. Please share any other thoughts on corpus design or comments on this survey. (Optional)
Total Respondents   2
(skipped this question)   31
