2016-11-17

Bioconductor - A brief review

biomaRt package for annotation

Gene centric AnnotationDbi packages:

  • Organism level: org.Mm.eg.db
  • Platform level: hgu133plus2.db
  • Systems biology level: GO.db, KEGG.db

Genome centric GenomicFeatures package:

  • Transcriptome level: TxDb.Hsapiens.UCSC.hg19.knownGene

biomaRt

  • An R API into the biomart annotations

biomaRt idea

Primary - Foreign Keys

A mart is a collection of tables that are linked together coherently through primary and foreign keys.
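
The key idea can be sketched in base R, with two small hand-made tables standing in for mart tables. The table names, column names, and GO assignments below are illustrative only and are not the actual BioMart schema; `merge()` plays the role of the join that a mart performs through its keys.

```r
# Toy tables (hypothetical contents) standing in for two mart tables.
# gene_id is the primary key of `genes` and a foreign key in `go_terms`.
genes <- data.frame(
  gene_id = c("g1", "g2", "g3"),
  symbol  = c("AAA", "BBB", "CCC"),
  stringsAsFactors = FALSE
)
go_terms <- data.frame(
  gene_id = c("g1", "g1", "g3"),
  go_id   = c("GO:0000001", "GO:0000002", "GO:0000003"),
  stringsAsFactors = FALSE
)

# Join the two tables on the shared key; genes without a GO row drop out.
gene_go <- merge(genes, go_terms, by = "gene_id")
gene_go
```

Because every table carries a key that points at another table, a single query can pull columns from several tables at once.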

biomaRt (R package)

biomaRt is an R interface to BioMart (www.biomart.org), a system that integrates a wide range of biological annotation databases.

A database query has three parts:

getBM(
  attributes= ... ,
  filters= ... ,
  values= ...
)

List Available Marts

library(biomaRt)

listMarts()
##                biomart               version
## 1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 86
## 2   ENSEMBL_MART_MOUSE      Mouse strains 86
## 3     ENSEMBL_MART_SNP  Ensembl Variation 86
## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 86
## 5    ENSEMBL_MART_VEGA               Vega 66
ensembl.m = useMart(biomart = "ENSEMBL_MART_ENSEMBL")

List Available Datasets

listDatasets(ensembl.m)
##                           dataset
## 1          oanatinus_gene_ensembl
## 2         cporcellus_gene_ensembl
## 3         gaculeatus_gene_ensembl
## 4  itridecemlineatus_gene_ensembl
## 5          lafricana_gene_ensembl
## 6         choffmanni_gene_ensembl
## 7          csavignyi_gene_ensembl
## 8             fcatus_gene_ensembl
## 9        rnorvegicus_gene_ensembl
## 10         psinensis_gene_ensembl
## 11          cjacchus_gene_ensembl
## 12        ttruncatus_gene_ensembl
## 13       scerevisiae_gene_ensembl
## 14          celegans_gene_ensembl
## 15          csabaeus_gene_ensembl
## 16        oniloticus_gene_ensembl
## 17        amexicanus_gene_ensembl
## 18         trubripes_gene_ensembl
## 19          pmarinus_gene_ensembl
## 20        eeuropaeus_gene_ensembl
## 21       falbicollis_gene_ensembl
## 22         etelfairi_gene_ensembl
## 23     cintestinalis_gene_ensembl
## 24      ptroglodytes_gene_ensembl
## 25       nleucogenys_gene_ensembl
## 26           sscrofa_gene_ensembl
## 27        ocuniculus_gene_ensembl
## 28     dnovemcinctus_gene_ensembl
## 29         pcapensis_gene_ensembl
## 30          tguttata_gene_ensembl
## 31        mlucifugus_gene_ensembl
## 32          hsapiens_gene_ensembl
## 33          pformosa_gene_ensembl
## 34        tbelangeri_gene_ensembl
## 35             mfuro_gene_ensembl
## 36           ggallus_gene_ensembl
## 37       xtropicalis_gene_ensembl
## 38         ecaballus_gene_ensembl
## 39           pabelii_gene_ensembl
## 40            drerio_gene_ensembl
## 41        xmaculatus_gene_ensembl
## 42     tnigroviridis_gene_ensembl
## 43        lchalumnae_gene_ensembl
## 44      amelanoleuca_gene_ensembl
## 45          mmulatta_gene_ensembl
## 46         pvampyrus_gene_ensembl
## 47           panubis_gene_ensembl
## 48        mdomestica_gene_ensembl
## 49     acarolinensis_gene_ensembl
## 50            vpacos_gene_ensembl
## 51         tsyrichta_gene_ensembl
## 52        ogarnettii_gene_ensembl
## 53     dmelanogaster_gene_ensembl
## 54          mmurinus_gene_ensembl
## 55         loculatus_gene_ensembl
## 56          olatipes_gene_ensembl
## 57         oprinceps_gene_ensembl
## 58          ggorilla_gene_ensembl
## 59            dordii_gene_ensembl
## 60            oaries_gene_ensembl
## 61         mmusculus_gene_ensembl
## 62        mgallopavo_gene_ensembl
## 63           gmorhua_gene_ensembl
## 64          saraneus_gene_ensembl
## 65    aplatyrhynchos_gene_ensembl
## 66         sharrisii_gene_ensembl
## 67          meugenii_gene_ensembl
## 68           btaurus_gene_ensembl
## 69       cfamiliaris_gene_ensembl
##                                   description           version
## 1      Ornithorhynchus anatinus genes (OANA5)             OANA5
## 2             Cavia porcellus genes (cavPor3)           cavPor3
## 3      Gasterosteus aculeatus genes (BROADS1)           BROADS1
## 4  Ictidomys tridecemlineatus genes (spetri2)           spetri2
## 5          Loxodonta africana genes (loxAfr3)           loxAfr3
## 6         Choloepus hoffmanni genes (choHof1)           choHof1
## 7              Ciona savignyi genes (CSAV2.0)           CSAV2.0
## 8         Felis catus genes (Felis_catus_6.2)   Felis_catus_6.2
## 9          Rattus norvegicus genes (Rnor_6.0)          Rnor_6.0
## 10     Pelodiscus sinensis genes (PelSin_1.0)        PelSin_1.0
## 11  Callithrix jacchus genes (C_jacchus3.2.1)    C_jacchus3.2.1
## 12         Tursiops truncatus genes (turTru1)           turTru1
## 13   Saccharomyces cerevisiae genes (R64-1-1)           R64-1-1
## 14    Caenorhabditis elegans genes (WBcel235)          WBcel235
## 15      Chlorocebus sabaeus genes (ChlSab1.1)         ChlSab1.1
## 16    Oreochromis niloticus genes (Orenil1.0)         Orenil1.0
## 17       Astyanax mexicanus genes (AstMex102)         AstMex102
## 18            Takifugu rubripes genes (FUGU4)             FUGU4
## 19    Petromyzon marinus genes (Pmarinus_7.0)      Pmarinus_7.0
## 20       Erinaceus europaeus genes (HEDGEHOG)          HEDGEHOG
## 21     Ficedula albicollis genes (FicAlb_1.4)        FicAlb_1.4
## 22           Echinops telfairi genes (TENREC)            TENREC
## 23              Ciona intestinalis genes (KH)                KH
## 24         Pan troglodytes genes (CHIMP2.1.4)        CHIMP2.1.4
## 25        Nomascus leucogenys genes (Nleu1.0)           Nleu1.0
## 26             Sus scrofa genes (Sscrofa10.2)       Sscrofa10.2
## 27    Oryctolagus cuniculus genes (OryCun2.0)         OryCun2.0
## 28     Dasypus novemcinctus genes (Dasnov3.0)         Dasnov3.0
## 29          Procavia capensis genes (proCap1)           proCap1
## 30    Taeniopygia guttata genes (taeGut3.2.4)       taeGut3.2.4
## 31         Myotis lucifugus genes (Myoluc2.0)         Myoluc2.0
## 32             Homo sapiens genes (GRCh38.p7)         GRCh38.p7
## 33      Poecilia formosa genes (PoeFor_5.1.2)      PoeFor_5.1.2
## 34         Tupaia belangeri genes (TREESHREW)         TREESHREW
## 35 Mustela putorius furo genes (MusPutFur1.0)      MusPutFur1.0
## 36    Gallus gallus genes (Gallus_gallus-5.0) Gallus_gallus-5.0
## 37         Xenopus tropicalis genes (JGI_4.2)           JGI_4.2
## 38             Equus caballus genes (EquCab2)           EquCab2
## 39                 Pongo abelii genes (PPYG2)             PPYG2
## 40                 Danio rerio genes (GRCz10)            GRCz10
## 41  Xiphophorus maculatus genes (Xipmac4.4.2)       Xipmac4.4.2
## 42  Tetraodon nigroviridis genes (TETRAODON8)        TETRAODON8
## 43        Latimeria chalumnae genes (LatCha1)           LatCha1
## 44     Ailuropoda melanoleuca genes (ailMel1)           ailMel1
## 45          Macaca mulatta genes (Mmul_8.0.1)        Mmul_8.0.1
## 46          Pteropus vampyrus genes (pteVam1)           pteVam1
## 47             Papio anubis genes (PapAnu2.0)         PapAnu2.0
## 48      Monodelphis domestica genes (BROADO5)           BROADO5
## 49      Anolis carolinensis genes (AnoCar2.0)         AnoCar2.0
## 50              Vicugna pacos genes (vicPac1)           vicPac1
## 51           Tarsius syrichta genes (tarSyr1)           tarSyr1
## 52         Otolemur garnettii genes (OtoGar3)           OtoGar3
## 53      Drosophila melanogaster genes (BDGP6)             BDGP6
## 54        Microcebus murinus genes (Mmur_2.0)          Mmur_2.0
## 55       Lepisosteus oculatus genes (LepOcu1)           LepOcu1
## 56            Oryzias latipes genes (MEDAKA1)           MEDAKA1
## 57             Ochotona princeps genes (pika)              pika
## 58          Gorilla gorilla genes (gorGor3.1)         gorGor3.1
## 59            Dipodomys ordii genes (dipOrd1)           dipOrd1
## 60                Ovis aries genes (Oar_v3.1)          Oar_v3.1
## 61             Mus musculus genes (GRCm38.p4)         GRCm38.p4
## 62           Meleagris gallopavo genes (UMD2)              UMD2
## 63               Gadus morhua genes (gadMor1)           gadMor1
## 64        Sorex araneus genes (COMMON_SHREW1)     COMMON_SHREW1
## 65    Anas platyrhynchos genes (BGI_duck_1.0)      BGI_duck_1.0
## 66      Sarcophilus harrisii genes (DEVIL7.0)          DEVIL7.0
## 67          Macropus eugenii genes (Meug_1.0)          Meug_1.0
## 68                  Bos taurus genes (UMD3.1)            UMD3.1
## 69         Canis familiaris genes (CanFam3.1)         CanFam3.1

Select the mart and the dataset in one step

ensembl.m = useMart(biomart = "ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl")

Creating a Biomart Query

3 parts of the query:

  • attributes:
    • What you want to retrieve.
    • A vector of attributes
      • e.g. ensembl_gene_id
  • filters:
    • Property of the attributes
    • A vector of Filters used to qualify or constrain the attributes
  • values:
    • Values for the Filters
    • A list of vectors, where each position in the list corresponds to the position of the filter in the filters argument
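
The pairing between filters and values can be sketched as follows. `check_query()` is a hypothetical helper written for this example and is not part of biomaRt; it only makes the position-by-position correspondence explicit and fails early when the two arguments do not line up.

```r
# Hypothetical helper (not part of biomaRt): pair each filter with its values
# and stop if the two arguments differ in length.
check_query <- function(filters, values) {
  if (!is.list(values)) values <- list(values)  # a single filter may take a bare vector
  stopifnot(length(filters) == length(values))
  names(values) <- filters
  values
}

# Two filters, so `values` is a list of two vectors in the same order.
q <- check_query(
  filters = c("chromosome_name", "go_id"),
  values  = list(c("17", "20", "Y"), c("GO:0051330", "GO:0000080"))
)
str(q)
```

Naming the list after the filters makes it easy to see which vector constrains which filter before the query is sent.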

Information on attributes

listAttributes(ensembl.m)

grep(pattern = "gene", listAttributes(ensembl.m)[, 1])

attributePages(ensembl.m)

listAttributes(ensembl.m, page = "feature_page")
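
Searching the attribute table is ordinary data frame work. The sketch below uses a small hand-made data frame in place of the `listAttributes()` result so that it runs without a network connection; the `name` and `description` column layout is an assumption, and the rows are chosen for illustration only.

```r
# Stand-in for listAttributes(ensembl.m): a hand-made data frame with
# `name` and `description` columns (assumed layout, illustrative rows).
attrs <- data.frame(
  name = c("ensembl_gene_id", "hgnc_symbol", "chromosome_name", "go_id"),
  description = c("Gene stable ID", "HGNC symbol",
                  "Chromosome name", "GO term accession"),
  stringsAsFactors = FALSE
)

# grep() returns the matching row indices; use them to subset the table.
hits <- grep(pattern = "gene", attrs$name)
attrs[hits, ]

# Search the descriptions as well, ignoring case.
by_desc <- attrs[grepl("gene|symbol", attrs$description, ignore.case = TRUE), ]
by_desc
```

The same two calls should work on the real table once `attrs` is replaced by the data frame returned from the mart.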

Information on filters

listFilters(ensembl.m)

filterType("start", ensembl.m)

filterOptions("chromosome_name", ensembl.m)

Make a query on the mart

affids = c('202763_at', '209310_s_at', '207500_at')

getBM(attributes = c('affy_hg_u133_plus_2', 
                     'entrezgene', 
                     'uniprot_genename'),
      filters = 'affy_hg_u133_plus_2', 
      values = affids,
      mart = ensembl.m)
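
getBM() returns a data frame, and an identifier can appear on more than one row when it maps to several entries. A minimal sketch of collapsing such a result to one row per identifier, using a hand-made data frame whose column names and values are illustrative and not real query output:

```r
# Hand-made data frame shaped like a query result (illustrative values):
# probe "p2" maps to two gene IDs.
res <- data.frame(
  probe = c("p1", "p2", "p2"),
  gene  = c("101", "202", "203"),
  stringsAsFactors = FALSE
)

# Collapse to one row per probe, joining multiple gene IDs with ";".
collapsed <- aggregate(gene ~ probe, data = res, FUN = paste, collapse = ";")
collapsed
```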

Systems Biology: Gene Ontology Example

SNP biomart

snp.mart = useMart("ENSEMBL_MART_SNP")

listDatasets(snp.mart)

snp.mart = useMart(biomart = "ENSEMBL_MART_SNP", dataset = "hsapiens_snp")

listAttributes(snp.mart)

listFilters(snp.mart)

Query the SNP mart

snps = c('rs769449', 'rs514716', 
         'rs514716', 'rs9877502', 
         'rs514716', 'rs6922617')

snp.q = getBM(attributes = c('refsnp_id', 
                              'allele', 
                              'chrom_start', 
                              'ensembl_gene_stable_id'),
              filters = c('snp_filter'),
              values = list(snps),
              mart = snp.mart)
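
The `snps` vector above lists rs514716 three times. The source does not show how the server treats repeated IDs, so removing repeats before sending the query is simply a cautious habit; a small sketch:

```r
snps <- c("rs769449", "rs514716",
          "rs514716", "rs9877502",
          "rs514716", "rs6922617")

# Count how often each ID occurs and show the ones that repeat.
counts <- table(snps)
counts[counts > 1]

# Keep each ID once, in order of first appearance.
snps.unique <- unique(snps)
snps.unique
```

`snps.unique` can then be passed as `values = list(snps.unique)` in place of `list(snps)`.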

Query between 2 chromosome positions

snp.q2 = getBM(attributes = c('refsnp_id', 'allele', 
                              'chrom_start', 'chrom_strand'),
               filters = c('chr_name', 'start', 'end'),
               values = list(8, 148350, 148612),
               mart = snp.mart)
snp.q2

Homework 5

  • Go to the Bioconductor biomaRt page
  • From the Documentation section, select HTML
  • Study how to use the biomaRt manual
  • For Homework 5, while learning the biomaRt package, you will modify the existing code in specific segments of the manual.

Homework 5: Question 1: add description

Segment 4.1: Annotate a set of Affymetrix identifiers with HUGO symbol and chromosomal locations of corresponding genes

ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")

affyids=c("202763_at","209310_s_at","207500_at")
getBM(attributes = c('affy_hg_u133_plus_2', 'hgnc_symbol', 'chromosome_name',
                   'start_position', 'end_position', 'band'),
      filters = 'affy_hg_u133_plus_2', 
      values = affyids, 
      mart = ensembl)

Your task: 2 pt
Add one more attribute: the gene description

Homework 5: Question 2: Pathway Information

Segment 4.2: Annotate a set of EntrezGene identifiers with GO annotation

entrez=c("673","837")
goids = getBM(attributes = c('entrezgene', 'go_id'), 
              filters = 'entrezgene', 
              values = entrez, 
              mart = ensembl)
head(goids)

Your task: 2 pt
Can you add a Reactome ID?

Reactome Pathway Database

Homework 5: Question 3: retrieve by location and strand info

Segment 4.3: Retrieve all HUGO gene symbols of genes that are located on chromosomes 17,20 or Y, and are associated with specific GO terms

go=c("GO:0051330","GO:0000080","GO:0000114","GO:0000082")
chrom=c(17,20,"Y")
getBM(attributes= "hgnc_symbol",
        filters=c("go_id","chromosome_name"),
        values=list(go, chrom), mart=ensembl)

Your task: 2 pt
Can you only pull up genes on the positive strand?
Note: positive strand = "1", negative strand = "-1"

Homework 5: Question 4: Protein position info

Segment 4.4: Annotate a set of identifiers with INTERPRO protein domain identifiers

refseqids = c("NM_005359","NM_000546")
ipro = getBM(attributes=c("refseq_mrna",
                          "interpro",
                          "interpro_description"), 
             filters="refseq_mrna",
             values=refseqids, 
             mart=ensembl)
ipro

Your task: 2 pt
Add the protein start and end positions

Homework 5: Question 5: find gene between coordinates

Segment 4.5: Select all Affymetrix identifiers on the hgu133plus2 chip and Ensembl gene identifiers for genes located on chromosome 16 between basepair 1100000 and 1250000.

getBM(attributes = c('affy_hg_u133_plus_2','ensembl_gene_id'), 
      filters = c('chromosome_name','start','end'),
      values = list(16,1100000,1250000), 
      mart = ensembl)

Your task: 2 pt
Do the same for chromosome 1 between base pairs 2,000,000 and 3,000,000, and add a gene description column

Homework 5: Question 6: find functional genes

Segment 4.6: Retrieve all entrezgene identifiers and HUGO gene symbols of genes that have a “MAP kinase activity” GO term associated with them.

getBM(attributes = c('entrezgene','hgnc_symbol'), 
      filters = 'go_id', 
      values = 'GO:0004707', 
      mart = ensembl)

Your task: 2 pt
Do the same for the "Glycolysis" GO term

Homework 5: Question 7: getSequence

Segment 4.7: Given a set of EntrezGene identifiers, retrieve 100bp upstream promoter sequences

entrez=c("673","7157","837")
getSequence(id = entrez, 
            type="entrezgene",
            seqType="coding_gene_flank",
            upstream=100, 
            mart=ensembl) 

Your task: 2 pt
Now retrieve only the exons of these genes
Tip: read segment 4.7 …

Homework 5: Question 8: Retrieve SNP

Segment 4.10: Retrieve known SNPs located on the human chromosome 8 between positions 148350 and 148612

snpmart = useMart(biomart = "ENSEMBL_MART_SNP", dataset="hsapiens_snp")

getBM(attributes = c('refsnp_id','allele','chrom_start','chrom_strand'), 
      filters = c('chr_name','start','end'), 
      values = list(8,148350,148612), 
      mart = snpmart)

Your task: 3 pt
Retrieve all SNPs for Entrez Gene ID 3630
Tip: you might have to use getBM to get the chromosome, start, and end from ensembl first …

Homework 5: Question 9: find homolog

Segment 4.11: Given the human gene TP53, retrieve the human chromosomal location of this gene and also retrieve the chromosomal location and RefSeq id of its homolog in mouse

human = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
mouse = useMart("ensembl", dataset = "mmusculus_gene_ensembl")
getLDS(attributes = c("hgnc_symbol", "chromosome_name", "start_position"),
       filters = "hgnc_symbol", values = "TP53", mart = human,
       attributesL = c("refseq_mrna", "chromosome_name", "start_position"), martL = mouse)

Your task: 3 pt
Now find the homolog in rat
Tip: use listDatasets to find the rat dataset

How to submit homework 5

Please submit your Homework 5 in Rmd format so that I can run your code.

Happy Thanksgiving