The Center Dogma

alt text

Lecture Sypnopsis

This lecture focuses on how to store different genomic information using Bioconductor objects (as in Object Orientated Programming). We will begin by learning how to store simple biological bases in Biostrings package; DNA, RNA, and protein, and how to use this fundamanetal data structure to build a genome using BSgenome pacakge. We then switch to learning how to represent genomic interval using the GenomicRanges pacakge, and end with learning how to store biological annotation information for transcript using the GenomicFeatures pacakge.

Molecular Bases using Biostrings

At its simplest form, genetic sequence are represented in nucleotides; DNA and RNA. Bioconductor stores the sequence information in a class called XString which consistes of 4 subclass:

subclass BString

BString is a generic container that can be used to store any characters

library(Biostrings)

a = BString("I am Tzu")

## In the spirit of OOP, we can perfrom various function on the BString object

## Return the length or number of element in the object
length(a)
## [1] 8
## Subsetting different elements of the object
a[1:4]
##   4-letter "BString" instance
## seq: I am
## or
subseq(a, 1, 4)
##   4-letter "BString" instance
## seq: I am
## Reverting the sequence/string: very common because of DNA complementary
rev(a)
##   8-letter "BString" instance
## seq: uzT ma I
## Instead of storing the string in BString object, we can simply convert it back to a Character object
toString(a)
## [1] "I am Tzu"
## Just to ensure of their class definiation:
class(a)
## [1] "BString"
## attr(,"package")
## [1] "Biostrings"
class(toString(a))
## [1] "character"

Exercise:
What is the opposite of “stressed”? Use BString to get the answer.

subclass DNAString and RNAString

Biostrings comes with a few global variables for commonly used base definitions. For example, the International Union of Pure and Applied Chemistry IUPAC denote different alphabets to represent various combinations of genetic bases.

IUPAC_CODE_MAP
##      A      C      G      T      M      R      W      S      Y      K 
##    "A"    "C"    "G"    "T"   "AC"   "AG"   "AT"   "CG"   "CT"   "GT" 
##      V      H      D      B      N 
##  "ACG"  "ACT"  "AGT"  "CGT" "ACGT"
DNA_ALPHABET
##  [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+"
## [18] "."
DNA_BASES
## [1] "A" "C" "G" "T"
RNA_ALPHABET
##  [1] "A" "C" "G" "U" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+"
## [18] "."
RNA_BASES
## [1] "A" "C" "G" "U"

The different between BString and DNAString or RNAString is what is allowed to be stored.

## DNAString will not take anything other then DNA alphabet
(a = DNAString('I am a string'))

## Error in .Call2("new_XString_from_CHARACTER", classname, x, start(solved_SEW),  : 
##  key 73 (char 'I') not in lookup table
## Only store the single representation of DNA bases
(a = DNAString('ATTGCC'))
##   6-letter "DNAString" instance
## seq: ATTGCC

If you accidentally use RNAString object to store DNA sequence, an error will occur …

## How about storing the sequence in RNAString object
(b = RNAString('ATTGCC'))

## Error in .Call2("new_XString_from_CHARACTER", classname, x, start(solved_SEW),  : 
## key 84 (char 'T') not in lookup table

Use RNAString to store RNA sequences

(b=RNAString("AUUGCC"))
##   6-letter "RNAString" instance
## seq: AUUGCC

We can apply simple functions onto the objects

## Simple frequency counting
alphabetFrequency(a)
## A C G T M R W S Y K V H D B N - + . 
## 1 2 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Only count single letter base
alphabetFrequency(a, baseOnly = TRUE)
##     A     C     G     T other 
##     1     2     1     2     0
## We can also count a single element
letterFrequency(a, "C")
## C 
## 2
## Or a combination of letters
letterFrequency(a, "CG")
## C|G 
##   3

Exercise:
What is the significant of “CG” occurances? Hint: methylation …

## Perform complements
a
##   6-letter "DNAString" instance
## seq: ATTGCC
complement(a)
##   6-letter "DNAString" instance
## seq: TAACGG
reverseComplement(a)
##   6-letter "DNAString" instance
## seq: GGCAAT

String Matching and Alignment

There are 4 matching or alignment function in BioStrings:

  • Finding occurances of a given pattern:
    • matchPattern
    • countPattern
    • vmatchPattern
    • vcountPattern
  • Matching a dictionary of patterns against a reference:
    • matchPDict, countPDict
  • Matching/counting with position Weight Matrix (PWM): (Not the scope of this lecture)
    • matchPWM, countPWM, PWMscoresStartingAt
  • Global/local alignment: (Not the scope of this lecture)
    • pairwiseAlignment, stringDist

matchPattern

matchPattern is a fast implementation to find occurances of a given pattern in a larger sequence, that allow mismatch and insertion/deletions (indels)

## Give a subject sequence
a=DNAString("ACGTACGTACGC")

## Find "CGT" within the subject sequece a
matchPattern('CGT', a)
##   Views on a 12-letter DNAString subject
## subject: ACGTACGTACGC
## views:
##     start end width
## [1]     2   4     3 [CGT]
## [2]     6   8     3 [CGT]
## We can also build in mismatch parameter
matchPattern('CGT', a, max.mismatch = 1)
##   Views on a 12-letter DNAString subject
## subject: ACGTACGTACGC
## views:
##     start end width
## [1]     2   4     3 [CGT]
## [2]     6   8     3 [CGT]
## [3]    10  12     3 [CGC]
## To retrieve all the starting position
m = matchPattern('CGT', a, max.mismatch = 1)
start(m)
## [1]  2  6 10
## To retrieve all the end position
end(m)
## [1]  4  8 12
## To retrieve how many matches
length(m)
## [1] 3
## Or, we can use "countPattern" to return the count directly
countPattern('CGT', a, max.mismatch = 1)
## [1] 3

matchPDict

matchPattern can only match one pattern at a time, to match more then one patterns, use matchPDict

## Given a subject sequecne to be matched against
a=DNAString("ACGTACGTACGC")

## To match more then one pattern, we need to build a "dictionary" using the PDict function
dict0=PDict(c("CGT","ACG"))

## Match results will be stored in a "list" object
mm=matchPDict(dict0, a)

## Let look at each element in the resulting "list"
mm[[1]]
## IRanges of length 2
##     start end width
## [1]     2   4     3
## [2]     6   8     3
mm[[2]]
## IRanges of length 3
##     start end width
## [1]     1   3     3
## [2]     5   7     3
## [3]     9  11     3

XStringViews vs. XStringSet

  • XStringViews is an object of multiple strings that are derived from a “mother” string, and put into a string “view”. It contains multiple “views” (start/end locations) of the same string
  • XStringSet is an instance of the string – more operation can be done …

XStringViews

## Given a DNA string
a=DNAString("ACGTACGTACTC")

## Create a StringView object
a2=Views(a, start=c(1,5,8), end=c(3,8,12))

## Displaying "a2" shows that it contains 3 elements
a2
##   Views on a 12-letter DNAString subject
## subject: ACGTACGTACTC
## views:
##     start end width
## [1]     1   3     3 [ACG]
## [2]     5   8     4 [ACGT]
## [3]     8  12     5 [TACTC]
## However, when using subject() to identify it object identify, it is actually a DNA string.  Therefore, the 3 elements we saw are actually a derivative of the original DNA string
subject(a2)
##   12-letter "DNAString" instance
## seq: ACGTACGTACTC
## There are a few operations can be used in the XStringViews object
length(a2)
## [1] 3
start(a2)
## [1] 1 5 8
end(a2)
## [1]  3  8 12
alphabetFrequency(a2, baseOnly=T)
##      A C G T other
## [1,] 1 1 1 0     0
## [2,] 1 1 1 1     0
## [3,] 1 2 0 2     0
## Find similar string
a2 == DNAString('ACGT')
## [1] FALSE  TRUE FALSE
## Of course, we can always convert it into a character object
toString(a2)
## [1] "ACG, ACGT, TACTC"

XStringSet

Recall that XStringSet, such as DNAStringSet is actually an instance of the object (Not a derivative)

## Given a subject DNA string
a=DNAString("ACGTACGTACTC")

## Build a DNAStringSet of 3 smaller sequences
a2=DNAStringSet(a, start=c(1,5,9), end=c(4,8,12))

## Displaying "a2" shows that it is an "DNAStringSet" object
a2
##   A DNAStringSet instance of length 3
##     width seq
## [1]     4 ACGT
## [2]     4 ACGT
## [3]     4 ACTC
## Looking at its indivudual element shwos that it is built from the fundamental "DNAString" object
a2[[1]]
##   4-letter "DNAString" instance
## seq: ACGT
## There are many similar operations can be done to both "View" and "Set"
alphabetFrequency(a2, baseOnly = T)
##      A C G T other
## [1,] 1 1 1 1     0
## [2,] 1 1 1 1     0
## [3,] 1 2 0 1     0
length(a2)
## [1] 3
## This is where the similarity ends.  

## Set Operations can only be done using "XStringSet" objects
a1 = DNAStringSet(a, start=c(1,5,9), end=c(4,8,12))

a1  
##   A DNAStringSet instance of length 3
##     width seq
## [1]     4 ACGT
## [2]     4 ACGT
## [3]     4 ACTC
unique(a1)
##   A DNAStringSet instance of length 2
##     width seq
## [1]     4 ACGT
## [2]     4 ACTC
a2=Views(a, start=c(1,5,9), end=c(4,8,12))

unique(a2)
# Error in base::match(x, table, nomatch = nomatch, incomparables = incomparables,  : 
#  'match' requires vector arguments

XStringSet: set Operations

a1=DNAStringSet(a, start=c(1,9), end=c(4,12))
a1
##   A DNAStringSet instance of length 2
##     width seq
## [1]     4 ACGT
## [2]     4 ACTC
a2=DNAStringSet(a, start=c(1), end=c(4))
a2
##   A DNAStringSet instance of length 1
##     width seq
## [1]     4 ACGT
setdiff(a1,a2)
##   A DNAStringSet instance of length 1
##     width seq
## [1]     4 ACTC
union(a1,a2)
##   A DNAStringSet instance of length 2
##     width seq
## [1]     4 ACGT
## [2]     4 ACTC

Exercise:
Try to see if the set operation works on Views objects??

Question: So what is Views object good for? Why don’t we just use Set object?

XStringSet: use vmatchPatter to match with multiple strings

a=DNAString("ACGTACGTACTC")
a2=DNAStringSet(a, start=c(1,5,9), end=c(4,8,12))
vv=vmatchPattern("CG", a2)
vv
## MIndex object of length 3
## [[1]]
## IRanges of length 1
##     start end width
## [1]     2   3     2
## 
## [[2]]
## IRanges of length 1
##     start end width
## [1]     2   3     2
## 
## [[3]]
## IRanges of length 0

Storing the Whole Genome using BSgenome

library(BSgenome)
## TO see what genome is available
available.genomes()
length(available.genomes())

Install the Celegans Genome using biocLite()

## Let's install the Celegans Genome
## This is the easiest way to install BioConductor packages ...
source("http://bioconductor.org/biocLite.R")
biocLite('BSgenome.Celegans.UCSC.ce10')

Loading the genome

library(BSgenome.Celegans.UCSC.ce10)

ls('package:BSgenome.Celegans.UCSC.ce10')
## [1] "BSgenome.Celegans.UCSC.ce10" "Celegans"
Celegans
## Worm genome:
## # organism: Caenorhabditis elegans (Worm)
## # provider: UCSC
## # provider version: ce10
## # release date: Oct. 2010
## # release name: WormBase v. WS220
## # 7 sequences:
## #   chrI   chrII  chrIII chrIV  chrV   chrX   chrM                       
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[['
## # operator to access a given sequence)

Basic operations on the genome

names(Celegans)
## [1] "chrI"   "chrII"  "chrIII" "chrIV"  "chrV"   "chrX"   "chrM"
Celegans$chrI
##   15072423-letter "DNAString" instance
## seq: GCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCT...GGCTTAGGCTTAGGCTTAGGTTTAGGCTTAGGC
## Data are not loaded until accessed

seqnames(Celegans)
## [1] "chrI"   "chrII"  "chrIII" "chrIV"  "chrV"   "chrX"   "chrM"
seqlengths(Celegans)
##     chrI    chrII   chrIII    chrIV     chrV     chrX     chrM 
## 15072423 15279345 13783700 17493793 20924149 17718866    13794
## What is the frequency of each base on chromosome I
alphabetFrequency(Celegans$chrI, baseOnly=T)
##       A       C       G       T   other 
## 4835938 2695881 2692152 4848452       0
## What is the percentage of each base on chromosome I
alphabetFrequency(Celegans$chrI, baseOnly=T)/length(Celegans$chrI)
##         A         C         G         T     other 
## 0.3208468 0.1788618 0.1786144 0.3216770 0.0000000
## Use "matchPattern" to find "CG" island on chromosome I
mm = matchPattern("CG", Celegans$chrI)

length(mm)
## [1] 503521
## "XStringView" object was called to see matched results -- save space ...
mm
##   Views on a 15072423-letter DNAString subject
## subject: GCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGC...CTTAGGCTTAGGCTTAGGTTTAGGCTTAGGC
## views:
##             start      end width
##      [1]      492      493     2 [CG]
##      [2]      526      527     2 [CG]
##      [3]      869      870     2 [CG]
##      [4]     1006     1007     2 [CG]
##      [5]     1027     1028     2 [CG]
##      ...      ...      ...   ... ...
## [503517] 15072170 15072171     2 [CG]
## [503518] 15072173 15072174     2 [CG]
## [503519] 15072178 15072179     2 [CG]
## [503520] 15072205 15072206     2 [CG]
## [503521] 15072245 15072246     2 [CG]