Homework 3

In this homework, you are being tested on these elements you have learned about R:

Your Assignment

To generate this table:

… and this plot:

Bioconductor re-introduction

Good Artist Copy, Great Artist Steal

source

To become a great analyst, you must start from imitating other people’s code. For this assignment, we will use this Bioconductor workflow (click here). This is a workflow for microarray data analysis. Don’t worry too much about the details for now. The goal of this exercise is to learn how others do this, and imitate the process.

Bioconductor is a repository for genetic/genomic R library/packages. A library is designed to:

In order to use any of the library, once must run these code in a R console:

## Load the code for function `biocLite.R`
source("https://bioconductor.org/biocLite.R")

## So that you can use the function like such ...
biocLite()

This exercise also required you to download the arrays file from the website (please download the compressed package based on your computer OS), and install the package as shown here:

install.packages('arrays_0.1.tgz', 
                repos = NULL, 
                type = 'source')

arrays workflow step by step

load library

Note: please use these codes to install the libraries if you have not done that already:

Note: these libraries might rely on other libraries. Luckily R will take care of all the dependencies during the installation process.


biocLite('affy')
biocLite('limma')

Load the library; affy and limma

## Load packages
library(affy)   # Affymetrix pre-processing
library(limma)  # two-color pre-processing; differential
                  # expression
library(arrays)

Import phenotype data

This set of code will retrieve the phenotype information for the experiment from the arrays pacakge you downloaded earlier. It uses the system.file( ) function to pull down the information from the package="arrays"

## import "phenotype" data, describing the experimental design
phenoData <- 
    read.AnnotatedDataFrame(system.file("extdata", "pdata.txt",
    package="arrays"))

RMA normalization

Data normalizaton is very important to remove any systematic variation between samples. For example, sometimes the samples might not have be ran on the same day, and the batch effect could error on different days where the experiments were ran. Normalizaton will take care of this error, and make all samples back to the level plain field and camparable between each others. RMA is a very popular normalizaton methold and it being implemented on the following codes:

## RMA normalization
celfiles <- system.file("extdata", package="arrays")
eset <- justRMA(phenoData=phenoData,
    celfile.path=celfiles)

Differential expression

The goal of microarray data analysis is to find differentially expressed genes between different conditions; such as WT vs. Treatment. This following codes implement the statistical anlaysis using the limma package.

Note the final code; topTable(efit, coef=2) where it returns the top 10 most significant genes.

combn <- factor(paste(pData(phenoData)[,1],
    pData(phenoData)[,2], sep = "_"))
design <- model.matrix(~combn) # describe model to be fit

fit <- lmFit(eset, design)  # fit each probeset to model
efit <- eBayes(fit)        # empirical Bayes adjustment
topTable(efit, coef=2)      # table of differentially expressed probesets

Your Assignment

Create a final table for your client

Pick up significant genes

Select gene with P value < 0.05

  • First, use topTable to extract 3000 genes (use the number argument)
    • Use ?topTable to learn more about how to do this …
  • Use the column P.Value to select genes at cutoff of 0.05
  • This is the topTable box on the diagram

ggplot2 Volcano plot

In this exercise, we will plot the Volcano plot

Go to this link (click here) to learn more about how to plot a volcano plot

  • Scroll to Using ggplot2 for volcano plots section
  • Use logFC for x-axis and P.Value for y-axis (don’t forget to negative logged the y-axis)
  • Use adj.P.Val to set the color threshold
  • Set xlim(c(-5,5))