1 Background

This tutorial outlines the transcriptomic RNA-seq analysis workflow module from my R + Bioconductor data analysis course (BIOS6660) at the Anschutz Medical Campus, University of Colorado Denver. We use a dataset from a collaborator, Dr. Eric Schmidt (PI), to investigate pulmonary endothelial glycocalyx recovery after sepsis. To learn more about RNA-seq technology, please refer to these two early publications (Wang 2011; Pepke 2009).

1.1 Analysis Goals

The PI is interested in identifying differentially expressed genes between CLP (cecal ligation and puncture) and Sham groups in a mouse model of sepsis.


2 Experimental Design

The dataset contains two comparison groups, CLP and Sham, with a sample size of 3 for each. Since RNA-seq datasets are typically very large, we extracted only Chromosome 19 reads from each sample using the seqtk tool, to ease the computational load on the students’ personal computers, which are mostly less powerful than server-class machines. The goal of this course is to enable students to perform data analysis using their own resources.

2.1 Experimental Descriptive File

First, we will read in the experimental descriptive data file, which lists the sample names and summarizes the grouping of each sample.

## Read in descriptive data file
targets = read.table('./data/targets.txt', header = T)
## printing file table
knitr::kable(targets)
FileName SampleName Factor Factor_long
CLP_lung_48h_rep1_chr19_read1.fastq CLP_lung_48h_rep1 CLP CLP_lung_48h_rep1
CLP_lung_48h_rep2_chr19_read1.fastq CLP_lung_48h_rep2 CLP CLP_lung_48h_rep2
CLP_lung_48h_rep3_chr19_read1.fastq CLP_lung_48h_rep3 CLP CLP_lung_48h_rep3
Sham_lung_48h_rep1_chr19_read1.fastq Sham_lung_48h_rep1 Sham Sham_lung_48h_rep1
Sham_lung_48h_rep2_chr19_read1.fastq Sham_lung_48h_rep2 Sham Sham_lung_48h_rep2
Sham_lung_48h_rep3_chr19_read1.fastq Sham_lung_48h_rep3 Sham Sham_lung_48h_rep3

We have a total of 6 FASTQ files and 2 experimental groups (CLP vs. Sham) from mouse samples.


3 Quality Control of FASTQ Files

3.1 FASTQ QC with seqTools

We use the Bioconductor package seqTools to assess the quality of the FASTQ files. Note that the seqTools package documentation includes a detailed description of the FASTQ format; check it out.

library(seqTools)
filenames = c(
  './data/CLP_lung_48h_rep1_chr19_read1.fastq',
  './data/CLP_lung_48h_rep2_chr19_read1.fastq',
  './data/CLP_lung_48h_rep3_chr19_read1.fastq',
  './data/Sham_lung_48h_rep1_chr19_read1.fastq',
  './data/Sham_lung_48h_rep2_chr19_read1.fastq',
  './data/Sham_lung_48h_rep3_chr19_read1.fastq'
  )
fq = fastqq(filenames = filenames, probeLabel = c(paste('CLP',1:3, sep = '-'), paste('Sham',4:6,sep='-')))
pdf(file = './qc/seqTools.fastq.qc.pdf')
plotPhredQuant(fq, 1)
plotPhredQuant(fq, 2)
plotPhredQuant(fq, 3)
plotPhredQuant(fq, 4)
plotPhredQuant(fq, 5)
plotPhredQuant(fq, 6)
dev.off()

plotMergedPhredQuant(fq, main = 'Phred Quantiles for all files')
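For reference, Phred quality scores in standard FASTQ files (Sanger/Illumina 1.8+ encoding) are stored as ASCII characters with an offset of 33. A minimal base-R sketch of the decoding, using a made-up quality string:

```r
## Decode a Phred+33 quality string into numeric scores (base R only)
qual.string = 'IIIIHGF#'             # made-up quality line from a FASTQ record
phred = utf8ToInt(qual.string) - 33  # ASCII code minus the Sanger offset of 33
phred                                # 40 40 40 40 39 38 37 2
error.prob = 10^(-phred/10)          # a Phred score Q means error probability 10^(-Q/10)
```

A Phred score of 40 thus corresponds to a 1-in-10,000 chance of a base-call error, while the `#` character (score 2) flags an essentially uninformative base.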

3.2 Non-R Approach 1: FASTQC

FastQC is a quality-assessment tool for FASTQ files; although it is best known as a Graphical User Interface (GUI) tool, it can also be run from the command line, as shown below.

fastq.list = c(
  './data/CLP_lung_48h_rep1_chr19_read1.fastq',
  './data/CLP_lung_48h_rep2_chr19_read1.fastq',
  './data/CLP_lung_48h_rep3_chr19_read1.fastq',
  './data/Sham_lung_48h_rep1_chr19_read1.fastq',
  './data/Sham_lung_48h_rep2_chr19_read1.fastq',
  './data/Sham_lung_48h_rep3_chr19_read1.fastq'
)
output.dir.o = '-o'
output.dir = './qc/'

for(i in seq_along(fastq.list)){
  args.cond = c(
    output.dir.o, output.dir,
    fastq.list[i]
  )
  system2('fastqc', args = args.cond, wait = FALSE)
}

3.3 Non-R Approach 2: RSeQC

RSeQC is an RNA-seq quality-control package written in Python. This Python package also includes many QC utilities for SAM/BAM files.


4 Mapping FASTQ to genome

4.1 Mapping with QuasR

We will use the QuasR Bioconductor package to map the FASTQ files onto the mouse genome and determine the genomic coordinates of the reads. QuasR is an extremely versatile NGS mapping and post-processing pipeline for RNA-seq. It uses Rbowtie for ungapped alignments and SpliceMap for spliced alignments.

Note: QuasR is an intelligent package; if it finds BAM files that already exist from previous runs, it will not regenerate them. To force a fresh run, delete everything in the “result” folder first.

Note: It is best to set eval=FALSE on this chunk before running knitr to avoid the long mapping process.

library(QuasR)
library(parallel) # provides makeCluster()
targets = read.table("./data/targets.txt", header = T)
write.table(targets[,1:2], 'data/QuasR_samples.txt', row.names=F, quote=F, sep='\t')
sampleFile = "./data/QuasR_samples.txt"
genomeFile = "./data/Mouse.chromosome.19.fa"
results = "./result"
cl = makeCluster(10)

## Single command to index reference, align all samples and generate BAM files
proj <- qAlign(sampleFile, genome=genomeFile, maxHits=1, splicedAlignment=T, alignmentsDir=results, clObj=cl, cacheDir=results)
# Note: splicedAlignment should be set to TRUE when the reads are >=50nt long  
alignstats <- alignmentStats(proj) # Alignment summary report
#knitr::kable(alignstats)

4.1.1 Alignment Summary

The following enumerates the number of reads in each FASTQ file and how many of them aligned to the reference. Note: the percentage of aligned reads is high (around 90%) in this example because mostly alignable reads were selected when generating the sample FASTQ files for this exercise. For QuasR this step can be omitted because the qAlign function generates this information automatically.

library(ShortRead); library(Rsamtools)
## Extract bam file names:
bam.filenames = proj@alignments$FileName

Nreads <- countLines(dirPath="./data/", pattern=".fastq$")/4
bfl <- BamFileList(bam.filenames, yieldSize=50000, index=character())
Nalign <- countBam(bfl, param=ScanBamParam(flag=scanBamFlag(isUnmappedQuery=F)))
(read_statsDF <- data.frame(FileName=names(Nreads), Nreads=Nreads, Nalign=Nalign$records, Perc_Aligned=Nalign$records/Nreads*100))
write.table(read_statsDF, "./result/read_statsDF.txt", row.names=FALSE, quote=FALSE, sep="\t")

## Remove rownames before display
rownames(read_statsDF) = c()
knitr::kable(read_statsDF)
FileName Nreads Nalign Perc_Aligned
CLP_lung_48h_rep1_chr19_read1.fastq 1290664 1168495 90.53441
CLP_lung_48h_rep2_chr19_read1.fastq 1324913 1200681 90.62338
CLP_lung_48h_rep3_chr19_read1.fastq 1467134 1323209 90.19006
Sham_lung_48h_rep1_chr19_read1.fastq 2094335 1905972 91.00607
Sham_lung_48h_rep2_chr19_read1.fastq 1666375 1507205 90.44813
Sham_lung_48h_rep3_chr19_read1.fastq 1955572 1783449 91.19833
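As a sanity check on the Nreads calculation above: each FASTQ record occupies exactly four lines (header, sequence, separator, quality), so dividing the total line count by 4 yields the read count. A minimal base-R illustration with a made-up two-read file:

```r
## A made-up FASTQ file with two records (4 lines per record)
fq.lines = c(
  '@read1', 'ACGT', '+', 'IIII',
  '@read2', 'TTGA', '+', 'IIHG'
)
n.reads = length(fq.lines)/4  # same logic as countLines(...)/4 above
n.reads                       # 2
```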

4.2 Mapping with bowtie2 using the terminal

For Mac and Linux users: if you are lucky enough to have the proper computational resources and have installed bowtie2 on your system, here is some code for running bowtie2 from R using the system2 function.

fastq.list = c(
  './data/CLP_lung_48h_rep1_chr19_read1.fastq',
  './data/CLP_lung_48h_rep2_chr19_read1.fastq',
  './data/CLP_lung_48h_rep3_chr19_read1.fastq',
  './data/Sham_lung_48h_rep1_chr19_read1.fastq',
  './data/Sham_lung_48h_rep2_chr19_read1.fastq',
  './data/Sham_lung_48h_rep3_chr19_read1.fastq'
)

for(i in 1:length(fastq.list)){
  
      cat('Doing ', fastq.list[i], ' now .............................\n\n')
  
      # Bowtie2 mapping ---------------------------------------------------------
      
      process.p = '-p'
      process.n = '10'
      genome.x = '-x'
      genome.path = '/Volumes/Tzu_Pegasus/Tzu_iMac_Documents/MyGenome/Mus_musculus_UCSC_mm9/Mus_musculus/UCSC/mm9/Sequence/Bowtie2Index/genome'
      input.U = '-U'
      input.file = fastq.list[i]
      output.S = '-S'
      if(length(grep('fastq.gz', input.file))){
        output.file = gsub('.fastq.gz', '.sam', input.file)
      }else if(length(grep('fastq', input.file))){
        output.file = gsub('.fastq', '.sam', input.file)
      }
      output.file = gsub('./data/', './bowtie2_results/', output.file)
      
      
      args.cond = c(
          process.p, process.n,
          genome.x, genome.path,
          input.U, input.file,
          output.S, output.file
        )
      
      cat('Running bowtie2 now ............................... \n')
      ptm = proc.time()
      system2('bowtie2', args = args.cond)
      time.elapsed = proc.time() - ptm
      cat('Elapsed time is ' , time.elapsed[3]/60, ' minutes\n')
      
      
      # samtools view
      samtools.comd = 'view'
      samtools.input.bS = '-bS'
      samtools.input = output.file
      samtools.direct = '>'
      samtools.output = gsub('.sam', '.bam', samtools.input)
      
      args.cond = c(
          samtools.comd,
          samtools.input.bS, samtools.input,
          samtools.direct,
          samtools.output
        )
      cat('Running samtools view now ............................... \n')
      ptm = proc.time()
      system2('samtools', args = args.cond)
      time.elapsed = proc.time() - ptm
      cat('Elapsed time is ' , time.elapsed[3]/60, ' minutes\n')
      
      # samtools sort
      samtools.comd = 'sort'
      samtools.input = samtools.output
      samtools.output = gsub('.bam', '_sorted', samtools.input)
      
      args.cond = c(
        samtools.comd,
        samtools.input,
        samtools.output
      )
      
      cat('Running samtools sort now ............................... \n')
      ptm = proc.time()
      system2('samtools', args = args.cond)
      time.elapsed = proc.time() - ptm
      cat('Elapsed time is ' , time.elapsed[3]/60, ' minutes\n')
      
      
      # samtools index
      samtools.comd = 'index'
      samtools.input = paste(samtools.output,'.bam', sep = '')
      
      args.cond = c(
        samtools.comd,  
        samtools.input
      )
      
      cat('Running samtools index now ............................... \n')
      ptm = proc.time()
      system2('samtools', args = args.cond)
      time.elapsed = proc.time() - ptm
      cat('Elapsed time is ' , time.elapsed[3]/60, ' minutes\n')
      
      
       
}

4.3 FASTQ Quality Report revisited by QuasR

The following shows how to create read-quality reports with QuasR’s qQCReport function, or with the custom seeFastq function from Dr. Girke’s group at UC Riverside.

qQCReport(proj, pdfFilename="./qc/qc_report.pdf")
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/fastqQuality.R")
myfiles <- paste0("data/", targets$FileName); names(myfiles) <- targets$SampleName
fqlist <- seeFastq(fastq=myfiles, batchsize=50000, klength=8)
pdf("./qc/fastqReport.pdf", height=18, width=4*length(myfiles)); seeFastqPlot(fqlist); dev.off()

5 Gene Read Count

5.1 Annotations of Transcriptome

Storing annotation ranges in TranscriptDb databases makes many operations more robust and convenient.

library(GenomicFeatures)
txdb <- makeTranscriptDbFromGFF(file="./data/genes.gtf",
                                format="gtf",
                                dataSource="mm9",
                                species="Mus musculus")
saveDb(txdb, file="./data/mouse_annotation.sqlite")
txdb <- loadDb("./data/mouse_annotation.sqlite") 
eByg <- exonsBy(txdb, by="gene")

5.2 Generate Read Count Table

The gene signal was obtained by overlapping the sequenced reads with the pre-defined gene region ranges. The resulting count table was further filtered to remove genes with zero counts.

samples <- as.character(targets$Factor_long)
samples = gsub('_lung_48h_','.',samples)
samplespath <- bam.filenames
names(samplespath) <- samples
countDF <- data.frame(row.names=names(eByg))
for(i in samplespath) {
  aligns <- readGAlignmentsFromBam(i) # Read the alignments from the BAM file
  seqnames(aligns) = rep('chr19', length(aligns)) # Relabel seqnames to match the annotation
  counts <- countOverlaps(eByg, aligns, ignore.strand=TRUE)
  countDF <- cbind(countDF, counts)
}


colnames(countDF) <- samples

## Remove row with all zeros
row.sum = rowSums(countDF)
chr19.countDF = countDF[row.sum != 0,]

knitr::kable(chr19.countDF[1:4,], align = 'c')
CLP.rep1 CLP.rep2 CLP.rep3 Sham.rep1 Sham.rep2 Sham.rep3
1110059E24Rik 2 0 1 1 1 3
1500015L24Rik 1 4 0 2 2 5
1500017E21Rik 0 1 0 2 0 1
1700001K23Rik 40 21 33 34 28 29
write.table(chr19.countDF, "./result/chr19_countDF", quote=FALSE, sep="\t", col.names = NA)
countDF <- read.table("./result/chr19_countDF")
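The counting step above relies on Bioconductor’s countOverlaps; its core logic can be illustrated in base R with made-up gene intervals and read positions (a simplification that ignores read length, strand, and exon structure):

```r
## Made-up gene intervals on one chromosome
gene.start = c(geneA = 100, geneB = 500, geneC = 900)
gene.end   = c(geneA = 200, geneB = 650, geneC = 950)

## Made-up read start positions
read.pos = c(120, 150, 510, 940, 300)

## Count reads falling inside each gene interval
counts = sapply(seq_along(gene.start), function(g)
  sum(read.pos >= gene.start[g] & read.pos <= gene.end[g]))
names(counts) = names(gene.start)
counts  # geneA: 2, geneB: 1, geneC: 1 (the read at 300 hits no gene)
```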

5.3 Simple RPKM Normalization

RPKM: reads per kilobase of exon model per million mapped reads

returnRPKM <- function(counts, gffsub) {
  geneLengthsInKB <- sum(width(reduce(gffsub)))/1000 # Length of exon union per gene in kbp
  millionsMapped <- sum(counts)/1e+06 # Factor for converting to millions of mapped reads
  rpm <- counts/millionsMapped # RPM: reads per million mapped reads
  rpkm <- rpm/geneLengthsInKB # RPKM: reads per kilobase of exon model per million mapped reads
  return(rpkm)
}
countDFrpkm <- apply(countDF, 2, function(x) returnRPKM(counts=x, gffsub=eByg[rownames(countDF)]))

knitr::kable(countDFrpkm[1:4,], align = 'c')
CLP.rep1 CLP.rep2 CLP.rep3 Sham.rep1 Sham.rep2 Sham.rep3
1110059E24Rik 1.779027 0.000000 0.84371 0.5254325 0.6985178 1.6985821
1500015L24Rik 1.632680 6.627559 0.00000 1.9288369 2.5642245 5.1961767
1500017E21Rik 0.000000 0.484245 0.00000 0.5637247 0.0000000 0.3037284
1700001K23Rik 104.116816 55.471850 81.47334 52.2762243 57.2326512 48.0475975
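To sanity-check the RPKM arithmetic, here is a worked toy example with made-up counts and gene lengths (plain numerics standing in for the GRangesList used above):

```r
## Made-up data: 1,000 reads for a 2 kb gene in a 10-million-read library
counts = c(gene1 = 1000, gene2 = 9999000)  # gene2 absorbs the rest of the library
gene.length.kb = c(gene1 = 2, gene2 = 50)
millions.mapped = sum(counts)/1e6          # 10 million mapped reads
rpm  = counts/millions.mapped              # reads per million mapped reads
rpkm = rpm/gene.length.kb                  # reads per kilobase per million
rpkm['gene1']                              # 1000/10/2 = 50
```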

5.3.1 Hierarchical Clustering

QC check of sample reproducibility by computing a correlation matrix and plotting it as a tree. Note: the plotMDS function from edgeR is a more robust method for this task.

library(ape) 
d <- cor(countDFrpkm, method="spearman")
hc <- hclust(dist(1-d))
plot.phylo(as.phylo(hc), type="p", edge.col=4, edge.width=3, show.node.label=TRUE, no.margin=TRUE)

As expected, the CLP samples and the Sham samples each cluster together.

6 Identification of DEGs with various methods

6.1 Identify DEGs with Simple Fold Change Method

## Compute mean values for replicates
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/colAg.R")
countDFrpkm_mean <- colAg(myMA=countDFrpkm, group=c(1,1,1,2,2,2), myfct=mean)
knitr::kable(countDFrpkm_mean[1:4,])
CLP.rep1_CLP.rep2_CLP.rep3 Sham.rep1_Sham.rep2_Sham.rep3
1110059E24Rik 0.8742456 0.9741774
1500015L24Rik 2.7534129 3.2297461
1500017E21Rik 0.1614150 0.2891510
1700001K23Rik 80.3540010 52.5188243
## Log2 fold change
countDFrpkm_mean <- cbind(countDFrpkm_mean, log2ratio=log2(countDFrpkm_mean[,2]/countDFrpkm_mean[,1]))
countDFrpkm_mean <- countDFrpkm_mean[is.finite(countDFrpkm_mean[,3]), ]
degs2fold <- countDFrpkm_mean[countDFrpkm_mean[,3] >= 1 | countDFrpkm_mean[,3] <= -1,]
knitr::kable(degs2fold[1:4,], align = 'c')
CLP.rep1_CLP.rep2_CLP.rep3 Sham.rep1_Sham.rep2_Sham.rep3 log2ratio
1700019N19Rik 0.4351094 1.6436593 1.917461
4930524O05Rik 3.5604664 17.6887380 2.312693
5730408K05Rik 3.2674506 7.6869962 1.234255
5830416P10Rik 0.6330198 0.1256593 -2.332733
write.table(degs2fold, "./result/degs2fold.txt", quote=FALSE, sep="\t", col.names = NA)
degs2fold <- read.table("./result/degs2fold.txt")
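As a quick illustration of the cutoff used above: a log2 ratio of ±1 corresponds to a two-fold change in either direction. A toy calculation with made-up mean RPKM values:

```r
## Made-up mean RPKM values for two groups
clp.mean  = c(geneX = 8, geneY = 5, geneZ = 2)
sham.mean = c(geneX = 2, geneY = 5, geneZ = 4)
log2ratio = log2(sham.mean/clp.mean)
log2ratio            # geneX: -2 (4-fold down), geneY: 0, geneZ: 1 (2-fold up)
abs(log2ratio) >= 1  # geneX and geneZ pass the 2-fold filter
```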

6.2 Identify DEGs with DESeq Library

library(DESeq)
countDF <- read.table("./result/chr19_countDF")
conds <- targets$Factor
cds <- newCountDataSet(countDF, conds) # Creates object of class CountDataSet derived from eSet class
knitr::kable(counts(cds)[1:4, ]) # CountDataSet has similar accessor methods as eSet class.
CLP.rep1 CLP.rep2 CLP.rep3 Sham.rep1 Sham.rep2 Sham.rep3
1110059E24Rik 2 0 1 1 1 3
1500015L24Rik 1 4 0 2 2 5
1500017E21Rik 0 1 0 2 0 1
1700001K23Rik 40 21 33 34 28 29
cds <- estimateSizeFactors(cds) # Estimates library size factors from count data. Alternatively, one can provide here the true library sizes with sizeFactors(cds) <- c(..., ...)
cds <- estimateDispersions(cds) # Estimates the variance within replicates
res <- nbinomTest(cds, "CLP", "Sham") # Calls DEGs with nbinomTest
res <- na.omit(res)
res2fold <- res[res$log2FoldChange >= 1 | res$log2FoldChange <= -1,]
res2foldpadj <- res2fold[res2fold$padj <= 0.05, ]
knitr::kable(res2foldpadj[1:4,1:8])
id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
43 Acta2 287.77458 384.673260 190.87591 0.4962027 -1.010998 0.0000000 0.0000000
101 Carns1 119.19376 75.208883 163.17865 2.1696725 1.117477 0.0000000 0.0000001
111 Cd5 11.23337 4.651678 17.81507 3.8298153 1.937275 0.0007125 0.0088632
134 Cpt1a 3664.99583 5109.789409 2220.20224 0.4344998 -1.202573 0.0000000 0.0000000

6.3 Identify DEGs with edgeR’s Exact Method

library(edgeR)
countDF <- read.table("./result/chr19_countDF")
y <- DGEList(counts=countDF, group=conds) # Constructs DGEList object
y <- estimateCommonDisp(y) # Estimates common dispersion
y <- estimateTagwiseDisp(y) # Estimates tagwise dispersion
et <- exactTest(y, pair=c("CLP", "Sham")) # Computes exact test for the negative binomial distribution.
topTags(et, n=4)
## Comparison of groups:  Sham-CLP 
##              logFC    logCPM       PValue          FDR
## Slc22a12 -3.420277  5.888933 1.705319e-36 8.697127e-34
## Ms4a18   -2.037469  7.191865 6.677911e-32 1.702867e-29
## Ms4a15   -1.789720  7.243905 2.055271e-24 3.493961e-22
## Cpt1a    -1.465574 11.952793 8.843267e-24 1.127517e-21
edge <- as.data.frame(topTags(et, n=50000)) 
edge2fold <- edge[edge$logFC >= 1 | edge$logFC <= -1,]
edge2foldpadj <- edge2fold[edge2fold$FDR <= 0.01, ]

6.4 Identify DEGs with edgeR’s GLM Approach

library(edgeR)
countDF <- read.table("./result/chr19_countDF")
y <- DGEList(counts=countDF, group=conds) # Constructs DGEList object
## Filtering and normalization
keep <- rowSums(cpm(y)>1) >= 2; y <- y[keep, ]
y <- calcNormFactors(y)
design <- model.matrix(~0+group, data=y$samples); colnames(design) <- levels(y$samples$group) # Design matrix
## Estimate dispersion
y <- estimateGLMCommonDisp(y, design, verbose=TRUE) # Estimates common dispersions
## Disp = 0.00727 , BCV = 0.0853
y <- estimateGLMTrendedDisp(y, design) # Estimates trended dispersions
y <- estimateGLMTagwiseDisp(y, design) # Estimates tagwise dispersions 
## Fit the negative binomial GLM for each tag
fit <- glmFit(y, design) # Returns an object of class DGEGLM
contrasts <- makeContrasts(contrasts="CLP-Sham", levels=design) # Contrast matrix is optional
lrt <- glmLRT(fit, contrast=contrasts[,1]) # Takes DGEGLM object and carries out the likelihood ratio test. 
edgeglm <- as.data.frame(topTags(lrt, n=length(rownames(y))))
## Filter on fold change and FDR
edgeglm2fold <- edgeglm[edgeglm$logFC >= 1 | edgeglm$logFC <= -1,]
edgeglm2foldpadj <- edgeglm2fold[edgeglm2fold$FDR <= 0.01, ]
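The design matrix built above with model.matrix(~0+group) simply produces one indicator column per group. A base-R-only toy mirroring this experiment’s grouping (no edgeR object needed):

```r
## Toy grouping: 3 CLP and 3 Sham samples, as in this experiment
group = factor(c('CLP', 'CLP', 'CLP', 'Sham', 'Sham', 'Sham'))
design = model.matrix(~0 + group)   # one 0/1 indicator column per group level
colnames(design) = levels(group)
design  # each sample has a 1 in its own group's column and 0 elsewhere
```

The CLP-Sham contrast then tests the difference between the coefficients fitted for these two columns.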

6.5 Comparison Among DEG Results

source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/overLapper.R")
setlist <- list(edgeRexact=rownames(edge2foldpadj), edgeRglm=rownames(edgeglm2foldpadj), DESeq=as.character(res2foldpadj[,1]), RPKM=rownames(degs2fold))
OLlist <- overLapper(setlist=setlist, sep="_", type="vennsets")
counts <- sapply(OLlist$Venn_List, length)
vennPlot(counts=counts, mymain="DEG Comparison")

Number of common genes among all 4 methods: 14
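The Venn comparison above boils down to set operations on gene-ID vectors; the cross-method intersection can be reproduced with base R alone. A sketch with made-up DEG ID sets:

```r
## Made-up DEG ID sets from four methods
setlist = list(
  edgeRexact = c('Acta2', 'Cpt1a', 'Cd5', 'Ms4a18'),
  edgeRglm   = c('Acta2', 'Cpt1a', 'Ms4a18'),
  DESeq      = c('Acta2', 'Cpt1a', 'Carns1'),
  RPKM       = c('Acta2', 'Slc22a12')
)
## Genes called by all four methods
common = Reduce(intersect, setlist)
common  # "Acta2"
```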

6.6 Heatmap of Top Ranking DEGs

library(lattice); library(gplots)
y <- countDFrpkm[rownames(edgeglm2foldpadj)[1:15],]
colnames(y) <- targets$Factor
y <- t(scale(t(as.matrix(y))))
y <- y[order(y[,1]),]
levelplot(t(y), height=0.2, col.regions=colorpanel(40, "darkblue", "yellow", "white"), main="Expression Values (DEG Filter: FDR 1%, FC > 2)", colorkey=list(space="top"), xlab="", ylab="Gene ID")
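The t(scale(t(...))) idiom above standardizes each gene (row) to mean 0 and standard deviation 1 across samples, so the heatmap colors reflect relative expression per gene rather than absolute magnitude. A toy check in base R with a made-up expression matrix:

```r
## Made-up 2-gene x 3-sample expression matrix
m = matrix(c(10, 20, 30,
              5,  5,  8), nrow = 2, byrow = TRUE,
           dimnames = list(c('geneA', 'geneB'), c('s1', 's2', 's3')))
z = t(scale(t(m)))  # scale() works column-wise, so transpose to scale rows
rowMeans(z)         # ~0 for every gene
apply(z, 1, sd)     # 1 for every gene
```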

References

Pepke, Shirley, Barbara Wold, and Ali Mortazavi. 2009. “Computation for ChIP-Seq and RNA-Seq Studies.” Nat Methods 6 (11): S22–32. doi:10.1038/nmeth.1371. http://www.nature.com/nmeth/journal/v6/n11s/full/nmeth.1371.html.

Martin, Jeffrey A., and Zhong Wang. 2011. “Next-Generation Transcriptome Assembly.” Nat Rev Genet 12 (10): 671–82. doi:10.1038/nrg3068. http://www.nature.com/nrg/journal/v12/n10/full/nrg3068.html.