Lecture Synopsis

Today we will learn about R objects and functions, and ways to navigate and manipulate them.

For beginner programmers, please visit the TryR site to brush up on your R programming skills and/or watch Roger Peng's Coursera Week 1 videos. Then let's practice what you have learned.


Objects and Functions

R is all about objects and functions. A function call is always written with parentheses (). The first function you will see is sample(), which randomly draws elements from a vector.


Topic: Vectors

Counting DNA Nucleotides

Problem: Given a string over the nucleotide alphabet (A, C, G, T), count the occurrences of each base using vectors as your programming tool.

#*#############
#*########
## Code Chunk 1

## Randomly select 100 base characters
dna.base = sample(c('A','C','G','T'), 100, replace = TRUE)

## To return the data structure of an object:
str(dna.base)
##  chr [1:100] "T" "T" "A" "C" "G" "A" "A" "G" "G" ...
## To return the data type of an object
mode(dna.base)
## [1] "character"
## To return the data class of an object
class(dna.base)
## [1] "character"
## You can also ask yes/no questions about an object with the is.*() functions
is.vector(dna.base)
## [1] TRUE
## The table() function is extremely useful for counting the occurrences of each base
table(dna.base)
## dna.base
##  A  C  G  T 
## 31 19 32 18
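
The problem statement starts from a string, while Code Chunk 1 samples a vector directly. A minimal sketch of the string case follows, assuming a made-up input sequence (`dna.string` is hypothetical and not part of the lecture data). It splits the string into single characters and fixes the factor levels so that a base with zero occurrences is still reported.

```r
## Hypothetical input: a short DNA string (not from the lecture data)
dna.string = "AGCTTTTCATTCTGACTGCA"

## Split the string into a character vector of single bases
dna.vec = strsplit(dna.string, split = "")[[1]]

## Fix the levels so every base is reported, even with a count of zero
dna.counts = table(factor(dna.vec, levels = c("A", "C", "G", "T")))
dna.counts
```

strsplit() returns a list with one element per input string, which is why `[[1]]` is needed to get at the character vector.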


Topics: Matrix

Import a matrix

Problem: Given a mock microarray data file with the P-value in the final column, import it and log2-transform the data.

Create a folder called data within your project directory and download the dataset mock.microarray into that folder.

my project
  |____ data

#*#############
#*########
## Code Chunk 2


## Import mock microarray data
mock.microarray = read.table(file = './data/mock.microarray.txt')

## Let's look at what we have imported ...

head(mock.microarray)
##            Norm.1     Norm.2      Norm.3      Exp.1       Exp.2      Exp.3
## Gene-1  0.1074007 -0.7526990 -0.79300541 -0.7086270 -1.19897416 -0.3091510
## Gene-2 -0.1565274  1.0138777 -0.03723597 -1.5371134 -0.38818861  0.9436137
## Gene-3  0.9296591 -0.6382972  1.29158738 -0.7677627  0.80148739  1.4603099
## Gene-4  2.2465670 -2.1446552 -0.35200441  1.7099191 -0.39358089 -2.2632457
## Gene-5  0.0402311  0.2889409  1.11138109  0.1582708  0.04779674 -0.8960538
## Gene-6  0.3164832 -0.8551818  1.04832793  0.2914892  1.32650892  1.1784649
##        P.value
## Gene-1  0.2106
## Gene-2  0.7905
## Gene-3  0.2263
## Gene-4  0.2536
## Gene-5  0.0458
## Gene-6  0.6628
## Store data and p-value in separate variables
mock.data = mock.microarray[, 1:6]
mock.p = mock.microarray$P.value

## Log2 all the data
log.mock.data = log2(mock.data)

## Examine the logged data
head(log.mock.data)
##            Norm.1      Norm.2     Norm.3     Exp.1      Exp.2       Exp.3
## Gene-1 -3.2189248         NaN        NaN       NaN        NaN         NaN
## Gene-2        NaN  0.01988361        NaN       NaN        NaN -0.08373176
## Gene-3 -0.1052263         NaN 0.36914525       NaN -0.3192483  0.54627452
## Gene-4  1.1677221         NaN        NaN  0.773928        NaN         NaN
## Gene-5 -4.6355449 -1.79115348 0.15235360 -2.659533 -4.3869438         NaN
## Gene-6 -1.6597991         NaN 0.06809008 -1.778486  0.4076344  0.23690875

The NaN values appear because the logarithm of a negative number is undefined. To remedy this, we will change all negative values to the small constant 0.001.

#*#############
#*########
## Code Chunk 3


## Change negative values to 0.001
mock.data[mock.data < 0] = 0.001
log.mock.data = log2(mock.data)

## Examine the logged data again
head(log.mock.data)
##            Norm.1      Norm.2      Norm.3     Exp.1      Exp.2       Exp.3
## Gene-1 -3.2189248 -9.96578428 -9.96578428 -9.965784 -9.9657843 -9.96578428
## Gene-2 -9.9657843  0.01988361 -9.96578428 -9.965784 -9.9657843 -0.08373176
## Gene-3 -0.1052263 -9.96578428  0.36914525 -9.965784 -0.3192483  0.54627452
## Gene-4  1.1677221 -9.96578428 -9.96578428  0.773928 -9.9657843 -9.96578428
## Gene-5 -4.6355449 -1.79115348  0.15235360 -2.659533 -4.3869438 -9.96578428
## Gene-6 -1.6597991 -9.96578428  0.06809008 -1.778486  0.4076344  0.23690875

This time the output looks better: no NaN values remain.
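
One detail worth checking is what happens to exact zeros, since the condition `< 0` does not catch them. The sketch below uses a small invented vector (`x` is hypothetical) to show that log2(0) gives -Inf, and that testing with `<= 0` floors zeros as well. Whether zeros occur in the mock microarray file is not known from these notes, so treat this as a precaution rather than a required fix.

```r
## Hypothetical values, including an exact zero and a negative number
x = c(4, 1, 0, -2)

## log2(0) is -Inf and log2 of a negative number is NaN (with a warning)
raw.log = suppressWarnings(log2(x))

## Flooring with <= 0 (instead of < 0) also catches exact zeros
floor.value = 0.001   # same constant as in Code Chunk 3
x.floored = x
x.floored[x.floored <= 0] = floor.value
floored.log = log2(x.floored)

## Confirm that every logged value is now finite
all(is.finite(floored.log))
```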

Subset the first 10 rows and the first 5 columns

#*#############
#*########
## Code Chunk 4


subset.data = log.mock.data[1:10, 1:5]
head(subset.data)
##            Norm.1      Norm.2      Norm.3     Exp.1      Exp.2
## Gene-1 -3.2189248 -9.96578428 -9.96578428 -9.965784 -9.9657843
## Gene-2 -9.9657843  0.01988361 -9.96578428 -9.965784 -9.9657843
## Gene-3 -0.1052263 -9.96578428  0.36914525 -9.965784 -0.3192483
## Gene-4  1.1677221 -9.96578428 -9.96578428  0.773928 -9.9657843
## Gene-5 -4.6355449 -1.79115348  0.15235360 -2.659533 -4.3869438
## Gene-6 -1.6597991 -9.96578428  0.06809008 -1.778486  0.4076344

Sort the matrix based on p-value

#*#############
#*########
## Code Chunk 5


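## Sort the p-values in increasing order
sorted.p = sort(mock.p)
## Row indices that put mock.p in increasing order
sorted.p.index = sort.int(mock.p, index.return = TRUE)$ix

## Reorder the rows of the logged data by p-value
sorted.mock.data.based.on.p = log.mock.data[sorted.p.index,]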
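
A common alternative for getting the sorting indices is order(). The sketch below reuses the six p-values printed by head() earlier, paired with an invented two-column matrix (`m` and its column names are hypothetical), and checks that order() gives the same indices as the sort.int() call when there are no missing values.

```r
## The six p-values shown by head(); the matrix m is invented for illustration
p = c(0.2106, 0.7905, 0.2263, 0.2536, 0.0458, 0.6628)
m = matrix(1:12, nrow = 6, dimnames = list(paste0("Gene-", 1:6), c("a", "b")))

## order() returns the row indices that would sort p in increasing order
idx = order(p)

## Reorder the rows; drop = FALSE keeps the result a matrix
m.sorted = m[idx, , drop = FALSE]
rownames(m.sorted)

## Same indices as the sort.int() approach (no NA values here)
identical(idx, sort.int(p, index.return = TRUE)$ix)
```

Use order(p, decreasing = TRUE) to put the largest p-values first.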


Topics: Table of Contents

A professional document starts with a table of contents. To add one to this document, add the following lines to the document's metadata (YAML header) section between the --- markers.

Make sure you lay out the attributes with exactly the indentation shown:

#*#############
#*########
## Code Chunk 6


output: 
  html_document:
      toc: true
      toc_depth: 3

Topics: List

Given a text file with 3 lines of differing data types and varying element counts (numeric, integer, and character), all comma separated, create a list data structure to store them.

Download the dataset mock.list into the data folder

#*#############
#*########
## Code Chunk 7


mock.list = readLines(con = './data/mock.list.txt')

mock.number = strsplit(mock.list[1], split = ',')
mock.number = unlist(mock.number)
mock.number = as.numeric(mock.number)
str(mock.number)
##  num [1:3] 100 200 301
mock.integer = strsplit(mock.list[2], split = ',')
mock.integer = unlist(mock.integer)
mock.integer = as.integer(mock.integer)
str(mock.integer)
mock.character = strsplit(mock.list[3], split = ',')
mock.character = unlist(mock.character)
mock.character = as.character(mock.character)
str(mock.character)
##  chr [1:7] "tom" "john" "mary" "peter" "joe" "jo" "tzu"
## Putting them together in a list object

mock.list.object = list(mock.number, mock.integer, mock.character)

## You can also name each element when creating the list

mock.list.object = list('mock_num'=mock.number, 'mock_int'=mock.integer, 'mock_chr'=mock.character)

## Accessing the list elements

mock.list.object[[1]]
## [1] 100.1300 200.0900 300.6253
mock.list.object$mock_num
## [1] 100.1300 200.0900 300.6253
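
The difference between `[[ ]]`, `$`, and `[ ]` on a list is easy to miss, so here is a small sketch. The list is rebuilt inline with stand-in values (the integer line and the shortened name vector are assumptions, since the contents of the file are not fully shown in these notes).

```r
## Stand-in values for the three lines of the file (partly assumed)
mock.list.object = list(mock_num = c(100.13, 200.09, 300.6253),
                        mock_int = c(1L, 2L, 3L),
                        mock_chr = c("tom", "john", "mary"))

## [[ ]] and $ return the element itself (here a numeric vector)
class(mock.list.object[["mock_num"]])

## [ ] returns a sub-list that still wraps the element
class(mock.list.object["mock_num"])

## sapply() applies a function to every element, e.g. to get each length
sapply(mock.list.object, length)
```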


Topics: Data Frame

Download the dataset mock.dataframe into the data folder

#*#############
#*########
## Code Chunk 8


mock.data.frame = read.table('./data/mock.dataframe.txt')

str(mock.data.frame)
## 'data.frame':    6 obs. of  3 variables:
##  $ x.num     : num  100 200 301 388 773 ...
##  $ x.interger: int  3 6 9 12 53 100
##  $ x.chr     : Factor w/ 6 levels "joe","john","mary",..: 5 2 3 4 1 6
mode(mock.data.frame)
## [1] "list"
class(mock.data.frame)
## [1] "data.frame"
## Accessing an element (column) of the data frame

mock.data.frame$x.num
## [1]  100.1300  200.0900  300.6253  387.8300  773.0000 9292.9760
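
The str() output above shows x.chr as a Factor. Since R 4.0.0 the default of stringsAsFactors in read.table() is FALSE, so a newer R would report a character column instead. The sketch below sets the argument explicitly to show both outcomes. The inline table is a shortened reconstruction from the output above and should be read as an illustrative assumption, not the actual file.

```r
## Small inline table; values are reconstructed from the str() output above
## and should be treated as an illustrative assumption
txt = "x.num x.chr
100.13 tom
200.09 john
300.6253 mary"

## stringsAsFactors is set explicitly so the result does not depend on
## the R version (the default changed from TRUE to FALSE in R 4.0.0)
df.chr = read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)
df.fac = read.table(text = txt, header = TRUE, stringsAsFactors = TRUE)

class(df.chr$x.chr)   # character
class(df.fac$x.chr)   # factor

## A column can be reached with $, [[ ]], or [row, column] indexing
df.chr$x.num
df.chr[["x.num"]]
df.chr[2, "x.chr"]
```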


Data Wrangling Utility Functions

The following code was adapted from R Workgroup 2015: Part 1. Please refer to the web document for a full description.


Subsetting data using the subset() function

subset() is a convenient utility for extracting part of a data frame. The following figure shows how subset() uses the content of the column named col2 to extract the rows that satisfy a condition.

alt text

Download the dataset WordLearnEx into the data folder

#*#############
#*########
## Code Chunk 9


## Import sample data "WordLearnEx.txt"
WordLearnEx <- read.delim("./data/WordLearnEx.txt")

Examine the WordLearnEx variable using the RStudio Environment tab.
For example, the TP column has 2 levels: High and Low.

#*#############
#*########
## Code Chunk 10


## Extract all rows that satisfy TP == "High"
WL.hi <- subset(WordLearnEx, TP == "High")

You can also use multiple constraints. The following figure shows how you can extract rows whose value is greater than or equal to 6.

alt text

#*#############
#*########
## Code Chunk 11


## Use the Accuracy column to filter
WL.mid <- subset(WordLearnEx, Accuracy > 0 & Accuracy < 1)
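
To see what subset() does with a combined condition, it can help to compare it with plain bracket indexing. The sketch below uses an invented data frame (`toy` mimics two WordLearnEx column names, but the values are hypothetical). One behavior worth knowing: subset() silently drops rows where the condition evaluates to NA, whereas logical bracket indexing returns an all-NA row for them unless the condition is wrapped in which().

```r
## Hypothetical data frame; column names mimic WordLearnEx, values are invented
toy = data.frame(TP = c("High", "Low", "High", "High"),
                 Accuracy = c(1, 0.5, 0.75, NA))

## subset() evaluates the condition inside the data frame
by.subset = subset(toy, TP == "High" & Accuracy < 1)

## The same condition written out for bracket indexing
cond = toy$TP == "High" & toy$Accuracy < 1

## which() keeps only the TRUE positions, so the NA row is dropped
by.bracket = toy[which(cond), ]

## Plain logical indexing keeps an extra all-NA row for the NA condition
with.na.row = toy[cond, ]

nrow(by.subset)
nrow(by.bracket)
nrow(with.na.row)
```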

The select argument in subset() is very useful to select multiple columns by their column names


Note: installing libraries
To install a library, use: install.packages('psych')

#*#############
#*########
## Code Chunk 12


## We will be using the "affect" dataset in "psych" package
library(psych)
## Here is an example: keep only four of the columns
affect.neg <- subset(affect, select=c(Study, Film, NA1, NA2))

We are working on the affect variable.
But affect is nowhere to be found in the Environment tab.
Why?
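
One way to explore this question is sketched below with the built-in mtcars dataset rather than affect (a substitution made so the example runs without installing psych). Datasets that ship with a package are found through R's search path and are not copied into your global environment, which is the environment the RStudio tab lists.

```r
## mtcars ships with the datasets package, which is attached by default
in.global = "mtcars" %in% ls(globalenv())   # is it listed in the global environment?
where = find("mtcars")                      # where does R find it on the search path?

in.global
where

## The object is still usable even though it is not in the global environment
head(mtcars$mpg, 3)
```

Assigning the dataset to a new name, for example `my.cars = mtcars`, creates a copy in the global environment, and that copy does show up in the Environment tab.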

We can also combine row filtering and column selection to perform more refined filtering


#*#############
#*########
## Code Chunk 13


affect.neg.maps <- subset(affect, Study == "maps", select=c(Study, Film, NA1, NA2))
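
The combined call can be tried without the psych package on a small stand-in table. In the sketch below, `toy.affect` is hypothetical: it borrows column names from the affect example, but every value is invented. It also shows that select accepts a range of adjacent columns, and gives the equivalent bracket-indexing form.

```r
## Hypothetical stand-in for the affect data; values are invented
toy.affect = data.frame(Study = c("maps", "flat", "maps"),
                        Film  = c(1, 2, 3),
                        NA1   = c(2, 4, 6),
                        NA2   = c(3, 5, 7),
                        PA1   = c(9, 8, 7))

## Rows are filtered by the condition, columns by select
via.subset = subset(toy.affect, Study == "maps",
                    select = c(Study, Film, NA1, NA2))

## select also accepts a range of adjacent columns
via.range = subset(toy.affect, Study == "maps", select = Study:NA2)

## The same extraction with bracket indexing
via.bracket = toy.affect[toy.affect$Study == "maps",
                         c("Study", "Film", "NA1", "NA2")]

via.subset
```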

Sometimes, you only want to remove one column


#*#############
#*########
## Code Chunk 14


## Here I want to select everything except the Accuracy column
WL.nacc <- subset(WordLearnEx, select = -Accuracy)
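
The same minus notation extends to several columns at once. The sketch below uses an invented table (`toy` and its RT column are hypothetical) and shows a bracket-indexing equivalent built with setdiff().

```r
## Hypothetical data frame; values and the RT column are invented
toy = data.frame(Subject  = 1:3,
                 TP       = c("High", "Low", "High"),
                 Accuracy = c(1, 0.5, 0.75),
                 RT       = c(350, 420, 390))

## Drop several columns at once with a negated c()
no.scores = subset(toy, select = -c(Accuracy, RT))

## Bracket-indexing equivalent: keep every name except the unwanted ones
keep = setdiff(names(toy), c("Accuracy", "RT"))
no.scores.2 = toy[, keep, drop = FALSE]

names(no.scores)
names(no.scores.2)
```

drop = FALSE matters when only one column is left, because without it the result would collapse from a data frame to a plain vector.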


Use the merge() function to combine tables

Combining data together using matching values in a shared column is easy
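
As a minimal sketch of the idea, the two tables below share a Subject column. Both tables are hypothetical and their values invented for illustration. By default merge() keeps only the keys present in both tables, and all = TRUE keeps every key and fills the gaps with NA.

```r
## Hypothetical tables sharing a "Subject" column; values are invented
scores = data.frame(Subject = c(1, 2, 3), Accuracy = c(1, 0.5, 0.75))
groups = data.frame(Subject = c(2, 3, 4), TP = c("Low", "High", "High"))

## Default merge keeps only subjects present in both tables
both.only = merge(scores, groups, by = "Subject")

## all = TRUE keeps every subject and fills missing values with NA
everyone = merge(scores, groups, by = "Subject", all = TRUE)

both.only
everyone
```

The arguments all.x = TRUE and all.y = TRUE keep all rows from only the first or only the second table, respectively.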