Introduction

Lecture Sypnosis

This lecture touches on big-data analytics. We will cover:

  • How to deal with big-data, and why conventional R methods are insufficient
  • Introduction to the data.table library for big-data analytics
  • A real case study: Medicare Healthcare Payment

Big Data Import: A Better Solution

Today’s science is becoming big-data science and learning how to deal with these data type is crucial for future data analysis.

Import big data with data.table()

We explore using conventional R import/export methods to handle a big data table with 1 million rows and 28 columns.

library(data.table)

## Base R method
ptm = proc.time()
read.table.demo = read.table(file = "./data/demo.data.txt")
read.table.time.ellapsed = proc.time() - ptm
cat("Ellapsed time is ", read.table.time.ellapsed[3]/60, " minutes\n")


## data.table package method
ptm = proc.time()
fread.demo = fread(input = "./data/demo.data.txt", sep = "\t")
fread.time.ellapsed = proc.time() - ptm
cat("Ellapsed time is ", fread.time.ellapsed[3]/60, " minutes\n")

Compare to the conventional method read.table( ) which took 1.04035 minutes, fread( ) from the data.table package only took 0.05445 minutes.

Learning data.table package

This segment was adapted from data.table introductory material in PDF

You can find much more resources on data.table from its home page

Also, a very helpful data.table cheat sheet

Creating data.frame vs. data.table

We can create a data.frame using the data.frame( ) function

## Create a simple data frame
DF = data.frame(x = c("b", "b", "b", "a", "a"), v = rnorm(5))
DF
##   x          v
## 1 b -0.7396150
## 2 b -0.4823626
## 3 b  0.2977370
## 4 a  1.8230478
## 5 a  0.6615337

In the same token, data.table is create in a similar fashion

DT = data.table(x = c("b", "b", "b", "a", "a"), v = rnorm(5))
DT
##    x           v
## 1: b -0.42902437
## 2: b  0.42517770
## 3: b  0.10816124
## 4: a  0.01262858
## 5: a -0.16755428

Notice:
data.table prints the row numbers with colon to distinguish itself from data.frame

Listing data.table objects in memory

data.table has a very nice utility tables() to summarize existing data.table object in memory

tables()
##      NAME            NROW NCOL  MB
## [1,] DT                 5    2   1
## [2,] fread.demo 1,000,000   28 214
##      COLS                                                                            
## [1,] x,v                                                                             
## [2,] npi,nppes_provider_last_org_name,nppes_provider_first_name,nppes_provider_mi,npp
##      KEY
## [1,]    
## [2,]    
## Total: 215MB

Notice:
The column MB shows how much memory, in MegaBytes, used by each data.table object

Apply apply function onto data.table object

Similar to data.frame, we can also apply utility function such as the apply family to the data.table object.

sapply(DT, class)
##           x           v 
## "character"   "numeric"

Keys in data.table

Instead of using rownames in data.frame, data.table uses key. key has the following properties:

  • a key consists of one or more columns?? (might not be true)
  • The columns may be integer, factor, number or character
  • The row in data.table is sorted by this key
  • Uniqueness is not enforced; i.e., duplicate key values are allowed (not true for rows in data.frame)

The key can be observed in the tables() output. In this case, no key is being defined yet in both data.table objects.

tables()
##      NAME            NROW NCOL  MB
## [1,] DT                 5    2   1
## [2,] fread.demo 1,000,000   28 214
##      COLS                                                                            
## [1,] x,v                                                                             
## [2,] npi,nppes_provider_last_org_name,nppes_provider_first_name,nppes_provider_mi,npp
##      KEY
## [1,]    
## [2,]    
## Total: 215MB

Subsetting without key

Ever without setting key, data.table can still be subsetted.

## Select row 2
DT[2, ]
##    x         v
## 1: b 0.4251777
## Select rows where column x == 'b'
DT[x == "b", ]
##    x          v
## 1: b -0.4290244
## 2: b  0.4251777
## 3: b  0.1081612
## However, you can not do this:
cat(try(DT["b", ], silent = T))
## Error in `[.data.table`(DT, "b", ) : 
##   When i is a data.table (or character vector), x must be keyed (i.e. sorted, and, marked as sorted) so data.table knows which columns to join to and take advantage of x being sorted. Call setkey(x,...) first, see ?setkey.

Subsetting with key

Setting key simplify the subsetting process

## This is how to set key Here, we set the key to columne 'x'
setkey(DT, x)

## This can be observed when calling tables()
tables()
##      NAME            NROW NCOL  MB
## [1,] DT                 5    2   1
## [2,] fread.demo 1,000,000   28 214
##      COLS                                                                            
## [1,] x,v                                                                             
## [2,] npi,nppes_provider_last_org_name,nppes_provider_first_name,nppes_provider_mi,npp
##      KEY
## [1,] x  
## [2,]    
## Total: 215MB
## Now, subsetting column x of 'b', is simply
DT["b", ]
##    x          v
## 1: b -0.4290244
## 2: b  0.4251777
## 3: b  0.1081612

data.table: The Full Nomenclature

To subset the conventional data.frame we need 2 arguments; col and row. In contract, to subset data.table, we need to know three arguments. These 3 arguments follow SQL convention.

alt text
This image was extracted from Matt Dowle talk

alt text

The following segment was adapted from data.table cheat sheet

Segment sypnosis

This segment will demostrate how to use each i, j, and by arguments in data.table

  • Subsetting rows using i
  • Manipulating on columns in j
  • Doing j by Group

Create data.table object

Create a data.table object and call it DT

set.seed(45L)
DT <- data.table(V1 = c(1L, 2L), V2 = LETTERS[1:3], V3 = round(rnorm(4), 4), 
    V4 = 1:12)

Questoin:
How do you know if this is a data.table object, and not data.frame?

Subsetting Rows Using the i argument

## Subsetting rows by numbers For example: select third to fifth row
DT[3:5, ]  ## or DT[3:5]
##    V1 V2      V3 V4
## 1:  1  C -0.3795  3
## 2:  2  A -0.7460  4
## 3:  1  B  0.3408  5
## Use column names to select rows in i based on a condition using fast
## automatic indexing.  For example: selects all rows that have value 'A' in
## column 'V2'
DT[V2 == "A"]
##    V1 V2      V3 V4
## 1:  1  A  0.3408  1
## 2:  2  A -0.7460  4
## 3:  1  A -0.3795  7
## 4:  2  A -0.7033 10
## For example: selects all rows that have the value 'A' or 'C' in column
## 'V2'
DT[V2 %in% c("A", "C")]
##    V1 V2      V3 V4
## 1:  1  A  0.3408  1
## 2:  1  C -0.3795  3
## 3:  2  A -0.7460  4
## 4:  2  C -0.7033  6
## 5:  1  A -0.3795  7
## 6:  1  C  0.3408  9
## 7:  2  A -0.7033 10
## 8:  2  C -0.7460 12

Manipulating On Columns in j

## Select 1 column in 'j' For example: Column V2 is return as a vector
DT[, V2]
##  [1] "A" "B" "C" "A" "B" "C" "A" "B" "C" "A" "B" "C"
## Select several columns in 'j' For example: Columns V2 and V3 are returned
## as a data.table Note: .() is an alias to list() if .() is used, the
## returned value is a data.table if .() is not used, the result is a vector
DT[, .(V2, V3)]
##     V2      V3
##  1:  A  0.3408
##  2:  B -0.7033
##  3:  C -0.3795
##  4:  A -0.7460
##  5:  B  0.3408
##  6:  C -0.7033
##  7:  A -0.3795
##  8:  B -0.7460
##  9:  C  0.3408
## 10:  A -0.7033
## 11:  B -0.3795
## 12:  C -0.7460
## Call function in j For example: Returns the sum of all elements of column
## V1 in a vector
DT[, sum(V1)]
## [1] 18
## Computing on several columns Return the sum of all elements of volumn V1
## and the standard deviation of V3 in data.table
DT[, .(sum(V1), sd(V3))]
##    V1        V2
## 1: 18 0.4546055
## Assigning column names to computed columns For example: The same as above,
## but with new names
DT[, .(Aggregate = sum(V1), Sd.V3 = sd(V3))]
##    Aggregate     Sd.V3
## 1:        18 0.4546055
## Columns get recycled if different length Select column V1, and compute
## std. dev. of V3, which return a single value and gets recycled
DT[, .(V1, Sd.V3 = sd(V3))]
##     V1     Sd.V3
##  1:  1 0.4546055
##  2:  2 0.4546055
##  3:  1 0.4546055
##  4:  2 0.4546055
##  5:  1 0.4546055
##  6:  2 0.4546055
##  7:  1 0.4546055
##  8:  2 0.4546055
##  9:  1 0.4546055
## 10:  2 0.4546055
## 11:  1 0.4546055
## 12:  2 0.4546055
## Multiple expressions can be wrapped in curly braces For example: Print
## column V2 and plot V3
DT[, {
    print(V2)
    plot(V3)
    NULL
}]
##  [1] "A" "B" "C" "A" "B" "C" "A" "B" "C" "A" "B" "C"

## NULL

Doing j by Group

## Doing j by group For example: Calculate the sum of V4, for every group in
## V1
DT[, .(V4.Sum = sum(V4)), by = V1]
##    V1 V4.Sum
## 1:  1     36
## 2:  2     42
## Doing j by several groups using .() For example: the same as above, but
## for every group in V1 and V2
DT[, .(V4.Sum = sum(V4)), by = .(V1, V2)]
##    V1 V2 V4.Sum
## 1:  1  A      8
## 2:  2  B     10
## 3:  1  C     12
## 4:  2  A     14
## 5:  1  B     16
## 6:  2  C     18
## Call functions in 'by' For example: Calculates the sum of V4, for every
## group in sign(V1-1)
DT[, .(V4.Sum = sum(V4)), by = sign(V1 - 1)]
##    sign V4.Sum
## 1:    0     36
## 2:    1     42
## Assigning new column names in 'by' For example: same as above, but with a
## new name for the variable we are grouping by.
DT[, .(V4.Sum = sum(V4)), by = .(V1.01 = sign(V1 - 1))]
##    V1.01 V4.Sum
## 1:     0     36
## 2:     1     42
## Grouping only on a subset by specifying i For example: Calculate the sum
## of V4, for every group in V1, after subsetting on the first five rows
DT[1:5, .(V4.Sum = sum(V4)), by = V1]
##    V1 V4.Sum
## 1:  1      9
## 2:  2      6

You are encourage to learn more about how to use the data.table object from the data.table cheat sheet

Learning to map data

This lesson in this segment was adapted from this book; “R Graphics CookBook, by Winston Chang”

alt text

## from R Graphic Cook Book

library(ggplot2)
library(maps)
library(mapproj)

## Get map data from USA Note: 'map_data( )' is actually from 'ggplot2'
## package
states_map = map_data("state")

## Plot the USA state map
ggplot(states_map, aes(x = long, y = lat, group = group)) + geom_polygon(fill = "white", 
    colour = "black")