- Introduction
- Big Data Import: A Better Solution
- Learning
`data.table`

package - Learning to map data
- Medicaid Healthcare Payment Data Analysis
- Homework

This lecture touches on `big-data`

analytics. We will cover:

- How to deal with
`big-data`

, and why conventional R methods are insufficient - Introduction to the
`data.table`

library for`big-data`

analytics - A real case study: Medicare Healthcare Payment

Today’s science is becoming `big-data`

science and learning how to deal with these data type is crucial for future data analysis.

`data.table()`

We explore using conventional R import/export methods to handle a big data table with *1 million* rows and 28 columns.

```
library(data.table)
## Base R method
ptm = proc.time()
read.table.demo = read.table(file = "./data/demo.data.txt")
read.table.time.ellapsed = proc.time() - ptm
cat("Ellapsed time is ", read.table.time.ellapsed[3]/60, " minutes\n")
## data.table package method
ptm = proc.time()
fread.demo = fread(input = "./data/demo.data.txt", sep = "\t")
fread.time.ellapsed = proc.time() - ptm
cat("Ellapsed time is ", fread.time.ellapsed[3]/60, " minutes\n")
```

Compare to the conventional method `read.table( )`

which took 1.04035 minutes, `fread( )`

from the `data.table`

package only took 0.05445 minutes.

`data.table`

packageThis segment was adapted from `data.table`

introductory material in PDF

You can find much more resources on `data.table`

from its home page

Also, a very helpful `data.table`

cheat sheet

`data.frame`

vs. `data.table`

We can create a `data.frame`

using the `data.frame( )`

function

```
## Create a simple data frame
DF = data.frame(x = c("b", "b", "b", "a", "a"), v = rnorm(5))
DF
```

```
## x v
## 1 b -0.7396150
## 2 b -0.4823626
## 3 b 0.2977370
## 4 a 1.8230478
## 5 a 0.6615337
```

In the same token, `data.table`

is create in a similar fashion

```
DT = data.table(x = c("b", "b", "b", "a", "a"), v = rnorm(5))
DT
```

```
## x v
## 1: b -0.42902437
## 2: b 0.42517770
## 3: b 0.10816124
## 4: a 0.01262858
## 5: a -0.16755428
```

Notice:

`data.table`

prints the row numbers with colon to distinguish itself from`data.frame`

`data.table`

objects in memory`data.table`

has a very nice utility `tables()`

to summarize existing `data.table`

object in memory

`tables()`

```
## NAME NROW NCOL MB
## [1,] DT 5 2 1
## [2,] fread.demo 1,000,000 28 214
## COLS
## [1,] x,v
## [2,] npi,nppes_provider_last_org_name,nppes_provider_first_name,nppes_provider_mi,npp
## KEY
## [1,]
## [2,]
## Total: 215MB
```

Notice:

The column`MB`

shows how much memory, in MegaBytes, used by each`data.table`

object

`apply`

function onto `data.table`

objectSimilar to `data.frame`

, we can also apply utility function such as the `apply`

family to the `data.table`

object.

`sapply(DT, class)`

```
## x v
## "character" "numeric"
```

`Keys`

in `data.table`

Instead of using `rownames`

in `data.frame`

, `data.table`

uses `key`

. `key`

has the following properties:

- a
`key`

consists of one or more columns?? (might not be true) - The columns may be integer, factor, number or character
- The row in
`data.table`

is sorted by this`key`

- Uniqueness is not enforced; i.e., duplicate key values are allowed (not true for
`rows`

in`data.frame`

)

The `key`

can be observed in the `tables()`

output. In this case, no `key`

is being defined yet in both `data.table`

objects.

`tables()`

```
## NAME NROW NCOL MB
## [1,] DT 5 2 1
## [2,] fread.demo 1,000,000 28 214
## COLS
## [1,] x,v
## [2,] npi,nppes_provider_last_org_name,nppes_provider_first_name,nppes_provider_mi,npp
## KEY
## [1,]
## [2,]
## Total: 215MB
```

`key`

Ever without setting `key`

, `data.table`

can still be subsetted.

```
## Select row 2
DT[2, ]
```

```
## x v
## 1: b 0.4251777
```

```
## Select rows where column x == 'b'
DT[x == "b", ]
```

```
## x v
## 1: b -0.4290244
## 2: b 0.4251777
## 3: b 0.1081612
```

```
## However, you can not do this:
cat(try(DT["b", ], silent = T))
```

```
## Error in `[.data.table`(DT, "b", ) :
## When i is a data.table (or character vector), x must be keyed (i.e. sorted, and, marked as sorted) so data.table knows which columns to join to and take advantage of x being sorted. Call setkey(x,...) first, see ?setkey.
```

`key`

Setting `key`

simplify the subsetting process

```
## This is how to set key Here, we set the key to columne 'x'
setkey(DT, x)
## This can be observed when calling tables()
tables()
```

```
## NAME NROW NCOL MB
## [1,] DT 5 2 1
## [2,] fread.demo 1,000,000 28 214
## COLS
## [1,] x,v
## [2,] npi,nppes_provider_last_org_name,nppes_provider_first_name,nppes_provider_mi,npp
## KEY
## [1,] x
## [2,]
## Total: 215MB
```

```
## Now, subsetting column x of 'b', is simply
DT["b", ]
```

```
## x v
## 1: b -0.4290244
## 2: b 0.4251777
## 3: b 0.1081612
```

`data.table`

: The Full NomenclatureTo subset the conventional `data.frame`

we need 2 arguments; `col`

and `row`

. In contract, to subset `data.table`

, we need to know three arguments. These 3 arguments follow `SQL`

convention.

This image was extracted from Matt Dowle talk

The following segment was adapted from `data.table`

cheat sheet

This segment will demostrate how to use each `i`

, `j`

, and `by`

arguments in `data.table`

- Subsetting rows using
`i`

- Manipulating on columns in
`j`

- Doing
`j`

`by`

Group

`data.table`

objectCreate a `data.table`

object and call it `DT`

```
set.seed(45L)
DT <- data.table(V1 = c(1L, 2L), V2 = LETTERS[1:3], V3 = round(rnorm(4), 4),
V4 = 1:12)
```

Questoin:

How do you know if this is a`data.table`

object, and not`data.frame`

?

`i`

argument```
## Subsetting rows by numbers For example: select third to fifth row
DT[3:5, ] ## or DT[3:5]
```

```
## V1 V2 V3 V4
## 1: 1 C -0.3795 3
## 2: 2 A -0.7460 4
## 3: 1 B 0.3408 5
```

```
## Use column names to select rows in i based on a condition using fast
## automatic indexing. For example: selects all rows that have value 'A' in
## column 'V2'
DT[V2 == "A"]
```

```
## V1 V2 V3 V4
## 1: 1 A 0.3408 1
## 2: 2 A -0.7460 4
## 3: 1 A -0.3795 7
## 4: 2 A -0.7033 10
```

```
## For example: selects all rows that have the value 'A' or 'C' in column
## 'V2'
DT[V2 %in% c("A", "C")]
```

```
## V1 V2 V3 V4
## 1: 1 A 0.3408 1
## 2: 1 C -0.3795 3
## 3: 2 A -0.7460 4
## 4: 2 C -0.7033 6
## 5: 1 A -0.3795 7
## 6: 1 C 0.3408 9
## 7: 2 A -0.7033 10
## 8: 2 C -0.7460 12
```

`j`

```
## Select 1 column in 'j' For example: Column V2 is return as a vector
DT[, V2]
```

`## [1] "A" "B" "C" "A" "B" "C" "A" "B" "C" "A" "B" "C"`

```
## Select several columns in 'j' For example: Columns V2 and V3 are returned
## as a data.table Note: .() is an alias to list() if .() is used, the
## returned value is a data.table if .() is not used, the result is a vector
DT[, .(V2, V3)]
```

```
## V2 V3
## 1: A 0.3408
## 2: B -0.7033
## 3: C -0.3795
## 4: A -0.7460
## 5: B 0.3408
## 6: C -0.7033
## 7: A -0.3795
## 8: B -0.7460
## 9: C 0.3408
## 10: A -0.7033
## 11: B -0.3795
## 12: C -0.7460
```

```
## Call function in j For example: Returns the sum of all elements of column
## V1 in a vector
DT[, sum(V1)]
```

`## [1] 18`

```
## Computing on several columns Return the sum of all elements of volumn V1
## and the standard deviation of V3 in data.table
DT[, .(sum(V1), sd(V3))]
```

```
## V1 V2
## 1: 18 0.4546055
```

```
## Assigning column names to computed columns For example: The same as above,
## but with new names
DT[, .(Aggregate = sum(V1), Sd.V3 = sd(V3))]
```

```
## Aggregate Sd.V3
## 1: 18 0.4546055
```

```
## Columns get recycled if different length Select column V1, and compute
## std. dev. of V3, which return a single value and gets recycled
DT[, .(V1, Sd.V3 = sd(V3))]
```

```
## V1 Sd.V3
## 1: 1 0.4546055
## 2: 2 0.4546055
## 3: 1 0.4546055
## 4: 2 0.4546055
## 5: 1 0.4546055
## 6: 2 0.4546055
## 7: 1 0.4546055
## 8: 2 0.4546055
## 9: 1 0.4546055
## 10: 2 0.4546055
## 11: 1 0.4546055
## 12: 2 0.4546055
```

```
## Multiple expressions can be wrapped in curly braces For example: Print
## column V2 and plot V3
DT[, {
print(V2)
plot(V3)
NULL
}]
```

`## [1] "A" "B" "C" "A" "B" "C" "A" "B" "C" "A" "B" "C"`

`## NULL`

`j`

`by`

Group```
## Doing j by group For example: Calculate the sum of V4, for every group in
## V1
DT[, .(V4.Sum = sum(V4)), by = V1]
```

```
## V1 V4.Sum
## 1: 1 36
## 2: 2 42
```

```
## Doing j by several groups using .() For example: the same as above, but
## for every group in V1 and V2
DT[, .(V4.Sum = sum(V4)), by = .(V1, V2)]
```

```
## V1 V2 V4.Sum
## 1: 1 A 8
## 2: 2 B 10
## 3: 1 C 12
## 4: 2 A 14
## 5: 1 B 16
## 6: 2 C 18
```

```
## Call functions in 'by' For example: Calculates the sum of V4, for every
## group in sign(V1-1)
DT[, .(V4.Sum = sum(V4)), by = sign(V1 - 1)]
```

```
## sign V4.Sum
## 1: 0 36
## 2: 1 42
```

```
## Assigning new column names in 'by' For example: same as above, but with a
## new name for the variable we are grouping by.
DT[, .(V4.Sum = sum(V4)), by = .(V1.01 = sign(V1 - 1))]
```

```
## V1.01 V4.Sum
## 1: 0 36
## 2: 1 42
```

```
## Grouping only on a subset by specifying i For example: Calculate the sum
## of V4, for every group in V1, after subsetting on the first five rows
DT[1:5, .(V4.Sum = sum(V4)), by = V1]
```

```
## V1 V4.Sum
## 1: 1 9
## 2: 2 6
```

You are encourage to learn more about how to use the `data.table`

object from the `data.table`

cheat sheet

This lesson in this segment was adapted from this book; “R Graphics CookBook, by Winston Chang”

```
## from R Graphic Cook Book
library(ggplot2)
library(maps)
library(mapproj)
## Get map data from USA Note: 'map_data( )' is actually from 'ggplot2'
## package
states_map = map_data("state")
## Plot the USA state map
ggplot(states_map, aes(x = long, y = lat, group = group)) + geom_polygon(fill = "white",
colour = "black")
```