Chapter 9 Use data.table Package for Big Data II

Here we will look at some advanced use of data.table. You will be amazed by the power of data.table for data analysis.

Again, we need to first load data.table package first.

library(data.table)

As a quick review, we have learned to use fread() to read csv file into R as data.table; or use data.table() to convert an existing data.frame into a data.table.

data.table provides a simple, consistent, user-friendly syntax for data manipulation. The general form of data.table syntax is:

DT[i, j, by]

where

DT is a data.table.
by: grouped by what?
j: what to do?
i: on which rows?

In this chapter,we will explore the advanced use of data.table, using the flights dataset as an illustration.

Let’s read flights.csv into R using fread() function in data.table package.

flights=fread("data/flights.csv")

9.1 Use Chaining to Avoid Intermediate Result

Let’s reconsider the task of getting the total number of flights for each origin, destination pair for carrier “AA”:

tmp <- flights[carrier == "AA", .(count=.N), by = .(origin, dest)]

How can we order tmp using the columns origin in ascending order, and destination in descending order? We can store the intermediate result in a data.table, and then apply order(origin, -dest) on that data.table. It seems fairly straightforward.

tmp <- tmp[order(origin, -dest)]
head(tmp,8)

##    origin dest count
## 1:    EWR  MIA  1068
## 2:    EWR  LAX   365
## 3:    EWR  DFW  2054
## 4:    JFK  TPA   311
## 5:    JFK  STT   303
## 6:    JFK  SJU  1099
## 7:    JFK  SFO  1422
## 8:    JFK  SEA   365

This requires having to create an intermediate data.table and then overwriting that data.table. The intermediate data.table has no other use. When you are working on big data, these intermediate data will soon consumes your computer memory and makes the code slow to run.

We can do better and avoid this intermediate data.table altogether by chaining expressions. For the above task, we can:

tmp <- flights[carrier == "AA", .(count=.N), by = .(origin, dest)][order(origin, -dest)]

# print the first 8 rows of ans
head(tmp, 8)

##    origin dest count
## 1:    EWR  MIA  1068
## 2:    EWR  LAX   365
## 3:    EWR  DFW  2054
## 4:    JFK  TPA   311
## 5:    JFK  STT   303
## 6:    JFK  SJU  1099
## 7:    JFK  SFO  1422
## 8:    JFK  SEA   365

Let’s look at this code. [carrier == “AA”, .(count=.N), by = .(origin, dest)] is first executed; followed by [order(origin, -dest)]. We do not need to create an intermediate result.

In general, we can tack expressions one after another, forming a chain of operations, i.e., DT[ … ][ … ][ … ].

The chaining operation is very important in R to speedup the R code and has been implemented in many other packages, e.g., the pipe, %>%, from the magrittr package. We will take about the chaining with pipe operator %>% in later chapters.

Let’s look at the another example: Find the origin, dest pair with the longest average air_time.

flights[,.(avg_air_time=mean(air_time)), by=.(origin, dest)][order(-avg_air_time)][1:3]

##    origin dest avg_air_time
## 1:    JFK  HNL     623.0877
## 2:    EWR  ANC     413.1250
## 3:    JFK  JAC     275.5000

The result shows that the longest flight pair is JFK-HNL (Hawaii). Chaining allows you to answer questions like this with one simple line of code! Really cool!

Now, let’s spend sometime examine the code. The first expression [,.(avg_air_time=mean(air_time)), by=.(origin, dest)] is excuted first, and the new data.table is created in the memory. Since avg_air_time is created, the second expression [order(-avg_air_time)] can use this new variable avg_air_time for sorting. The third expression then filtering the top three observations.

9.2 uniqueN()

uniqueN() is a handy function that returns an integer value containing the number of unique values in the input object. It accepts vectors as well as data.frames and data.tables as its argument.

E.g., we want to find out how many unique id in this vector: student_id=c(1,2,3,4,1,2)

student_id=c(1,2,3,4,1,2)
uniqueN(student_id)

## [1] 4

In data.table, we can use uniqueN() together with by to get summary statistics such as: how many direct flights are at each origin airport? In other words, how many unique destination for each origin?

flights[, uniqueN(dest), by=.(origin)]

##    origin V1
## 1:    EWR 86
## 2:    LGA 68
## 3:    JFK 70

Thus, we know that EWR airport has direct flights to 86 other airports.

9.3 Subset of Data: .SD[]

.SD is a special symbol which stands for Subset of Data. It contains subset of data corresponding to each group; which itself is a data.table.

Why we need this? Let’s look at one example: You want to the find out, for each origin, dest pair, which flight has the longest arr_delay? Intuitively, we know you should sort the data according to origin, dest, arr_delay.

flights[order(origin,dest,-arr_delay)]

To help you see result of the above code, let’s print the first few rows and a few columns (origin, dest, month, day, arr_delay).

head( flights[order(origin,dest,-arr_delay)][,.(origin,dest,month,day,arr_delay)] )

##    origin dest month day arr_delay
## 1:    EWR  ALB     1  25       328
## 2:    EWR  ALB    12   5       300
## 3:    EWR  ALB     1  31       268
## 4:    EWR  ALB     3   8       263
## 5:    EWR  ALB     2  26       217
## 6:    EWR  ALB     3  15       190

But how to retrieve the top row for each origin, dest pair? .SD[] is designed for that purpose. As mentioned, .SD denotes the data.table which contains subset of data corresponding to each group. We can use .SD[i] to retrieve the i-th row from the data.table.

# use .SD[] to retrieve the top row for each group.
flights[order(origin,dest,-arr_delay)][, .SD[1], by=.(origin,dest)]

We can use .SDcols to select columns contained in .SD.

# use .SD[] to select top row for each group and select month, day, arr_delay
flights[order(origin,dest,-arr_delay)][, .SD[1], by=.(origin,dest),.SDcols=c("month","day","arr_delay")]

You can also select the top n rows for each group using .SD

# use .SD[] to select top three row for each group and select month, day, arr_delay
flights[order(origin,dest,-arr_delay)][, .SD[1:3], by=.(origin,dest),.SDcols=c("month","day","arr_delay")]

Using is .SD[] provides an easy way to identify the outlier (or the most interesting data point) in our data for each group. This can help us to answer question such as: what is the top sales person for each region? which county has the top population for each state?

9.4 Use := to Add/Update Columns By Reference

data.table defines a new operator := for adding/updating columns by reference.

The syntax of := is as follow: LHS := RHS

E.g., we want to create a new variable called total_delay, which equals to arr_delay+arr_delay. We can use the following code:

flights[,total_delay:=arr_delay+arr_delay]
str(flights)

## Classes 'data.table' and 'data.frame':   336776 obs. of  20 variables:
##  $ year          : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int  515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : int  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int  819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : int  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr  "UA" "UA" "AA" "B6" ...
##  $ flight        : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr  "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr  "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr  "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : int  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : int  1400 1416 1089 1576 762 719 1065 229 944 733 ...
##  $ hour          : int  5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : int  15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
##  $ total_delay   : int  22 40 66 -36 -50 24 38 -28 -16 16 ...
##  - attr(*, ".internal.selfref")=<externalptr> 
##  - attr(*, "index")= int(0) 
##   ..- attr(*, "__carrier")= int [1:336776] 117 428 429 434 452 482 496 501 504 506 ...

We will see the variable total_delay is created.

You may be wondering what “updating by reference” means. Under the data.frame, when we need to update a column, R will read the whole data.frame into memory, update the whole data.frame and write the updated data.frame into a new object. This is really inefficient. Under data.table with :=, we do not need to updated the whole data.table object; Instead, we just refer the new column to the computer memory address that contains the new column. Therefore, := is much more efficient. The efficiency difference is particular evident when working with big data.

You can also create multiple new columns simultaneously using := . E.g., you need to create a total_delay and speed=distance/(air_time/60). Note that speed is measured by miles per hour.

flights[,c("total_delay","speed"):=.(arr_delay+arr_delay,  distance/air_time*60)]
str(flights)

## Classes 'data.table' and 'data.frame':   336776 obs. of  21 variables:
##  $ year          : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int  515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : int  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int  819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : int  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr  "UA" "UA" "AA" "B6" ...
##  $ flight        : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr  "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr  "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr  "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : int  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : int  1400 1416 1089 1576 762 719 1065 229 944 733 ...
##  $ hour          : int  5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : int  15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
##  $ total_delay   : int  22 40 66 -36 -50 24 38 -28 -16 16 ...
##  $ speed         : num  370 374 408 517 394 ...
##  - attr(*, ".internal.selfref")=<externalptr> 
##  - attr(*, "index")= int(0) 
##   ..- attr(*, "__carrier")= int [1:336776] 117 428 429 434 452 482 496 501 504 506 ...

Note that, on the LHS of :=, i.e., c(“total_delay”,“speed”), we used variable name inside "" because these variables are not yet defined in the data.table and thus cannot be referred as variables directly.

We can combine := with by group argument to create new variables by groups. E.g., we want to add a new column avg_air_time which is the average air_time for the origin, dest pair.

tmp=flights[, avg_air_time:=mean(air_time, na.rm = TRUE), by = .(origin, dest)]
head(tmp)

##    year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1: 2013     1   1      517            515         2      830            819
## 2: 2013     1   1      533            529         4      850            830
## 3: 2013     1   1      542            540         2      923            850
## 4: 2013     1   1      544            545        -1     1004           1022
## 5: 2013     1   1      554            600        -6      812            837
## 6: 2013     1   1      554            558        -4      740            728
##    arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1:        11      UA   1545  N14228    EWR  IAH      227     1400    5     15
## 2:        20      UA   1714  N24211    LGA  IAH      227     1416    5     29
## 3:        33      AA   1141  N619AA    JFK  MIA      160     1089    5     40
## 4:       -18      B6    725  N804JB    JFK  BQN      183     1576    5     45
## 5:       -25      DL    461  N668DN    LGA  ATL      116      762    6      0
## 6:        12      UA   1696  N39463    EWR  ORD      150      719    5     58
##              time_hour total_delay    speed avg_air_time
## 1: 2013-01-01 05:00:00          22 370.0441     195.8404
## 2: 2013-01-01 05:00:00          40 374.2731     200.7932
## 3: 2013-01-01 05:00:00          66 408.3750     152.1300
## 4: 2013-01-01 05:00:00         -36 516.7213     194.2411
## 5: 2013-01-01 06:00:00         -50 394.1379     113.5550
## 6: 2013-01-01 05:00:00          24 287.6000     113.2603

Note that we created a new variable as the average air time.

Note that we can also compute on j by group as we illustrate in previous chapter (shown in the code below. However, we cannot add this calculation as a new column to the original data.table as we did with :=.

tmp=flights[, .(avg_air_time=mean(air_time, na.rm = TRUE)), by = .(origin, dest)]
head(tmp)

##    origin dest avg_air_time
## 1:    EWR  IAH     195.8404
## 2:    LGA  IAH     200.7932
## 3:    JFK  MIA     152.1300
## 4:    JFK  BQN     194.2411
## 5:    LGA  ATL     113.5550
## 6:    EWR  ORD     113.2603

9.5 Binning a continuous value into category

In many application, we want to convert a continuous value into levels. E.g., we want to classify flights into short/median/long flight based on the average distance. Specifically, if average distance<=520 (1st Qu), short; else if average distance<=1389 (3rd Qu.), median; else long.

distance_level <- function(distance) {
  avg_distance=mean(distance, na.rm = TRUE)
  if (avg_distance <= 520) "short"
  else if (avg_distance <= 1389) "medium"
  else "long"
}

tmp=flights[, distance_level:= .( distance_level(distance) ), by=.(origin, dest)]
head(tmp)

##    year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1: 2013     1   1      517            515         2      830            819
## 2: 2013     1   1      533            529         4      850            830
## 3: 2013     1   1      542            540         2      923            850
## 4: 2013     1   1      544            545        -1     1004           1022
## 5: 2013     1   1      554            600        -6      812            837
## 6: 2013     1   1      554            558        -4      740            728
##    arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1:        11      UA   1545  N14228    EWR  IAH      227     1400    5     15
## 2:        20      UA   1714  N24211    LGA  IAH      227     1416    5     29
## 3:        33      AA   1141  N619AA    JFK  MIA      160     1089    5     40
## 4:       -18      B6    725  N804JB    JFK  BQN      183     1576    5     45
## 5:       -25      DL    461  N668DN    LGA  ATL      116      762    6      0
## 6:        12      UA   1696  N39463    EWR  ORD      150      719    5     58
##              time_hour total_delay    speed avg_air_time distance_level
## 1: 2013-01-01 05:00:00          22 370.0441     195.8404           long
## 2: 2013-01-01 05:00:00          40 374.2731     200.7932           long
## 3: 2013-01-01 05:00:00          66 408.3750     152.1300         medium
## 4: 2013-01-01 05:00:00         -36 516.7213     194.2411           long
## 5: 2013-01-01 06:00:00         -50 394.1379     113.5550         medium
## 6: 2013-01-01 05:00:00          24 287.6000     113.2603         medium

9.6 Expressions in by

Recall the data.table syntax DT[i, j, by]. Like argument j, argument by also accepts expressions? As an example, if we would like to find out how many flights departed late but arrived early, departed and arrived late, departed early and arrived early, departed early but arrived late:

flights[, .(count=.N), by=.(dep_late=dep_delay>0, arr_late=arr_delay>0)]

##    dep_late arr_late  count
## 1:     TRUE     TRUE  92303
## 2:    FALSE    FALSE 158900
## 3:    FALSE     TRUE  40701
## 4:     TRUE    FALSE  35442
## 5:    FALSE       NA    488
## 6:     TRUE       NA    687
## 7:       NA       NA   8255

We can further show calcuate count as percentage of total.

flights[, .(count=.N), by=.(dep_late=dep_delay>0, arr_late=arr_delay>0)][, percent:=count/sum(count)][]

##    dep_late arr_late  count     percent
## 1:     TRUE     TRUE  92303 0.274078319
## 2:    FALSE    FALSE 158900 0.471826971
## 3:    FALSE     TRUE  40701 0.120854812
## 4:     TRUE    FALSE  35442 0.105239091
## 5:    FALSE       NA    488 0.001449034
## 6:     TRUE       NA    687 0.002039932
## 7:       NA       NA   8255 0.024511842

As seen, 27.4% of the flights does not depart nor arrive late. About half of flights depart and arrive on time. This is a very useful and quick way to slice data into different groups and compute summary statistics accordingly.

9.7 Summary

The basic data.table syntax DT[ i,j, by] is like a building block and can be chained DT[i, j, by][…][…] to avoid intermediate results.
Use uniqueN() to count the unqiue value in a group
Use .SD[ ] to select the top n rows in a group
Use := to add new columns in a data.table
Convert a continuous value into category
Use expression in by