Chapter 8 Use data.table Package for Big Data I

Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data. There are currently 120,000 packages in R, which empower you to do all kinds of data analysis. This huge variety of packages is one of the reasons that R is so successful: the chances are that someone has already solved a problem that you’re working on, and you can benefit from their work by downloading their package.

The data.table package in R provides an enhanced version of data.frame that allows you to do blazing fast data manipulations. The data.table package is being used in different fields such as finance and genomics which encounter large data sets (for example, 1GB to 100GB).

Why we are interested in learning data.table package among all these packages?

“The data.table package in a sense helped to ‘save’ R. At one point a few years ago, there were major concerns that R could not handle Big Data, with calls from some in favor of Python instead. The data.table package showed that R could do quite well in large datasets. This really aided in R’s being adopted by various large companies…” — Norm Matloff, Prof. of Computer Science, UC Davis

8.1 Install R Packages

There are over 120,000 packages in R. As you can imagine, it is not wise to install every package in R, which will take enormous computer storage space. In fact, when installing R, we will install the most essential packages, which are known as base R. For example, the functions learned so far are all from base R (e.g., data.frame, plot, loop, if …). Base R can perform the basic functions. But we can enhance the power of R by installing packages.

We can install R packages as needed through the following syntax:

install.packages("package_name") # download and install the package
library(package_name)            # load the package into memory

The packages are typically saved at the Comprehensive R Archive Network (CRAN); we do not need to worry about where the package is saved exactly on the Internet; R will take care of that automatically.

Let’s install the data.table package by running following code:

install.packages("data.table")  

As you will see, downloading and installing the package takes times, especially when the package is big. Thus, I will run this code just once to install the package. Every time when you reopen your R project, you just need to load the package into your memory through the following code:

library("data.table")
## Warning: package 'data.table' was built under R version 4.0.5
## 
## Attaching package: 'data.table'
## The following object is masked _by_ '.GlobalEnv':
## 
##     .N

Now that you have loaded the package into the computer memory, you can use all the powerful functions from the package.

On the right bottom corner of R-studio, you should be able to examine all the packages that are installed. The packages that are checked are the ones loaded into the memory.

As said, data.table is an enhanced venison of data.frame. Thus, recall what we did in the lecture about data.frame. We will learn

  • how to read data file into R as data.table.

  • how to manipulate data.table.

We will use the flights dataset for the purpose of illustration. The flights dataset contains flights information for all the flights departing from New York City airports in 2013. This dataset is from from the Bureau of Transporation Statistics.

8.2 fread(): Read csv file into R as data.table

Remember, in read.csv() reads a dataset into R as data.frame. In comparison, fread() from data.table package will read a csv file into R as a data.table.

# type ?fread to check its help document
?fread

Now, let’s read flights.csv into R using fread() function in data.table package.

flights=fread("data/flights.csv")

The flights.csv is 31.9MB, with 336,776 observation and 19 variables. This is not a small dataset. But as you can feel, fread() is blazing fast.

We can use the class() function to examine the type of the data.table (flights) we just created:

class(flights)
## [1] "data.table" "data.frame"

As seen, flights belongs to both ‘data.table’ and ‘data.frame’. As mentioned, this is because data.table is an extension (enhancement of data.frame). Therefore, all functions works on data.frame all works on data.table.

Now, let’s compare the performance of data.table and data.frame in terms reading data into R. Performance benchmark between different programs is very common to determine which one is better.

system.time( read.csv("data/flights.csv") )
##    user  system elapsed 
##    1.48    0.09    1.58
system.time( fread("data/flights.csv") )
##    user  system elapsed 
##    0.11    0.02    0.05

As seen, read.csv() spends 1.42 seconds to read the data; while fread() spends 0.06 seconds to read the data. fread() is 23 times faster (this will be different depending on the machine you used). The advantage will be even prominent if we are dealing with even big file.

8.2.1 Convert data.frame into data.table

You can convert an existing data.frame into a data.table using data.table() function.

tmp=read.csv("data/flights.csv") 
class(tmp)
## [1] "data.frame"
tmp=data.table(tmp)
class(tmp)
## [1] "data.table" "data.frame"

8.3 Quick summary of the data

In the excel file, we can easily “see” the data in the tabular format. With R-stuido, you can click “flights” on the top right corner to “see” the data. But more commonly, we will examine the data through a few quick summary:

Use head()/tail() to show the first/last 6 rows of the data.

head(flights)
##    year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1: 2013     1   1      517            515         2      830            819
## 2: 2013     1   1      533            529         4      850            830
## 3: 2013     1   1      542            540         2      923            850
## 4: 2013     1   1      544            545        -1     1004           1022
## 5: 2013     1   1      554            600        -6      812            837
## 6: 2013     1   1      554            558        -4      740            728
##    arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1:        11      UA   1545  N14228    EWR  IAH      227     1400    5     15
## 2:        20      UA   1714  N24211    LGA  IAH      227     1416    5     29
## 3:        33      AA   1141  N619AA    JFK  MIA      160     1089    5     40
## 4:       -18      B6    725  N804JB    JFK  BQN      183     1576    5     45
## 5:       -25      DL    461  N668DN    LGA  ATL      116      762    6      0
## 6:        12      UA   1696  N39463    EWR  ORD      150      719    5     58
##              time_hour
## 1: 2013-01-01 05:00:00
## 2: 2013-01-01 05:00:00
## 3: 2013-01-01 05:00:00
## 4: 2013-01-01 05:00:00
## 5: 2013-01-01 06:00:00
## 6: 2013-01-01 05:00:00
tail(flights)
##    year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1: 2013     9  30       NA           1842        NA       NA           2019
## 2: 2013     9  30       NA           1455        NA       NA           1634
## 3: 2013     9  30       NA           2200        NA       NA           2312
## 4: 2013     9  30       NA           1210        NA       NA           1330
## 5: 2013     9  30       NA           1159        NA       NA           1344
## 6: 2013     9  30       NA            840        NA       NA           1020
##    arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1:        NA      EV   5274  N740EV    LGA  BNA       NA      764   18     42
## 2:        NA      9E   3393    <NA>    JFK  DCA       NA      213   14     55
## 3:        NA      9E   3525    <NA>    LGA  SYR       NA      198   22      0
## 4:        NA      MQ   3461  N535MQ    LGA  BNA       NA      764   12     10
## 5:        NA      MQ   3572  N511MQ    LGA  CLE       NA      419   11     59
## 6:        NA      MQ   3531  N839MQ    LGA  RDU       NA      431    8     40
##              time_hour
## 1: 2013-09-30 18:00:00
## 2: 2013-09-30 14:00:00
## 3: 2013-09-30 22:00:00
## 4: 2013-09-30 12:00:00
## 5: 2013-09-30 11:00:00
## 6: 2013-09-30 08:00:00

Use summary() to get summary statistics for each variable.

summary(flights)
##       year          month             day           dep_time    sched_dep_time
##  Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1   Min.   : 106  
##  1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907   1st Qu.: 906  
##  Median :2013   Median : 7.000   Median :16.00   Median :1401   Median :1359  
##  Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349   Mean   :1344  
##  3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744   3rd Qu.:1729  
##  Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400   Max.   :2359  
##                                                  NA's   :8255                 
##    dep_delay          arr_time    sched_arr_time   arr_delay       
##  Min.   : -43.00   Min.   :   1   Min.   :   1   Min.   : -86.000  
##  1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1124   1st Qu.: -17.000  
##  Median :  -2.00   Median :1535   Median :1556   Median :  -5.000  
##  Mean   :  12.64   Mean   :1502   Mean   :1536   Mean   :   6.895  
##  3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1945   3rd Qu.:  14.000  
##  Max.   :1301.00   Max.   :2400   Max.   :2359   Max.   :1272.000  
##  NA's   :8255      NA's   :8713                  NA's   :9430      
##    carrier              flight       tailnum             origin         
##  Length:336776      Min.   :   1   Length:336776      Length:336776     
##  Class :character   1st Qu.: 553   Class :character   Class :character  
##  Mode  :character   Median :1496   Mode  :character   Mode  :character  
##                     Mean   :1972                                        
##                     3rd Qu.:3465                                        
##                     Max.   :8500                                        
##                                                                         
##      dest              air_time        distance         hour      
##  Length:336776      Min.   : 20.0   Min.   :  17   Min.   : 1.00  
##  Class :character   1st Qu.: 82.0   1st Qu.: 502   1st Qu.: 9.00  
##  Mode  :character   Median :129.0   Median : 872   Median :13.00  
##                     Mean   :150.7   Mean   :1040   Mean   :13.18  
##                     3rd Qu.:192.0   3rd Qu.:1389   3rd Qu.:17.00  
##                     Max.   :695.0   Max.   :4983   Max.   :23.00  
##                     NA's   :9430                                  
##      minute        time_hour                  
##  Min.   : 0.00   Min.   :2013-01-01 05:00:00  
##  1st Qu.: 8.00   1st Qu.:2013-04-04 13:00:00  
##  Median :29.00   Median :2013-07-03 10:00:00  
##  Mean   :26.23   Mean   :2013-07-03 05:02:36  
##  3rd Qu.:44.00   3rd Qu.:2013-10-01 07:00:00  
##  Max.   :59.00   Max.   :2013-12-31 23:00:00  
## 

str() is THE function for examining structure of a dataset.

str(flights)
## Classes 'data.table' and 'data.frame':   336776 obs. of  19 variables:
##  $ year          : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int  515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : int  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int  819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : int  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr  "UA" "UA" "AA" "B6" ...
##  $ flight        : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr  "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr  "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr  "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : int  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : int  1400 1416 1089 1576 762 719 1065 229 944 733 ...
##  $ hour          : int  5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : int  15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
##  - attr(*, ".internal.selfref")=<externalptr>

8.4 Data Manipulation with data.table

In many cases, we need to inquiry and subset a dataset. Structured Query Language (SQL) is designed for that purpose. data.table syntax is actually very similar to the SQL syntax.

data.table provides a simple, consistent, user-friendly syntax for data manipulation. The general form of data.table syntax is:

DT[i, j, by]

where

  • DT is a data.table.

  • by: grouped by what?

  • j: what to do?

  • i: on which rows?

We will apply this form to manipulate data: Filtering rows, selecting columns, aggregating, … # # Filtering Rows Here we show how to filter particular rows with data.table.

8.4.1 Filtering rows through logical expression

We can filter rows based on logical expression. E.g.,

# filtering flights with "JFK" as the origin in the month of June.
tmp <- flights[origin == "JFK" & month == 6] 
head(tmp)
##    year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1: 2013     6   1        2           2359         3      341            350
## 2: 2013     6   1      538            545        -7      925            922
## 3: 2013     6   1      539            540        -1      832            840
## 4: 2013     6   1      553            600        -7      700            711
## 5: 2013     6   1      554            600        -6      851            908
## 6: 2013     6   1      557            600        -3      934            942
##    arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1:        -9      B6    739  N618JB    JFK  PSE      200     1617   23     59
## 2:         3      B6    725  N806JB    JFK  BQN      203     1576    5     45
## 3:        -8      AA    701  N5EAAA    JFK  MIA      140     1089    5     40
## 4:       -11      EV   5716  N835AS    JFK  IAD       42      228    6      0
## 5:       -17      UA   1159  N33132    JFK  LAX      330     2475    6      0
## 6:        -8      B6    715  N766JB    JFK  SJU      198     1598    6      0
##              time_hour
## 1: 2013-06-01 23:00:00
## 2: 2013-06-01 05:00:00
## 3: 2013-06-01 05:00:00
## 4: 2013-06-01 06:00:00
## 5: 2013-06-01 06:00:00
## 6: 2013-06-01 06:00:00

As see, this is much clear and easy to read compared to the syntax under data.frame; and we do not need to write flights$origin, flights$month as we did in data.frame.

8.4.2 Filtering rows by row index

We can also filter rows by index. E.g.,

#  filtering the first two rows from flights.
flights[1:2]
##    year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1: 2013     1   1      517            515         2      830            819
## 2: 2013     1   1      533            529         4      850            830
##    arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1:        11      UA   1545  N14228    EWR  IAH      227     1400    5     15
## 2:        20      UA   1714  N24211    LGA  IAH      227     1416    5     29
##              time_hour
## 1: 2013-01-01 05:00:00
## 2: 2013-01-01 05:00:00

In data.table, .N is a special symbol that contains the number of rows of the data.table. E.g.,

# Return the last row of flights
flights[.N] 
##    year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1: 2013     9  30       NA            840        NA       NA           1020
##    arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1:        NA      MQ   3531  N839MQ    LGA  RDU       NA      431    8     40
##              time_hour
## 1: 2013-09-30 08:00:00
# same as flights[336776]
flights[336776]
##    year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1: 2013     9  30       NA            840        NA       NA           1020
##    arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1:        NA      MQ   3531  N839MQ    LGA  RDU       NA      431    8     40
##              time_hour
## 1: 2013-09-30 08:00:00

8.4.3 Fitering rows through %between%, %chin%, %like%

%between% allows you to search for numerical values in the closed interval [val1, val2]. The syntax is: numeric_col %between% c(val1, val2)

# filtering all flights with arr_delay between 120 to 180 minutes
tmp=flights[arr_delay%between% c(120, 180)]

head(tmp)
##    year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1: 2013     1   1      811            630       101     1047            830
## 2: 2013     1   1      957            733       144     1056            853
## 3: 2013     1   1     1114            900       134     1447           1222
## 4: 2013     1   1     1505           1310       115     1638           1431
## 5: 2013     1   1     1525           1340       105     1831           1626
## 6: 2013     1   1     1549           1445        64     1912           1656
##    arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1:       137      MQ   4576  N531MQ    LGA  CLT      118      544    6     30
## 2:       123      UA    856  N534UA    EWR  BOS       37      200    7     33
## 3:       145      UA   1086  N76502    LGA  IAH      248     1416    9      0
## 4:       127      EV   4497  N17984    EWR  RIC       63      277   13     10
## 5:       125      B6    525  N231JB    EWR  MCO      152      937   13     40
## 6:       136      EV   4181  N21197    EWR  MCI      234     1092   14     45
##              time_hour
## 1: 2013-01-01 06:00:00
## 2: 2013-01-01 07:00:00
## 3: 2013-01-01 09:00:00
## 4: 2013-01-01 13:00:00
## 5: 2013-01-01 13:00:00
## 6: 2013-01-01 14:00:00

%chin% is only for character vectors, it allows you to filter rows with character values in a set: character_col %chin% c(“val1”, “val2”, “val3”)

# Filtering flights with "JFK" or "LGA" as origin in the month of May
tmp <- flights[origin %chin% c("JFK","LGA") & month==5] 
head(tmp)
##    year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1: 2013     5   1        9           1655       434      308           2020
## 2: 2013     5   1      537            540        -3      836            840
## 3: 2013     5   1      544            545        -1      818            827
## 4: 2013     5   1      548            600       -12      831            854
## 5: 2013     5   1      549            600       -11      804            810
## 6: 2013     5   1      553            600        -7      700            712
##    arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1:       408      VX    413  N628VA    JFK  LAX      341     2475   16     55
## 2:        -4      AA    701  N5BYAA    JFK  MIA      144     1089    5     40
## 3:        -9      UA    450  N494UA    LGA  IAH      190     1416    5     45
## 4:       -23      B6    371  N523JB    LGA  FLL      140     1076    6      0
## 5:        -6      MQ   4650  N520MQ    LGA  ATL      113      762    6      0
## 6:       -12      EV   5747  N877AS    LGA  IAD       48      229    6      0
##              time_hour
## 1: 2013-05-01 16:00:00
## 2: 2013-05-01 05:00:00
## 3: 2013-05-01 05:00:00
## 4: 2013-05-01 06:00:00
## 5: 2013-05-01 06:00:00
## 6: 2013-05-01 06:00:00

%like% allows you to search for a pattern in a character or a factor vector. It is best to illustrate what pattern means with an example. E.g., the two characters “AA” and “AS” (acronym for American Airline and Alaska Airlines) are both started with a upper case U. This is called a pattern. In R, “^A” is used to denoted this pattern. We will talk more about pattern when we deal with strings.

# filter carriers whose acronym starts with A.
tmp <- flights[carrier %like% "^A"] 
head(tmp)
##    year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1: 2013     1   1      542            540         2      923            850
## 2: 2013     1   1      558            600        -2      753            745
## 3: 2013     1   1      559            600        -1      941            910
## 4: 2013     1   1      606            610        -4      858            910
## 5: 2013     1   1      623            610        13      920            915
## 6: 2013     1   1      628            630        -2     1137           1140
##    arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1:        33      AA   1141  N619AA    JFK  MIA      160     1089    5     40
## 2:         8      AA    301  N3ALAA    LGA  ORD      138      733    6      0
## 3:        31      AA    707  N3DUAA    LGA  DFW      257     1389    6      0
## 4:       -12      AA   1895  N633AA    EWR  MIA      152     1085    6     10
## 5:         5      AA   1837  N3EMAA    LGA  MIA      153     1096    6     10
## 6:        -3      AA    413  N3BAAA    JFK  SJU      192     1598    6     30
##              time_hour
## 1: 2013-01-01 05:00:00
## 2: 2013-01-01 06:00:00
## 3: 2013-01-01 06:00:00
## 4: 2013-01-01 06:00:00
## 5: 2013-01-01 06:00:00
## 6: 2013-01-01 06:00:00

8.5 Select Columns

The general form of data.table syntax is:

DT[i, j, by]

where

  • by: grouped by what?

  • j: what to do?

  • i: on which rows?

The second argument j is used to select (and compute on) columns.

8.5.1 Using column names to select columns

Because data.table is an enhanced data.frame, we can select columns by column names in accordance with data.frame.

# Select dep_delay and arr_delay column 
tmp <- flights[, c("dep_delay","arr_delay")]

# show the first 6 rows using head() function.
head(tmp)  
##    dep_delay arr_delay
## 1:         2        11
## 2:         4        20
## 3:         2        33
## 4:        -1       -18
## 5:        -6       -25
## 6:        -4        12
# create a scatter plot between these two variables.
plot(tmp)

The plot shows that there is a strong linear relationship between dep_delay and arr_delay, which essentially means: if you deptures late by x minutes, you are very likely to arrive later by x minutes.

Note that, as seen in flights[, c(“dep_delay”,“arr_delay”)], although we do not filter on rows, we need to include “,” to add a place holder for argument i. Otherwise, the code will generate an error msg.

We can also select column based on the column index, which, however, is not strongly not recommended.

flights[,c(6,9)]  # dep_delay and arr_delay are at 6,9th column

This is not recommended because if the order of columns changes (which is likely to happen if you are deleting or adding columns), the result will be wrong.

We can also deselecting columns: -c(“col1”, “col2”) deselects the columns col1 and col2; in other words, delete col1 and col2.

# disselect arr_delay and dep_delay column
flights[,-c("dep_delay","arr_delay")]

8.5.2 Select columns as variables - the data.table way

In data.table, each column is treated as a variable, thus, you can use a list of variables (column names) to select columns.

# Select the arr_delay and dep_delay column.
tmp <- flights[, list(dep_delay,arr_delay)]

The difference of this code and the above code is that dep_delay and arr_delay are not inside "". This is because dep_delay and arr_delay are treated as variable, rather column names.

.() is an alias to list(), for convenience. We can use .() to replace list(). This reduces typing and allows you to focus on the variables that are selected when reading the code.

tmp <- flights[, .(dep_delay,arr_delay)]

We can also rename the column while selecting the column. E.g., the following code select arr_delay and dep_delay and rename to arrive_delay and depart_delay

tmp <- flights[, .(arrive_delay = arr_delay, depart_delay = dep_delay)]
head(tmp)
##    arrive_delay depart_delay
## 1:           11            2
## 2:           20            4
## 3:           33            2
## 4:          -18           -1
## 5:          -25           -6
## 6:           12           -4

8.5.3 Computing on columns

Since columns in data.table can be referred to as variables, you can compute directly on them in j. We will find this is a quick and easy way to get summary statistics from the data.

E.g., you want to know the average arr_delay and dep_delay of all the fligths? in other words, you want to compute the mean of arr_delay and dep_delay.

# calcuate the mean arr_delay
flights[,mean(arr_delay)]
## [1] NA
flights[,mean(arr_delay, na.rm = TRUE)]
## [1] 6.895377

We can also rename the calculated columns:

# calcuate the mean arr_delay and dep_delay
flights[,.(mean(arr_delay, na.rm = TRUE), mean(dep_delay, na.rm = TRUE))]
##          V1       V2
## 1: 6.895377 12.63907

You can compute and name multiple variables at the same time:

# calcuate the mean arr_delay and dep_delay
flights[,.(avg_arr_delay=mean(arr_delay, na.rm = TRUE), avg_dep_delay=mean(dep_delay, na.rm = TRUE))]
##    avg_arr_delay avg_dep_delay
## 1:      6.895377      12.63907

8.5.4 Computing on rows and columns

Combining i and j is straightforward. E.g., you want to compute the average arr_deply and dep_deply for all flights origined from “JFK” in the month of Jun

flights[origin=="JFK" & month==6,.(avg_arr_delay=mean(arr_delay, na.rm = TRUE), avg_dep_delay=mean(dep_delay, na.rm = TRUE))]
##    avg_arr_delay avg_dep_delay
## 1:      17.59693      20.49973

Now, let break this code to understand how it works. We first subset in i to find matching row indices where origin airport equals “JFK”, and month equals 6; next we look at j and select the two columns and compute their mean(). Note that the code will return a data.table.

8.5.5 Special symbol .N in j

Remember .N denotes the number of rows. .N can be used in j as well. It is particularly useful to get the number of rows after filtering in i.

E.g., you want to compute how many flights are delayed (i.e., arr_delay + dep_delay>0).

flights[arr_delay + dep_delay>0, .N]
## [1] 135059

In the above code, the data.table is first filtered by arr_delay + dep_delay<=0, and then compute the number of selected rows.

8.5.6 Advanced Computation on columns

Because the columns can be treated as variable in data.table, thus we can perform complex calculation on these variable in j.

E.g., we want to compute the average speed of all flights, where speed is measured by distance/(airm_time/60) (miles per hour)?

flights[, .(miles_per_hr=mean( 60*distance/air_time , na.rm = TRUE))]
##    miles_per_hr
## 1:     394.2737

8.5.7 Sort Data by One/Multiple Columns

The data.table has very simple syntax to sort data based on one/multiple columns. E.g., sort flights first by origin in ascending order, and then by dest in descending order

# Use the order() function to sort by columns.
tmp <- flights[order(origin, -dest)]
head(tmp)
##    year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1: 2013     1   2      905            822        43     1313           1045
## 2: 2013     1   3      848            850        -2     1149           1113
## 3: 2013     1   4      901            850        11     1120           1113
## 4: 2013     1   6      843            848        -5     1053           1111
## 5: 2013     1   7      858            850         8     1105           1113
## 6: 2013     1   8      847            850        -3     1116           1113
##    arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1:        NA      EV   4140  N15912    EWR  XNA       NA     1131    8     22
## 2:        36      EV   4125  N21129    EWR  XNA      196     1131    8     50
## 3:         7      EV   4125  N16178    EWR  XNA      168     1131    8     50
## 4:       -18      EV   4625  N12172    EWR  XNA      174     1131    8     48
## 5:        -8      EV   4125  N13118    EWR  XNA      163     1131    8     50
## 6:         3      EV   4125  N14180    EWR  XNA      177     1131    8     50
##              time_hour
## 1: 2013-01-02 08:00:00
## 2: 2013-01-03 08:00:00
## 3: 2013-01-04 08:00:00
## 4: 2013-01-06 08:00:00
## 5: 2013-01-07 08:00:00
## 6: 2013-01-08 08:00:00

8.6 Calcuation by Group

We’ve already seen i and j from data.table’s general form. Here, we’ll see how they can be combined together with “by” to perform operations by group.

E.g., how can we get the number of flights corresponding to each origin airport?

flights[, .(count=.N), by = .(origin)]
##    origin  count
## 1:    EWR 120835
## 2:    LGA 104662
## 3:    JFK 111279

e.g., How can we get the total number of flights and average arrival delay and std deviation of arrial delay for each origin, dest pair for carrier code “AA”?

tmp<-flights[carrier == "AA", .(count=.N, avg_arr_delay=mean(arr_delay, na.rm = TRUE), sd_arr_delay=sd(arr_delay, na.rm = TRUE)), by = .(origin, dest)]
head(tmp)
##    origin dest count avg_arr_delay sd_arr_delay
## 1:    JFK  MIA  2221   -3.23122440     36.81734
## 2:    LGA  ORD  5694   -1.31396831     41.07306
## 3:    LGA  DFW  4836   -1.18901523     39.21697
## 4:    EWR  MIA  1068    0.06332703     54.78252
## 5:    LGA  MIA  3945   -1.64248971     37.45564
## 6:    JFK  SJU  1099   -0.77603687     37.28137

From this summary, we can see that AA on average arrive on time. However, the high std deviation means there is great variation in terms of the arrive delay.

How can we get the average arrival and departure delay for each origin, dest pair for each month for carrier code “AA”?

tmp=flights[carrier == "AA", .(avg_arr_delay=mean(arr_delay), avg_dep_delay=mean(dep_delay)), by = .(origin, dest, month)]


# plot the avg_arr_delay, avg_dep_delay and use color to mark different month
library(ggplot2)
ggplot(tmp,aes(avg_arr_delay, avg_dep_delay, color=factor(month)))+
  geom_point()
## Warning: Removed 151 rows containing missing values (geom_point).

The data.table summary and the plot together provide a great way to visualize which month are more likely to have flights delay.

Note that, the above code retain the original ordering of origin-dest pair. There are cases when preserving the original order is essential. However, if we want to automatically sort by the variables in our grouping, we can simply change by to “keyby”

# sort by origin, dest, month
tmp <- flights[carrier == "AA", .(avg_arr_delay=mean(arr_delay, na.rm = TRUE), avg_dep_delay=mean(dep_delay, na.rm = TRUE)), keyby = .(origin, dest, month)]
head(tmp)
##    origin dest month avg_arr_delay avg_dep_delay
## 1:    EWR  DFW     1      8.042945      9.587879
## 2:    EWR  DFW     2      8.222222     11.855172
## 3:    EWR  DFW     3      1.113095     16.035503
## 4:    EWR  DFW     4     13.067485     16.713415
## 5:    EWR  DFW     5     -6.158537     10.053571
## 6:    EWR  DFW     6      4.148148     18.153374

Notice that, the datatable is ordered based on variables in our grouping (origin-dest-month). This is easy for reader to examine the the pattern in our data in a tabular form.

8.7 Summary

  • R has currently 120,000 packages, which can enhance the power of base R. data.table is a enhanced verison of data.frame to handle big data.

  • Learn how to install and load R packages.

  • data.table syntax: DT[i, j, by]: by - grouped by what; j - what to do; i - on which rows.

  • use data.table syntax to manipulate the flights dataset: filtering rows, selecting columns; computing columns; aggregrating by group.