Chapter 10 Join Mutliple Datasets

It’s rare that data scientists work with only a single table of data during a project. Typically you have many tables of data, and you must combine them to answer the questions that you’re interested in. Collectively, multiple tables of data are called relational data because it is the relations, not just the individual datasets, that are important.

As an illustration, in addition to the flights dataset, we will use

  • “planes.csv”, which contains the information about the airplane for each flights in the “flights.csv” dataset

  • and “weather.csv” dataset, which contains the weather information for the time period when “flights.csv” is collected.

First thing first, we need to load the data.table into R.

library(data.table)

Let’s read these three data file into R using fread() and examine these datasets.

flights=fread("data/flights.csv")
str(flights)
## Classes 'data.table' and 'data.frame':   336776 obs. of  19 variables:
##  $ year          : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int  515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : int  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int  819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : int  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr  "UA" "UA" "AA" "B6" ...
##  $ flight        : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr  "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr  "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr  "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : int  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : int  1400 1416 1089 1576 762 719 1065 229 944 733 ...
##  $ hour          : int  5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : int  15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
##  - attr(*, ".internal.selfref")=<externalptr>
weather=fread("data/weather.csv")
str(weather)
## Classes 'data.table' and 'data.frame':   26115 obs. of  15 variables:
##  $ origin    : chr  "EWR" "EWR" "EWR" "EWR" ...
##  $ year      : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ hour      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ temp      : num  39 39 39 39.9 39 ...
##  $ dewp      : num  26.1 27 28 28 28 ...
##  $ humid     : num  59.4 61.6 64.4 62.2 64.4 ...
##  $ wind_dir  : int  270 250 240 250 260 240 240 250 260 260 ...
##  $ wind_speed: num  10.36 8.06 11.51 12.66 12.66 ...
##  $ wind_gust : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ precip    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ pressure  : num  1012 1012 1012 1012 1012 ...
##  $ visib     : num  10 10 10 10 10 10 10 10 10 10 ...
##  $ time_hour : POSIXct, format: "2013-01-01 01:00:00" "2013-01-01 02:00:00" ...
##  - attr(*, ".internal.selfref")=<externalptr>

Weather is uniquely identify by “origin, year, month, day, hour”; it records the weather at particular origin airport at particular hour of a day. Generally, each dataset should have a set of variables (also called key) to uniquely identify an observation.

planes=fread("data/planes.csv")
str(planes)
## Classes 'data.table' and 'data.frame':   3322 obs. of  9 variables:
##  $ tailnum     : chr  "N10156" "N102UW" "N103US" "N104UW" ...
##  $ year        : int  2004 1998 1999 1999 2002 1999 1999 1999 1999 1999 ...
##  $ type        : chr  "Fixed wing multi engine" "Fixed wing multi engine" "Fixed wing multi engine" "Fixed wing multi engine" ...
##  $ manufacturer: chr  "EMBRAER" "AIRBUS INDUSTRIE" "AIRBUS INDUSTRIE" "AIRBUS INDUSTRIE" ...
##  $ model       : chr  "EMB-145XR" "A320-214" "A320-214" "A320-214" ...
##  $ engines     : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ seats       : int  55 182 182 182 55 182 182 182 182 182 ...
##  $ speed       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ engine      : chr  "Turbo-fan" "Turbo-fan" "Turbo-fan" "Turbo-fan" ...
##  - attr(*, ".internal.selfref")=<externalptr>

The key of planes dataset is tailnum, the unique ID of a plane.

10.0.1 Join files using merge()

The merge() function also exists in data.frame to merge data.frames. data.table also implement this function so that we can use the merge() function to join two data.table.

Between we talk about the syntax of merge(), we need to understand the 4 different types of join.

  • inner join

  • full join

  • left join

  • right join

Don’t be overwhelmed! They are actually very similar. The general syntax of merge() is:

merge(x = dt1, y = dt2, by.x = “name”, by.y = “name”, all.x=FALSE, all.y=FALSE)

The 4 cases of all.x and all.y corresponds to 4 different types of join. Note that the default value of all.x and all.y are FALSE.

10.0.2 inner join

We will start with an example.

tmp=merge(flights, planes, by.x="tailnum", by.y="tailnum")
head(tmp)
##    tailnum year.x month day dep_time sched_dep_time dep_delay arr_time
## 1:  N10156   2013     1  10      626            630        -4      802
## 2:  N10156   2013     1  10     1120           1032        48     1320
## 3:  N10156   2013     1  10     1619           1540        39     1831
## 4:  N10156   2013     1  11      632            634        -2      810
## 5:  N10156   2013     1  11     1116           1120        -4     1328
## 6:  N10156   2013     1  11     1845           1819        26     1959
##    sched_arr_time arr_delay carrier flight origin dest air_time distance hour
## 1:            800         2      EV   4560    EWR  PIT       60      319    6
## 2:           1240        40      EV   4269    EWR  CHS       99      628   10
## 3:           1744        47      EV   4667    EWR  MSP      175     1008   15
## 4:            822       -12      EV   4334    EWR  CMH       81      463    6
## 5:           1336        -8      EV   4298    EWR  MCI      171     1092   11
## 6:           1932        27      EV   4520    EWR  PWM       49      284   18
##    minute           time_hour year.y                    type manufacturer
## 1:     30 2013-01-10 06:00:00   2004 Fixed wing multi engine      EMBRAER
## 2:     32 2013-01-10 10:00:00   2004 Fixed wing multi engine      EMBRAER
## 3:     40 2013-01-10 15:00:00   2004 Fixed wing multi engine      EMBRAER
## 4:     34 2013-01-11 06:00:00   2004 Fixed wing multi engine      EMBRAER
## 5:     20 2013-01-11 11:00:00   2004 Fixed wing multi engine      EMBRAER
## 6:     19 2013-01-11 18:00:00   2004 Fixed wing multi engine      EMBRAER
##        model engines seats speed    engine
## 1: EMB-145XR       2    55    NA Turbo-fan
## 2: EMB-145XR       2    55    NA Turbo-fan
## 3: EMB-145XR       2    55    NA Turbo-fan
## 4: EMB-145XR       2    55    NA Turbo-fan
## 5: EMB-145XR       2    55    NA Turbo-fan
## 6: EMB-145XR       2    55    NA Turbo-fan

This is the inner join: only observations with tailnum in both data.table are kept. The unmatched observation is not kept. You can check the dimension of the merged data.table. You will find that the number of observation in the merged data is smaller than both flights and planes. This is because unmatched rows will be removed.

dim(tmp)
## [1] 284170     27

Since “tailnum” has the same column name in both data.table, we can simplify the code as

tmp=merge(flights, planes, by="tailnum")

If you examine tmp carefully, you will find the new variable year.x and year.y. This is because year appears in both flights and planes dataset; when merge these two datasets, we need to differentiate which dataset the year variable is from. Based on argument position, flight is at x and plane is at y, thus R will automatically rename the variables of same name to mark their sources.

10.0.3 full join

Full join keeps all observations that are in either data.table. The code is exactly the same except letting all.x=TRUE and all.y=TRUE, meaning the merge will keep all observation from x and all observation from y.

tmp=merge(flights, planes, by="tailnum", all.x = TRUE, all.y = TRUE)

10.0.4 left join/right join

Left join adds information from the right data.table to the left data.table. E.g., we want to add the plane information to the flights dataset.

tmp=merge(flights, planes, by="tailnum", all.x = TRUE)

Since we want to add information to flights, therefore, we will keep all flights information; Thus, in the above code, we set all.x=TRUE.

Right join is essentially the same with left join; we can always switch the position of x and y data.table to achieve the same goal.

left/right join is the most common join in data analysis, because we constantly need to add information to the main dataset.

10.1 Merge dataset using data.table syntax

The data.table syntax for joining is as below:

DT2[DT1, on=.(name)]

The code will add information from DT2 to DT1, and the variable to match DT1 and DT2 is name. The default join in data.table is right join.

If the variable name are different in DT1 and DT2, we can modify the code as:

DT2[DT1, on=c(name2=“name1”)]

Let’s look at an example: add the planes information to the flights dataset.

tmp=planes[flights, on=.(tailnum)]
dim(tmp)
## [1] 336776     27
head(tmp)
##    tailnum year                    type manufacturer     model engines seats
## 1:  N14228 1999 Fixed wing multi engine       BOEING   737-824       2   149
## 2:  N24211 1998 Fixed wing multi engine       BOEING   737-824       2   149
## 3:  N619AA 1990 Fixed wing multi engine       BOEING   757-223       2   178
## 4:  N804JB 2012 Fixed wing multi engine       AIRBUS  A320-232       2   200
## 5:  N668DN 1991 Fixed wing multi engine       BOEING   757-232       2   178
## 6:  N39463 2012 Fixed wing multi engine       BOEING 737-924ER       2   191
##    speed    engine i.year month day dep_time sched_dep_time dep_delay arr_time
## 1:    NA Turbo-fan   2013     1   1      517            515         2      830
## 2:    NA Turbo-fan   2013     1   1      533            529         4      850
## 3:    NA Turbo-fan   2013     1   1      542            540         2      923
## 4:    NA Turbo-fan   2013     1   1      544            545        -1     1004
## 5:    NA Turbo-fan   2013     1   1      554            600        -6      812
## 6:    NA Turbo-fan   2013     1   1      554            558        -4      740
##    sched_arr_time arr_delay carrier flight origin dest air_time distance hour
## 1:            819        11      UA   1545    EWR  IAH      227     1400    5
## 2:            830        20      UA   1714    LGA  IAH      227     1416    5
## 3:            850        33      AA   1141    JFK  MIA      160     1089    5
## 4:           1022       -18      B6    725    JFK  BQN      183     1576    5
## 5:            837       -25      DL    461    LGA  ATL      116      762    6
## 6:            728        12      UA   1696    EWR  ORD      150      719    5
##    minute           time_hour
## 1:     15 2013-01-01 05:00:00
## 2:     29 2013-01-01 05:00:00
## 3:     40 2013-01-01 05:00:00
## 4:     45 2013-01-01 05:00:00
## 5:      0 2013-01-01 06:00:00
## 6:     58 2013-01-01 05:00:00

The dimension of the merged dataset indicates that all rows of flights dataset are kept. This is because we are adding plane information to the flights dataset. Be really careful about the where to put the main dataset (should be inside the [] because the default is right join).

Examing the merged dataset, you will find a new variable i.year, which indicates the year from flights dataset (because flights is in the i argument position).

For inner join, i.e., removing unmatched rows, we can set nomatch=0, as shown below:

tmp=planes[flights, on=.(tailnum), nomatch=0]
dim(tmp)
## [1] 284170     27

Full join is not possible with the data.table syntax, use the merge() function instead. You must be wondering why we need another way of joining data.table. This is because, with the data.table syntax, we can chain the inquiry like this DT[…][…][…].

10.2 Chaining Join and Data Manipulation

data.table expressions can be chained in sequence: DT[…][…][…]. This enables us to join multiple data.table and then manipulate the merged data.table for insights.

E.g., we want to compute the market share of airplane manufacturers in terms of the number flights using the airplanes made by the manufacturers. We need to first include the manufacturer information to the flights dataset and then compute the count statistics.

planes[flights, on=.(tailnum)][,.(count=.N), by=manufacturer]
##                      manufacturer count
##  1:                        BOEING 82912
##  2:                        AIRBUS 47302
##  3:              AIRBUS INDUSTRIE 40891
##  4:                      CANADAIR  1594
##  5:                          <NA> 52606
##  6: MCDONNELL DOUGLAS AIRCRAFT CO  8932
##  7:                       EMBRAER 66068
##  8: MCDONNELL DOUGLAS CORPORATION  1259
##  9:                BOMBARDIER INC 28272
## 10:             MCDONNELL DOUGLAS  3998
## 11:          GULFSTREAM AEROSPACE   499
## 12:                        CESSNA   658
## 13:            HURLEY JAMES LARRY    17
## 14:            CIRRUS DESIGN CORP   291
## 15:                         PIPER   162
## 16:                FRIEDEMANN JON    63
## 17:        ROBINSON HELICOPTER CO   286
## 18:                   PAIR MIKE E    25
## 19:                 BARKER JACK L   252
## 20:               LAMBERT RICHARD    54
## 21:                    MARZ BARRY    44
## 22:                  CANADAIR LTD   103
## 23:                         BEECH    47
## 24:         AMERICAN AIRCRAFT INC    42
## 25:               LEBLANC GLENN T    40
## 26:                  STEWART MACO    55
## 27:                   DEHAVILLAND    63
## 28:            AVIAT AIRCRAFT INC    18
## 29:                    AGUSTA SPA    32
## 30:                  KILDALL GARY    51
## 31:                          BELL    65
## 32:                   LEARJET INC    19
## 33:                       DOUGLAS    22
## 34:                      SIKORSKY    27
## 35:        AVIONS MARCEL DASSAULT     4
## 36:                   JOHN G HESS     3
##                      manufacturer count

E.g., we want to know the average seat avialable between each origin,dest pair, which shows how well the two nodes are connected. The seats information is in the planes dataset, thus we need to first add planes to the flights dataset.

tmp=planes[flights, on=.(tailnum)][,.(avg_seat=mean(seats,na.rm = TRUE)), by=.(origin, dest)]

head(tmp)
##    origin dest avg_seat
## 1:    EWR  IAH 190.9485
## 2:    LGA  IAH 170.4513
## 3:    JFK  MIA 180.9665
## 4:    JFK  BQN 196.6559
## 5:    LGA  ATL 148.1784
## 6:    EWR  ORD 176.6021

10.3 Exercise

Suppose you are developing a system that can check the weather condition at each flight. One important task is to add the weather information to the flights dataset. Note that the key of the weather dataset is: year, month, day, hour, origin; in other words, this set of variable uniquely identify an observation in weather dataset.

You can do that using merge() function:

tmp=merge(flights, weather, by=c("year","month","day","hour","origin"), all.x = TRUE)
head(tmp)
##    year month day hour origin dep_time sched_dep_time dep_delay arr_time
## 1: 2013     1   1    5    EWR      517            515         2      830
## 2: 2013     1   1    5    EWR      554            558        -4      740
## 3: 2013     1   1    5    JFK      542            540         2      923
## 4: 2013     1   1    5    JFK      544            545        -1     1004
## 5: 2013     1   1    5    JFK      559            559         0      702
## 6: 2013     1   1    5    LGA      533            529         4      850
##    sched_arr_time arr_delay carrier flight tailnum dest air_time distance
## 1:            819        11      UA   1545  N14228  IAH      227     1400
## 2:            728        12      UA   1696  N39463  ORD      150      719
## 3:            850        33      AA   1141  N619AA  MIA      160     1089
## 4:           1022       -18      B6    725  N804JB  BQN      183     1576
## 5:            706        -4      B6   1806  N708JB  BOS       44      187
## 6:            830        20      UA   1714  N24211  IAH      227     1416
##    minute         time_hour.x  temp  dewp humid wind_dir wind_speed wind_gust
## 1:     15 2013-01-01 05:00:00 39.02 28.04 64.43      260   12.65858        NA
## 2:     58 2013-01-01 05:00:00 39.02 28.04 64.43      260   12.65858        NA
## 3:     40 2013-01-01 05:00:00 39.02 26.96 61.63      260   14.96014        NA
## 4:     45 2013-01-01 05:00:00 39.02 26.96 61.63      260   14.96014        NA
## 5:     59 2013-01-01 05:00:00 39.02 26.96 61.63      260   14.96014        NA
## 6:     29 2013-01-01 05:00:00 39.92 24.98 54.81      250   14.96014  21.86482
##    precip pressure visib         time_hour.y
## 1:      0   1011.9    10 2013-01-01 05:00:00
## 2:      0   1011.9    10 2013-01-01 05:00:00
## 3:      0   1012.1    10 2013-01-01 05:00:00
## 4:      0   1012.1    10 2013-01-01 05:00:00
## 5:      0   1012.1    10 2013-01-01 05:00:00
## 6:      0   1011.4    10 2013-01-01 05:00:00

We can plot the dep_delay against the weather condition (e.g., humid) at the origin airport to see whether there is a relationship between these two

# load the ggplot2 package into R
library(ggplot2)  

# scatter plot with humid on x, dep_delay on y
ggplot(tmp, aes(humid,dep_delay, col=origin))+
  geom_point(position = "jitter", alpha=0.3)
## Warning: Removed 9800 rows containing missing values (geom_point).

What is your observation of the chart? 1) it seems to suggest a higher humidity is associated with a longer departure delay; 2) however, there also exist many departure delay in a non-humid weather, which will be attributed to other factors.

Or you can do that using the data.table syntax:

# note that the main data.table is inside []
tmp=weather[flights, on=.(year,month, day, hour, origin)]
dim(tmp)
## [1] 336776     29

Now, suppose a customer is inquiring the scheduled departure time and the weather condition for flights UA1714 for the date 2013-01-01. You will inquire the flight and return the temp, wind_speed, humid, and visib. Use the data.table syntax to chain the inquiry.

weather[flights, on=.(year,month, day, hour, origin)][
  carrier=="UA" & flight==1714 & month==1 & day==1,.(year,month, day, hour, origin, carrier, flight, temp, wind_speed, humid, visib)]
##    year month day hour origin carrier flight  temp wind_speed humid visib
## 1: 2013     1   1    5    LGA      UA   1714 39.92   14.96014 54.81    10

10.4 Summary

  • We reviewed the 4 types of join: inner join, full join, left/right join. The most commonly used is th left/right join because we need to add information from one dataset to the main dataset

  • We learn the merge() to join datasets: the syntax for adding dt2 to dt1 is: merge(dt1, dt2, by.x=“name1”, by.y=“name2”, all.x=TRUE)

  • We learn the data.table syntax for joining data.table: dt2[dt1, on=.(name)].

  • We learn to chain the join and computation with data.table syntax DT[…][…][…]

So far, we have learned how to manipulate big data with data.table package. What we have learned so far should cover you for most of the common data manipulation in a real job. I hope you are amazed and empowered by the data.table packages.