Chapter 16 Visualization with ggplot2 III

In many cases, we work with location data. The best way to visualize the location data is map. To draw maps in R, we first need to install the map-related packages.

install.packages("ggplot2") 
install.packages("maps")      # maps packages contains world maps data 
install.packages("mapproj")   # Converts latitude/longitude into projected coordinates
library(ggplot2)
library(maps)
## Warning: package 'maps' was built under R version 4.0.5
library(mapproj)
## Warning: package 'mapproj' was built under R version 4.0.5

We will also be using data.table and lubridate packages to clean data; and curl package for download data directly from website.

library(lubridate)
library(data.table)
library(curl)
library(ggplot2)

16.1 Draw world map

First, we get the world map data from the map package.

# map_data() is a function in maps package to load the map data
world_map=as.data.table(map_data("world"))

head(world_map)
##         long      lat group order region subregion
## 1: -69.89912 12.45200     1     1  Aruba      <NA>
## 2: -69.89571 12.42300     1     2  Aruba      <NA>
## 3: -69.94219 12.43853     1     3  Aruba      <NA>
## 4: -70.00415 12.50049     1     4  Aruba      <NA>
## 5: -70.06612 12.54697     1     5  Aruba      <NA>
## 6: -70.05088 12.59707     1     6  Aruba      <NA>

Second, we can use ggplot2 to plot the world map:

ggplot() + 
  geom_polygon(data=world_map, aes(long, lat, group=group), fill="gray",color="black")+
  geom_path()+ 
  coord_map(xlim=c(-180,180))

Next, we want to add information of each country on the map. For example, we want to show the number of confirmed COVID-19 cases.

The csv format data can be access through from the following website:

covid_world=fread("https://covid.ourworldindata.org/data/owid-covid-data.csv")

Let’s first take a look at the data:

covid_world=covid_world[,.(iso_code,continent,location,date,total_cases,new_cases,total_deaths,new_deaths,total_tests_per_thousand,people_vaccinated_per_hundred)]
head(covid_world)
##    iso_code continent    location       date total_cases new_cases total_deaths
## 1:      AFG      Asia Afghanistan 2020-02-24           1         1           NA
## 2:      AFG      Asia Afghanistan 2020-02-25           1         0           NA
## 3:      AFG      Asia Afghanistan 2020-02-26           1         0           NA
## 4:      AFG      Asia Afghanistan 2020-02-27           1         0           NA
## 5:      AFG      Asia Afghanistan 2020-02-28           1         0           NA
## 6:      AFG      Asia Afghanistan 2020-02-29           1         0           NA
##    new_deaths total_tests_per_thousand people_vaccinated_per_hundred
## 1:         NA                       NA                            NA
## 2:         NA                       NA                            NA
## 3:         NA                       NA                            NA
## 4:         NA                       NA                            NA
## 5:         NA                       NA                            NA
## 6:         NA                       NA                            NA
str(covid_world)
## Classes 'data.table' and 'data.frame':   114995 obs. of  10 variables:
##  $ iso_code                     : chr  "AFG" "AFG" "AFG" "AFG" ...
##  $ continent                    : chr  "Asia" "Asia" "Asia" "Asia" ...
##  $ location                     : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ date                         : IDate, format: "2020-02-24" "2020-02-25" ...
##  $ total_cases                  : num  1 1 1 1 1 1 1 1 2 4 ...
##  $ new_cases                    : num  1 0 0 0 0 0 0 0 1 2 ...
##  $ total_deaths                 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_deaths                   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ total_tests_per_thousand     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ people_vaccinated_per_hundred: num  NA NA NA NA NA NA NA NA NA NA ...
##  - attr(*, ".internal.selfref")=<externalptr>
summary(covid_world)
##    iso_code          continent           location              date           
##  Length:114995      Length:114995      Length:114995      Min.   :2020-01-01  
##  Class :character   Class :character   Class :character   1st Qu.:2020-07-25  
##  Mode  :character   Mode  :character   Mode  :character   Median :2020-12-18  
##                                                           Mean   :2020-12-11  
##                                                           3rd Qu.:2021-05-02  
##                                                           Max.   :2021-09-06  
##                                                                               
##   total_cases          new_cases       total_deaths       new_deaths   
##  Min.   :        1   Min.   :-74347   Min.   :      1   Min.   :-1918  
##  1st Qu.:     1776   1st Qu.:     2   1st Qu.:     63   1st Qu.:    0  
##  Median :    17999   Median :    86   Median :    542   Median :    2  
##  Mean   :  1323061   Mean   :  6389   Mean   :  34173   Mean   :  146  
##  3rd Qu.:   192856   3rd Qu.:   892   3rd Qu.:   4713   3rd Qu.:   18  
##  Max.   :221211222   Max.   :905932   Max.   :4576372   Max.   :17977  
##  NA's   :5536        NA's   :5539     NA's   :16134     NA's   :15979  
##  total_tests_per_thousand people_vaccinated_per_hundred
##  Min.   :    0.00         Min.   :  0.00               
##  1st Qu.:   18.03         1st Qu.:  3.30               
##  Median :   91.11         Median : 15.11               
##  Mean   :  401.32         Mean   : 23.45               
##  3rd Qu.:  375.29         3rd Qu.: 40.29               
##  Max.   :13230.12         Max.   :117.71               
##  NA's   :65196            NA's   :91556

We need to clean the date into the correct format before any analysis: Examine extreme values

covid_world[new_cases<0 | new_cases>100000]
##       iso_code     continent            location       date total_cases
##    1:      ATG North America Antigua and Barbuda 2020-07-03          68
##    2:      ATG North America Antigua and Barbuda 2021-05-09        1231
##    3: OWID_ASI                              Asia 2020-08-26     6604021
##    4: OWID_ASI                              Asia 2020-08-27     6707611
##    5: OWID_ASI                              Asia 2020-08-28     6811518
##   ---                                                                  
## 1655: OWID_WRL                             World 2021-09-04   220348745
## 1656: OWID_WRL                             World 2021-09-05   220775393
## 1657: OWID_WRL                             World 2021-09-06   221211222
## 1658:      YEM          Asia               Yemen 2020-08-11        1831
## 1659:      ZWE        Africa            Zimbabwe 2020-05-02          34
##       new_cases total_deaths new_deaths total_tests_per_thousand
##    1:        -1            3          0                       NA
##    2:        -1           32          0                       NA
##    3:    112974       135017       1682                       NA
##    4:    103590       136658       1641                       NA
##    5:    103907       138253       1595                       NA
##   ---                                                           
## 1655:    480942      4561996       7736                       NA
## 1656:    426648      4568456       6460                       NA
## 1657:    435829      4576372       7916                       NA
## 1658:        -1          523          5                       NA
## 1659:        -6            4          0                       NA
##       people_vaccinated_per_hundred
##    1:                            NA
##    2:                            NA
##    3:                            NA
##    4:                            NA
##    5:                            NA
##   ---                              
## 1655:                         40.33
## 1656:                         40.47
## 1657:                         40.61
## 1658:                            NA
## 1659:                            NA

Clean up the data:

covid_world[,date:=ymd(date)]

# in the world map data, United States is named USA
covid_world[location=="United States",location:="USA"]  

# clean up the negative new cases as missing value
covid_world[new_cases<0,new_cases:=NA]  

Use a line chart to show the number of new cases over time:

ggplot(covid_world)+
  geom_line(aes(date, new_cases, color=location),show.legend = FALSE)+
  theme_classic()
## Warning: Removed 5538 row(s) containing missing values (geom_path).

Use a line chart to show the number of new cases over time:

ggplot(covid_world[location%in%c("United Kingdom","USA","Italy","Brazil","Korean","China")])+
  geom_line(aes(date, new_cases, color=location))+
  theme_classic()
## Warning: Removed 2 row(s) containing missing values (geom_path).

To plot the map of new cases in each day, we need to first prepare the map data to include the COVID-19 cases informatin.

# define any date you want to inquiry
today=ymd("2020-06-19") 

# subset the covid dataset to select the above date and relevant columns
covid_today=covid_world[date==today,.(location,continent,date,total_cases,new_cases, total_deaths, total_tests_per_thousand)]

# combine the map data file
my_map = merge(world_map,covid_today, by.x="region", by.y ="location", all.x = TRUE)

head(my_map)
##         region     long      lat group order subregion continent       date
## 1: Afghanistan 74.89131 37.23164     2    12      <NA>      Asia 2020-06-19
## 2: Afghanistan 74.84023 37.22505     2    13      <NA>      Asia 2020-06-19
## 3: Afghanistan 74.76738 37.24917     2    14      <NA>      Asia 2020-06-19
## 4: Afghanistan 74.73896 37.28564     2    15      <NA>      Asia 2020-06-19
## 5: Afghanistan 74.72666 37.29072     2    16      <NA>      Asia 2020-06-19
## 6: Afghanistan 74.66895 37.26670     2    17      <NA>      Asia 2020-06-19
##    total_cases new_cases total_deaths total_tests_per_thousand
## 1:       27882       346          552                       NA
## 2:       27882       346          552                       NA
## 3:       27882       346          552                       NA
## 4:       27882       346          552                       NA
## 5:       27882       346          552                       NA
## 6:       27882       346          552                       NA
# It is vital that the map data is ordered by the order column. Otherwise, the map will be messed up. 
my_map[order(order)]
##         region      long      lat group  order subregion continent       date
##     1:   Aruba -69.89912 12.45200     1      1      <NA>      <NA>       <NA>
##     2:   Aruba -69.89571 12.42300     1      2      <NA>      <NA>       <NA>
##     3:   Aruba -69.94219 12.43853     1      3      <NA>      <NA>       <NA>
##     4:   Aruba -70.00415 12.50049     1      4      <NA>      <NA>       <NA>
##     5:   Aruba -70.06612 12.54697     1      5      <NA>      <NA>       <NA>
##    ---                                                                       
## 99334: Vatican  12.43838 41.90620  1627 100960   enclave    Europe 2020-06-19
## 99335: Vatican  12.43057 41.90547  1627 100961   enclave    Europe 2020-06-19
## 99336: Vatican  12.42754 41.90073  1627 100962   enclave    Europe 2020-06-19
## 99337: Vatican  12.43057 41.89756  1627 100963   enclave    Europe 2020-06-19
## 99338: Vatican  12.43916 41.89839  1627 100964   enclave    Europe 2020-06-19
##        total_cases new_cases total_deaths total_tests_per_thousand
##     1:          NA        NA           NA                       NA
##     2:          NA        NA           NA                       NA
##     3:          NA        NA           NA                       NA
##     4:          NA        NA           NA                       NA
##     5:          NA        NA           NA                       NA
##    ---                                                            
## 99334:          12         0           NA                       NA
## 99335:          12         0           NA                       NA
## 99336:          12         0           NA                       NA
## 99337:          12         0           NA                       NA
## 99338:          12         0           NA                       NA

Now, we are ready to plot the data:

g<-ggplot(my_map, aes(x=long,y=lat,group=group))+
  geom_polygon(aes(fill=new_cases))+
  geom_path()+ 
  coord_map(xlim=c(-180,180))+
  labs(x = NULL, y = NULL, title = paste("New confirmed cases worldwide: ", today),fill="New Cases")

g

The default fill color scheme is not quite differentiable. We can use scale_fill_gradient() function to creates a two colour gradient (low-high):

g+ scale_fill_gradient(low="white",high="red",na.value="grey90")

In the above code, we set color scheme of “white” representing low quantity, “red” representing high quantity, the missing value is represented by grey.

Here is another commonly used color scheme. In the code below, heat.colors(10) generate a heat color scheme from white, yellow to red with 10 steps in between. rev() reserves this color scheme because we want to use red to represent high quantity, and yellow to represent low quantity.

g+ scale_fill_gradientn(colours=rev(heat.colors(10)),na.value="grey90")

Now, the map looks OK, but we want to clean the extra grid and axis labels. We can do that by change the theme of the plot. Below is a reusable map theme that is readily to use.

Define the theme of map:

theme_map <- function(...) {
  theme_minimal() +
    theme(
      axis.line = element_blank(),
      axis.text.x = element_blank(),
      axis.text.y = element_blank(),
      axis.ticks = element_blank(),
      axis.title.x = element_blank(),
      axis.title.y = element_blank(),
      panel.grid.minor = element_blank(),
      panel.grid.major = element_blank(),
      plot.background = element_rect(fill = "white", color = NA), 
      panel.background = element_rect(fill = "gray99", color = NA), 
      legend.background = element_rect(fill = "white", color = NA),
      panel.border = element_blank(),
      legend.position = "right",
      ...
    )
}

Apply the self-defined map theme:

g + scale_fill_gradientn(colours=rev(heat.colors(10)),na.value="grey90")+
  theme_map()          

Now, let’s plot the total cases on the map:

ggplot(my_map, aes(x=long,y=lat,group=group))+
  geom_polygon(aes(fill=total_cases))+
  geom_path()+ 
  coord_map(xlim=c(-180,180))+
  labs(x = NULL, y = NULL, title = paste("Total cases worldwide: ", today), fill="Total Cases")+
  scale_fill_gradientn(colours=rev(heat.colors(10)),na.value="grey90")+
  theme_map()

ggsave("data/covid_total_cases.png") # save the chart into local file
## Saving 7 x 5 in image

16.2 Draw US map

In most of the situation, we focus on US as well as states in US. Here we show how to create US/States map. First, we need to load the US map data.

states_map=as.data.table(map_data("state"))
head(states_map)
##         long      lat group order  region subregion
## 1: -87.46201 30.38968     1     1 alabama      <NA>
## 2: -87.48493 30.37249     1     2 alabama      <NA>
## 3: -87.52503 30.37249     1     3 alabama      <NA>
## 4: -87.53076 30.33239     1     4 alabama      <NA>
## 5: -87.57087 30.32665     1     5 alabama      <NA>
## 6: -87.58806 30.32665     1     6 alabama      <NA>

Next, we can plot the US map.

ggplot(data=states_map)+
  geom_polygon(aes(x=long, y=lat, group=group), color="black", fill="gray90") +
  coord_map() +
  theme_map()

We can add the information of each state on the map. E.g., we can fill each state to a color corresponding to its number of new COVID-19 confirmed cases.

First, we need to download the COVID-19 data for each state from the web.

# the data is in the following web-page
url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv'

covid_US=fread(url)

Examine the data through head():

head(covid_US)
##          date      state fips cases deaths
## 1: 2020-01-21 Washington   53     1      0
## 2: 2020-01-22 Washington   53     1      0
## 3: 2020-01-23 Washington   53     1      0
## 4: 2020-01-24   Illinois   17     1      0
## 5: 2020-01-24 Washington   53     1      0
## 6: 2020-01-25 California    6     1      0

Clean the data before any analysis. First, in map_state, the state name is lower case; here state name has upper case, we need to make it also lower case so that we can merge the two datasets. Second, we need to convert date from string to date.

covid_US[,date:=ymd(date)]
covid_US[,state:=tolower(state)]
head(covid_US)
##          date      state fips cases deaths
## 1: 2020-01-21 washington   53     1      0
## 2: 2020-01-22 washington   53     1      0
## 3: 2020-01-23 washington   53     1      0
## 4: 2020-01-24   illinois   17     1      0
## 5: 2020-01-24 washington   53     1      0
## 6: 2020-01-25 california    6     1      0

Merge the state map data with the covid-19 data.

select_date="2020-07-04"
covid_US_today=covid_US[date==select_date]
  
map_covid_state = merge(states_map,covid_US_today, by.x="region", by.y="state", all.x = TRUE)

head(map_covid_state)
##     region      long      lat group order subregion       date fips cases
## 1: alabama -87.46201 30.38968     1     1      <NA> 2020-07-04    1 42862
## 2: alabama -87.48493 30.37249     1     2      <NA> 2020-07-04    1 42862
## 3: alabama -87.52503 30.37249     1     3      <NA> 2020-07-04    1 42862
## 4: alabama -87.53076 30.33239     1     4      <NA> 2020-07-04    1 42862
## 5: alabama -87.57087 30.32665     1     5      <NA> 2020-07-04    1 42862
## 6: alabama -87.58806 30.32665     1     6      <NA> 2020-07-04    1 42862
##    deaths
## 1:   1007
## 2:   1007
## 3:   1007
## 4:   1007
## 5:   1007
## 6:   1007

Next, we can plot the US map.

ggplot(map_covid_state)+
  geom_polygon(aes(x=long, y=lat, group=group, fill=cases), color="black") +
  coord_map() +
  scale_fill_gradientn(colours=rev(heat.colors(10)),na.value="grey90")+
  theme_map()+
  labs(title=paste("Total confimed cases in US: ", select_date, sep=""), fill="New Cases")

ggsave("data/covid_us_map.png")
## Saving 7 x 5 in image

Sometime, we want to add text information on the map:

ggplot(map_covid_state)+
  geom_polygon(aes(x=long, y=lat, group=group, fill=cases), color="black") +
  coord_map() +
  scale_fill_gradientn(colours=rev(heat.colors(10)),na.value="grey90")+
  theme_map()+
  labs(title=paste("Total confimed cases in US: ", select_date, sep=""), fill="New Cases")

ggsave("data/covid_us_map.png")
## Saving 7 x 5 in image

16.3 draw US county map

Sometime, we only want to zoom in to look at a subset of states at the county level. Here we show how to draw US county map.

First, we need to extract the US county map data:

county_map=as.data.table(map_data("county"))
head(county_map)
##         long      lat group order  region subregion
## 1: -86.50517 32.34920     1     1 alabama   autauga
## 2: -86.53382 32.35493     1     2 alabama   autauga
## 3: -86.54527 32.36639     1     3 alabama   autauga
## 4: -86.55673 32.37785     1     4 alabama   autauga
## 5: -86.57966 32.38357     1     5 alabama   autauga
## 6: -86.59111 32.37785     1     6 alabama   autauga

Next, we can draw US county map:

ggplot(data=county_map)+
  geom_polygon(aes(x=long, y=lat, group=group), color="gray")+
  coord_map() +
  theme_map()+
  labs(title="US county map")

Similarly, we can include county information to the map. Take covid-19 as an example:

First, we need to download the data from the following address:

covid_county=fread("https://github.com/nytimes/covid-19-data/raw/master/us-counties.csv")
head(covid_county)
##          date    county      state  fips cases deaths
## 1: 2020-01-21 Snohomish Washington 53061     1      0
## 2: 2020-01-22 Snohomish Washington 53061     1      0
## 3: 2020-01-23 Snohomish Washington 53061     1      0
## 4: 2020-01-24      Cook   Illinois 17031     1      0
## 5: 2020-01-24 Snohomish Washington 53061     1      0
## 6: 2020-01-25    Orange California  6059     1      0

Clean up the data for analysis:

covid_county[,date:=ymd(date)]
covid_county[,county:=tolower(county)]
covid_county[,state:=tolower(state)]
head(covid_county)
##          date    county      state  fips cases deaths
## 1: 2020-01-21 snohomish washington 53061     1      0
## 2: 2020-01-22 snohomish washington 53061     1      0
## 3: 2020-01-23 snohomish washington 53061     1      0
## 4: 2020-01-24      cook   illinois 17031     1      0
## 5: 2020-01-24 snohomish washington 53061     1      0
## 6: 2020-01-25    orange california  6059     1      0

Merge with the map data

# choose the date to map
select_date="06-21-2020"
covid_county2<-covid_county[date==mdy(select_date)]

map_covid_county = merge(county_map,covid_county2, by.x=c("region","subregion"), by.y=c("state","county"), all.x = TRUE)

head(map_covid_county)
##     region subregion      long      lat group order       date fips cases
## 1: alabama   autauga -86.50517 32.34920     1     1 2020-06-21 1001   434
## 2: alabama   autauga -86.53382 32.35493     1     2 2020-06-21 1001   434
## 3: alabama   autauga -86.54527 32.36639     1     3 2020-06-21 1001   434
## 4: alabama   autauga -86.55673 32.37785     1     4 2020-06-21 1001   434
## 5: alabama   autauga -86.57966 32.38357     1     5 2020-06-21 1001   434
## 6: alabama   autauga -86.59111 32.37785     1     6 2020-06-21 1001   434
##    deaths
## 1:      9
## 2:      9
## 3:      9
## 4:      9
## 5:      9
## 6:      9

Next, we can draw US county map:

ggplot(map_covid_county)+
  geom_polygon(aes(x=long, y=lat, fill=cases, group=group), color="gray80")+
  coord_map() +
  scale_fill_gradientn(colours=rev(heat.colors(10)),na.value="grey90")+
  theme_map()+
  labs(title=paste("Total confirmed cases in US: ", select_date), fill="total cases")

The county map is mostly useful when we want to zoom in to look at particular state.

select_state=c("nebraska")
ggplot(map_covid_county[region%in%select_state])+
  geom_polygon(aes(x=long, y=lat, fill=cases, group=group), color="gray80")+
  coord_map() +
  scale_fill_gradientn(colours=rev(heat.colors(10)),na.value="grey90")+
  theme_map()+
  labs(title=paste("Total confirmed cases in ", select_state, " : ", select_date), fill="Total Cases")

ggsave("data/covid_county_map.png")
## Saving 7 x 5 in image

16.4 Summary

  • learn to make maps with ggplot2: world map, US state maps, US county maps

  • learn to include information about on the map.

  • apply data.table to join data.table.

  • Use theme to adjust the appearance of the map