Chapter 4 Data Frame

A data frame is more general than a matrix in that different columns can contain different modes of data (numeric, character, factor, logical etc.). We can think of a data frame as a excel sheet where each row represents an observation, while each column represents a variable associated with the observation. Data frames are the basic data structure you’ll deal with in R.

4.1 Define a data.frame manually

Our first exercise is to define a data frame manually to help you familize the functions related to data frame.

Suppose we are building a database of diabetes patients. For simplicity, suppose we have only four patients, with the following patient information:

patientID = c(1, 2, 3, 4) 
age = c(25, 34, 28, 52) 
diabetes = c("Type1", "Type2", "Type1", "Type1")  
status = c("Poor", "Median", "Good", "Poor")

We want to combine these information into a data frame. The function to construct data fram is as below: mydata = data.frame(col1, col2, col3,…) where col1, col2, col3, … are the column of the data frame. Names for each column can be provided with the names function. The following code makes this clear.

# this create an empty data frame. 
# We sometime need to do that when we do not know the ultimate dimension of the data frame.
patientdata = data.frame()  

patientdata = data.frame(patientID, age, diabetes, status)

# display the data frame on the screen to visually check the dataset
patientdata
##   patientID age diabetes status
## 1         1  25    Type1   Poor
## 2         2  34    Type2 Median
## 3         3  28    Type1   Good
## 4         4  52    Type1   Poor

We can use str() to examine the structure of the data frame. str() is a very useful function to examine an object in R. We will use this function often.

str(patientdata)
## 'data.frame':    4 obs. of  4 variables:
##  $ patientID: num  1 2 3 4
##  $ age      : num  25 34 28 52
##  $ diabetes : chr  "Type1" "Type2" "Type1" "Type1"
##  $ status   : chr  "Poor" "Median" "Good" "Poor"

Note that each column of a data frame must have have same data mode, but you can put columns of different modes together to form the data frame. Because data frames are close to what analysts typically think of as datasets, we’ll use the terms columns and variables interchangeably when discussing data frames.

4.2 Subsetting data.frame

There are several ways to identify the elements of a data frame. You can use the index notation you used before (for example, with matrices) or you can specify column names.

Subset the first 2 column of the patientdata by index:

 patientdata[,1:2]
##   patientID age
## 1         1  25
## 2         2  34
## 3         3  28
## 4         4  52

Subset the first two columns of the data frame by variable names:

patientdata[c("diabetes", "status")]
##   diabetes status
## 1    Type1   Poor
## 2    Type2 Median
## 3    Type1   Good
## 4    Type1   Poor

Subset one specific column (e.g., age) by its name:

patientdata$age
## [1] 25 34 28 52

Now, change diabetes and status into factors because they are categorical variables:

patientdata$diabetes<-factor(patientdata$diabetes)
patientdata$status<-factor(patientdata$status, ordered = TRUE, levels = c("Poor", "Median","Good") )

patientdata
##   patientID age diabetes status
## 1         1  25    Type1   Poor
## 2         2  34    Type2 Median
## 3         3  28    Type1   Good
## 4         4  52    Type1   Poor
str(patientdata)
## 'data.frame':    4 obs. of  4 variables:
##  $ patientID: num  1 2 3 4
##  $ age      : num  25 34 28 52
##  $ diabetes : Factor w/ 2 levels "Type1","Type2": 1 2 1 1
##  $ status   : Ord.factor w/ 3 levels "Poor"<"Median"<..: 1 2 3 1

4.3 Import csv data as data.frame.

In most application, we will import dataset (i.e., csv, txt, excel files) from local computer into R as data frame. Here we will learn how to read data into R as data frame.

The famous Auto.csv is a very standard dataset for R programming. The Auto.csv contains the basic information of over 300 kinds of Auto. We will learn to import this dataset.

I have uploaded the Auto.csv file in the R-studio cloud to aviod the trouble of finding the file in each individuals’ computer. Look at the files in the bottom right pannel. The file is in csv format, thus we will use read.csv() function.

Auto=read.csv("data/Auto.csv")
# examine the structure of Auto dataset
str(Auto)
## 'data.frame':    392 obs. of  10 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ rating      : int  3 1 3 2 1 2 1 2 1 1 ...
##  $ origin      : chr  "US" "US" "US" "US" ...
##  $ name        : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...

Unlikely excel which shows data in a tabular format, R uses different ways for us to visually examine the data. We can use head() and tail() to examine first/last 6 rows of the data.

# print the first 6 rows for visually examining the dataset
head(Auto)  
##   mpg cylinders displacement horsepower weight acceleration year rating origin
## 1  18         8          307        130   3504         12.0   70      3     US
## 2  15         8          350        165   3693         11.5   70      1     US
## 3  18         8          318        150   3436         11.0   70      3     US
## 4  16         8          304        150   3433         12.0   70      2     US
## 5  17         8          302        140   3449         10.5   70      1     US
## 6  15         8          429        198   4341         10.0   70      2     US
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500
tail(Auto)
##     mpg cylinders displacement horsepower weight acceleration year rating
## 387  27         4          151         90   2950         17.3   82      2
## 388  27         4          140         86   2790         15.6   82      3
## 389  44         4           97         52   2130         24.6   82      3
## 390  32         4          135         84   2295         11.6   82      3
## 391  28         4          120         79   2625         18.6   82      2
## 392  31         4          119         82   2720         19.4   82      4
##     origin             name
## 387     US chevrolet camaro
## 388     US  ford mustang gl
## 389     EU        vw pickup
## 390     US    dodge rampage
## 391     US      ford ranger
## 392     US       chevy s-10

We can use summary() function to quickly examine the distribution of each variable.

summary(Auto)
##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##   acceleration        year           rating        origin         
##  Min.   : 8.00   Min.   :70.00   Min.   :1.00   Length:392        
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:2.00   Class :character  
##  Median :15.50   Median :76.00   Median :2.00   Mode  :character  
##  Mean   :15.54   Mean   :75.98   Mean   :2.48                     
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:3.00                     
##  Max.   :24.80   Max.   :82.00   Max.   :4.00                     
##      name          
##  Length:392        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

read.csv() is the function to read csv file into R as a data.frame. When using the function, you need to tell exactly where to find the data file. For example, if Auto.csv is saved in my computer at: “C:/Users/lxu25/Dropbox/Teaching/UNL Courses/SCMA 450/code/Auto.csv”. We can use the following code to read the file:

Auto=read.csv("C:/Users/lxu25/Dropbox/Teaching/UNL Courses/SCMA 450/code/Auto.csv") 

Notice, in the the file directory is “/”, not ". However, locating this data file can be a hassle, especially when your code is shared with others in your team because the data has a totally different directory in others’s computer. Therefore, it is the best practice to always create a seperate folder for your project and save both code and data in that folder. Since your data is in the same folder of your code, we can simply use the filename to locate the data file. Then share the folder together to others.

4.4 Subsetting data frame

Select by rows by index.

# type Auto[2,] to select the 2nd row (or called 2nd observation)
Auto[2,]  
##   mpg cylinders displacement horsepower weight acceleration year rating origin
## 2  15         8          350        165   3693         11.5   70      1     US
##                name
## 2 buick skylark 320
# type Auto[c(3,5,7),] to select the 3rd, 5th, and 7th rows.
Auto[c(3,5,7),]
##   mpg cylinders displacement horsepower weight acceleration year rating origin
## 3  18         8          318        150   3436         11.0   70      3     US
## 5  17         8          302        140   3449         10.5   70      1     US
## 7  14         8          454        220   4354          9.0   70      1     US
##                 name
## 3 plymouth satellite
## 5        ford torino
## 7   chevrolet impala

Select column by index.

# type Auto[,4] to select the 4th column (i.e., horsepower); 
head(Auto[,4])
## [1] 130 165 150 150 140 198
# tpye hist(Auto[,4]) to plot the distribution of horsepower
hist(Auto[,4])

# You can customize the histgram to make it look nicer
hist(Auto[,4],xlab="horsepower", col="gray",breaks = 50, main=
       "frequency of horsepower")  

hist() plot the histogram of a variable. You can use ?hist to examine the detail of the function.

?hist  # a quick way to look at how to use this function of 

We can select multiple columns (variables) at the same time.

# select the 2nd and 4th column (i.e., cylinders and horsepower )
head(Auto[,c(2,4)])
##   cylinders horsepower
## 1         8        130
## 2         8        165
## 3         8        150
## 4         8        150
## 5         8        140
## 6         8        198

We can create a scatter plot between these two variables to see their relationship. This also shows the reason why want to subset data frame.

plot(Auto[,c(2,4)])

The scatter plot clearly demonstrates that a higher horsepower is assoicated with a larger number of cylinders.

We can also subset the dataset by variable name. We need to use $ for subsetting one variable with its variable name.

# Auto$mpg selects the variable mpg
# type hist[Auto$mpg] to get its distribution
hist(Auto$acceleration)

# select multiple columns by name.
head( Auto[,c("horsepower","mpg","weight")]  )
##   horsepower mpg weight
## 1        130  18   3504
## 2        165  15   3693
## 3        150  18   3436
## 4        150  16   3433
## 5        140  17   3449
## 6        198  15   4341

Filter rows by condition:

# select all rows with mpg less equal to 12
Auto[Auto$mpg<=12,]
##     mpg cylinders displacement horsepower weight acceleration year rating
## 26   10         8          360        215   4615         14.0   70      2
## 27   10         8          307        200   4376         15.0   70      2
## 28   11         8          318        210   4382         13.5   70      2
## 29    9         8          304        193   4732         18.5   70      2
## 42   12         8          383        180   4955         11.5   71      1
## 67   11         8          429        208   4633         11.0   72      2
## 69   12         8          350        160   4456         13.5   72      1
## 90   12         8          429        198   4952         11.5   73      2
## 95   12         8          455        225   4951         11.0   73      1
## 103  11         8          400        150   4997         14.0   73      1
## 104  12         8          400        167   4906         12.5   73      2
## 106  12         8          350        180   4499         12.5   73      2
## 124  11         8          350        180   3664         11.0   73      2
##     origin                       name
## 26      US                  ford f250
## 27      US                  chevy c20
## 28      US                 dodge d200
## 29      US                   hi 1200d
## 42      US          dodge monaco (sw)
## 67      US            mercury marquis
## 69      US oldsmobile delta 88 royale
## 90      US   mercury marquis brougham
## 95      US   buick electra 225 custom
## 103     US           chevrolet impala
## 104     US               ford country
## 106     US   oldsmobile vista cruiser
## 124     US           oldsmobile omega
# select all rows with mpg less equal to 12 and greater equal to 11
Auto[Auto$mpg>=11 & Auto$mpg<=12,]
##     mpg cylinders displacement horsepower weight acceleration year rating
## 28   11         8          318        210   4382         13.5   70      2
## 42   12         8          383        180   4955         11.5   71      1
## 67   11         8          429        208   4633         11.0   72      2
## 69   12         8          350        160   4456         13.5   72      1
## 90   12         8          429        198   4952         11.5   73      2
## 95   12         8          455        225   4951         11.0   73      1
## 103  11         8          400        150   4997         14.0   73      1
## 104  12         8          400        167   4906         12.5   73      2
## 106  12         8          350        180   4499         12.5   73      2
## 124  11         8          350        180   3664         11.0   73      2
##     origin                       name
## 28      US                 dodge d200
## 42      US          dodge monaco (sw)
## 67      US            mercury marquis
## 69      US oldsmobile delta 88 royale
## 90      US   mercury marquis brougham
## 95      US   buick electra 225 custom
## 103     US           chevrolet impala
## 104     US               ford country
## 106     US   oldsmobile vista cruiser
## 124     US           oldsmobile omega

We can perform operation on existing variable and create new variables and add these new variable into the data frame.

E.g., The following code creates a new variable called horsepowerpercylinder, which is horsepower divided by # of cylinders, # and adds to the data frame.

Auto$horsepowerpercylinder=Auto$horsepower/Auto$cylinders
str(Auto)
## 'data.frame':    392 obs. of  11 variables:
##  $ mpg                  : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders            : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement         : num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower           : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight               : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration         : num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year                 : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ rating               : int  3 1 3 2 1 2 1 2 1 1 ...
##  $ origin               : chr  "US" "US" "US" "US" ...
##  $ name                 : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
##  $ horsepowerpercylinder: num  16.2 20.6 18.8 18.8 17.5 ...

We can also rename variables using the names() function.

# eamine the existing name of the Auto datafile.
names(Auto)
##  [1] "mpg"                   "cylinders"             "displacement"         
##  [4] "horsepower"            "weight"                "acceleration"         
##  [7] "year"                  "rating"                "origin"               
## [10] "name"                  "horsepowerpercylinder"
# change the 11th variable name to power_per_cylinder
names(Auto)[11]="power_per_cylinder"
str(Auto)
## 'data.frame':    392 obs. of  11 variables:
##  $ mpg               : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders         : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement      : num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower        : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight            : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration      : num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year              : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ rating            : int  3 1 3 2 1 2 1 2 1 1 ...
##  $ origin            : chr  "US" "US" "US" "US" ...
##  $ name              : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
##  $ power_per_cylinder: num  16.2 20.6 18.8 18.8 17.5 ...

To delete a variable from data frames:

# Delete the variable power_per_cylinder
Auto=Auto[,-c(11)]
str(Auto)
## 'data.frame':    392 obs. of  10 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ rating      : int  3 1 3 2 1 2 1 2 1 1 ...
##  $ origin      : chr  "US" "US" "US" "US" ...
##  $ name        : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...

To access the mpg in Auto data frame, we must use $ (e.g., Auto$mpg). If you simply use mpg, R would not know which data frame to look for mpg. But this causes the code too long and hard to read. We can use with() function to avoid that. E.g., plot the scatter plot between weight and mpg

with(Auto, plot(weight, mpg) )

The general format of with() function is to: with(DATA FRAME, CODE USING THE DATA FRAME)

4.5 Write a data frame to your computer

We can use write.csv() to write the data frame to your local computer.

write.csv(Auto,"data/Auto2.csv", row.names =FALSE)

Note, row.names = TRUE will generate an automatic ID (i.e., 1,2,3, …) for each rows.

4.6 Summary

  • While all elements in vector and matrix must be the same type, data.frame allows different columns (i.e., variable) to be different types.

  • use Base R for creating plots.

  • subsetting data.frame using index and variable names.

  • Use read.csv() to read csv file into R as data.frame; learn to subset data.frame.

  • Use write.csv() to write data.frame from R to your local computer.