SCMA450 Notes
1
Overview of Business Analytics and R Programming
2
Define Variables
2.1
Use R as a Calculator
2.2
Define variables
2.3
Basic Variable Modes: integer, numeric, character, logical, factor.
2.3.1
Integer and numeric
2.3.2
Logical
2.3.3
Character
2.3.4
Type coercion
2.4
Exercise 1: Greeting from R
2.5
Exercise 2: Mortgage Calculation
2.6
Summary
3
Data Structure: Vector, Matrix, List and Factor
3.1
Vector
3.1.1
Subsetting vectors
3.1.2
Vector calculations
3.2
Matrices
3.2.1
subsetting matrix
3.2.2
matrix calculations
3.3
Factors
3.4
List
3.5
Summary
4
Data Frame
4.1
Define a data.frame manually
4.2
Subsetting data.frame
4.3
Import csv data as data.frame.
4.4
Subsetting data frame
4.5
Write a data frame to your computer
4.6
Summary
5
Functions
5.1
Get the Help Document of A Function
5.2
Call built-in functions in R
5.3
Write Your Own Functions
5.4
Flexible Function
5.5
Exercise 1
5.6
The apply() function family
5.7
Scoping Rule in R
5.8
Exercise
6
Flow Control
6.1
if…else Statement
6.2
if…else if…else Statement
6.3
Exercise
6.4
for loop
6.5
Break Your Loops With break
6.6
Jump to the next cycle with next
6.7
Exercise
6.8
While Loop
7
R markdown for documentation
7.1
rock, paper, scissors
7.2
Version 1
7.3
Version 2
7.4
Version 3
7.5
Version 4
7.6
Version 5
8
Use data.table Package for Big Data I
8.1
Install R Packages
8.2
fread(): Read csv file into R as data.table
8.2.1
Convert data.frame into data.table
8.3
Quick summary of the data
8.4
Data Manipulation with data.table
8.4.1
Filtering rows through logical expression
8.4.2
Filtering rows by row index
8.4.3
Filtering rows through %between%, %chin%, %like%
8.5
Select Columns
8.5.1
Using column names to select columns
8.5.2
Select columns as variables - the data.table way
8.5.3
Computing on columns
8.5.4
Computing on rows and columns
8.5.5
Special symbol .N in j
8.5.6
Advanced Computation on columns
8.5.7
Sort Data by One/Multiple Columns
8.6
Calculation by Group
8.7
Summary
References
9
Use data.table Package for Big Data II
9.1
Use Chaining to Avoid Intermediate Result
9.2
uniqueN()
9.3
Subset of Data: .SD[]
9.4
Use := to Add/Update Columns By Reference
9.5
Binning a continuous value into category
9.6
Expressions in by
9.7
Summary
10
Join Multiple Datasets
10.0.1
Join files using merge()
10.0.2
inner join
10.0.3
full join
10.0.4
left join/right join
10.1
Merge dataset using data.table syntax
10.2
Chaining Join and Data Manipulation
10.3
Exercise
10.4
Summary
11
Put it together: explore big data with data.table
11.1
Import the COVID-19 case data
11.2
Selecting Rows and Columns
11.3
Selecting Rows and Columns
11.4
Computing on j
11.5
Computing on j
11.6
Computing on j
11.7
Computing on j
11.8
Special symbol .N
11.9
Filtering rows through %between%
11.10
Filtering rows through %chin%
11.11
uniqueN()
11.12
Subset of Data: .SD[ ]
11.13
Subset of Data: .SD[ ]
11.14
Use := to Add/Update Columns By Reference
11.15
Join datasets
11.16
Additional Exercises:
11.17
Summary
12
Data Cleaning I
12.1
Exploring the raw data
12.1.1
Understand the structure of the data
12.1.2
Visualize the data in tabular form
12.1.3
Visualize the raw data through charts
12.2
Tidying data
12.2.1
melt()
12.2.2
dcast()
12.3
Preparing data for analysis
12.3.1
Type conversions
12.3.2
Working with dates
12.3.3
String (character) manipulation
12.4
Missing and special values in R
12.5
Outliers, obvious error and missing value
12.6
Summary
13
Data Cleaning II
13.1
Explore the raw data
13.2
Summarize the data
13.2.1
Take a closer look
13.3
Tidy the data
13.4
prepare for analysis
13.4.1
Clean up dates
13.4.2
A closer look at column types
13.4.3
Missing, extreme and unexpected values
13.4.4
Finishing touches
13.5
Summary
14
Visualization with ggplot2 I
14.1
Scatter plot
14.2
line-fitting
14.3
Overplotting
14.4
Bar chart and statistical transformation
14.5
stacked bar
14.6
Histogram
14.7
boxplot
14.8
line chart
14.9
use theme to customize the appearance of your chart.
14.10
Summary
15
Visualization with ggplot2 II
15.1
Bar chart and statistical transformation
15.2
stacked bar
15.3
Histogram
15.4
boxplot
15.5
line chart
15.6
use theme to customize the appearance of your chart.
15.7
Summary
16
Visualization with ggplot2 III
16.1
Draw world map
16.2
Draw US map
16.3
draw US county map
16.4
Summary
17
Visualization with ggplot2 - Animation
17.1
Animation of scatter plot to show how two variables evolve over time simultaneously
17.2
Use a line chart to show how gdpPercap evolves over time
17.3
Animation with Bar chart
18
Simple Linear Regression
18.1
Examine and explore the data
18.1.1
Correlation matrix
18.1.2
Scatter plot
18.2
Develop simple linear regression model
18.2.1
Estimation
18.2.2
How to interpret the coefficient estimates?
18.3
Overall model fitness
18.3.1
Regression Diagnostics
18.4
Use estimated model for prediction
18.5
Summary
19
Log Transformation and Polynomial Terms
19.1
Log-transformation
19.2
Quadratic regression
19.3
Polynomial regression
19.4
Summary
19.5
Exercise
19.5.1
Solution
20
Multiple Linear Regression
20.1
Multiple linear regression
20.2
Categorical variables
20.3
Interpret coefficients - Keeping all other variables the same
20.4
Multicollinearity
20.5
Exercise
20.5.1
Examine the frequency of chas
20.5.2
Visualize the house price with ggplot2
20.5.3
estimate the linear regression model
20.5.4
plot the fitted line
21
Best Subset Model Selection
21.1
Best Subset Model Selection
21.2
Determining the best model with adj-R2, Cp, AIC, and BIC
21.2.1
R-squared is not a good criterion for model selection
21.2.2
Select the best model using Adjusted R2
21.2.3
Select the best model using Cp
21.2.4
Select the best model using BIC
21.3
Forward/Backward selection
21.4
Summary:
22
Model Selection with Training/Test Set
22.1
Create training/test set
22.2
Model selection
22.2.1
estimate the model using training set
22.2.2
evaluate the models’ out-of-sample prediction using test set
22.3
Model 4 has the problem of over-fitting
22.4
Other Commonly Used Metrics for Prediction performance
22.5
Summary
23
Best Subset Model Selection with Training/Test Set
23.1
Create training/test set
23.2
Evaluate the Best subset regression through training/test set
23.3
Exercise
24
Decision Tree
25
Machine Learning with Caret
25.1
Random Forest
25.2
Random Forest
Notes for SCMA450: Data Modeling and Computing
References