Week 3 - Regression Methods

Important
  • Due date: Lab 3 - Sunday, Sept 28, 5pm ET

Prepare

📖 Read Chapter 2 - General Aspects of Fitting Regression Models in: Regression Modeling Strategies

📖 Read Chapter (8.1-8.5) - Regression Models in: Modern Statistics in R

📖 Follow along with the R code in Linear Regression in R: Linear Regression Hands on Tutorial

📖 Follow along with the R code from R code for Regression Analysis in An R companion

📖 Check out the worked examples in: Regression and Other Stories - Examples

📖 Read: Different ways of calculating OLS regression coefficients (in R)

📖 Read: How does glmnet perform ridge regression?

Participate

🖥️ Lecture 3 - Regression Methods

Perform

⌨️ Lab 3 - Regression Methods

⌨️ Example 1: grouped data & weighted regression

From Section 10.8 of Regression and Other Stories:

Three models leading to weighted regression

Weighted least squares can be derived from three different models:

  1. Using observed data to represent a larger population. This is the most common way that regression weights are used in practice. A weighted regression is fit to sample data in order to estimate the (unweighted) linear model that would be obtained if it could be fit to the entire population. For example, suppose our data come from a survey that oversamples older white women, and we are interested in estimating the population regression. Then we would assign each survey respondent a weight that is proportional to the number of people of that type in the population represented by that person in the sample. In this example, men, younger people, and members of ethnic minorities would have higher weights. Including these weights in the regression is a way to approximately minimize the sum of squared errors with respect to the population rather than the sample.

  2. Duplicate observations. More directly, suppose each data point can represent one or more actual observations, so that observation i represents a collection of w_i data points, all of which happen to have x_i as their vector of predictors, and where y_i is the average of the corresponding w_i outcome variables. Then weighted regression on the compressed dataset, (x, y, w), is equivalent to unweighted regression on the original data.

  3. Unequal variances. From a completely different direction, weighted least squares is the maximum likelihood estimate for the regression model with independent normally distributed errors with unequal variances, where sd(ε_i) is proportional to 1/√w_i . That is, measurements with higher variance get lower weight when fitting the model. As discussed further in Section 11.1, unequal variances are not typically a major issue for the goal of estimating regression coefficients, but they become more important when making predictions about individual cases.

We will use weighted regression later in the course (Lectures 7 & 8), using observed data to represent a larger population - case 1 above.

Here’s an example of the second case:

# check if 'librarian' is installed and if not, install it
if (!"librarian" %in% rownames(installed.packages())) {
  install.packages("librarian")
}
  
# load packages if not already loaded
librarian::shelf(dplyr, broom)

set.seed(1024)

# individual (true) dataset, with 100,000 rows
x <- round(rnorm(1e5))
y <- round(x + x^2 + rnorm(1e5))
ind <- data.frame(x, y)

# aggregated dataset: grouped
agg <- ind |>
  dplyr::group_by(x, y) |>
  dplyr::summarize(freq = dplyr::n(), .groups = 'drop')

models <- list( 
  "True"                = lm(y ~ x, data = ind),
  "Aggregated"          = lm(y ~ x, data = agg),
  "Aggregated & W"      = lm(y ~ x, data = agg, weights=freq)
)

models[['True']] |> broom::tidy(conf.int = TRUE)
# A tibble: 2 × 7
  term        estimate std.error statistic p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
1 (Intercept)     1.08   0.00580      187.       0    1.07       1.10
2 x               1.01   0.00558      181.       0    0.998      1.02
models[['Aggregated']] |> broom::tidy(conf.int = TRUE)
# A tibble: 2 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)    5.51      0.717      7.69 8.74e-11    4.08       6.95
2 x              0.910     0.302      3.01 3.69e- 3    0.306      1.51
models[['Aggregated & W']] |> broom::tidy(conf.int = TRUE)
# A tibble: 2 × 7
  term        estimate std.error statistic    p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>      <dbl>    <dbl>     <dbl>
1 (Intercept)     1.08     0.224      4.84 0.00000795    0.637      1.53
2 x               1.01     0.216      4.68 0.0000145     0.579      1.44

Note the differences in the coefficient estimates for x and in the corresponding standard errors: the weighted fit to the aggregated data reproduces the point estimates from the full individual-level data, while the unweighted aggregated fit does not, and the standard errors from both aggregated fits are larger because those models only see the grouped rows.
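
The third case (unequal variances) can be checked with a similar simulation. This is a minimal sketch, not from the book: it assumes the error standard deviation is known up to the precision weights w, with sd(ε_i) proportional to 1/√w_i.

# case 3: unequal variances
set.seed(1024)
n <- 1000
x <- rnorm(n)
w <- sample(1:10, n, replace = TRUE)          # known precision weights
y <- 1 + 2 * x + rnorm(n, sd = 1 / sqrt(w))   # higher w => smaller error variance

# weighted least squares with weights proportional to 1 / variance (the MLE here)
lm(y ~ x, weights = w) |> broom::tidy(conf.int = TRUE)

# for comparison: ordinary least squares ignoring the unequal variances
lm(y ~ x) |> broom::tidy(conf.int = TRUE)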

Study

Short Answer Questions

Answer each question in 2-3 sentences.

  1. Explain the primary difference between a simple linear regression model and a generalized linear model (GLM) in terms of their assumptions about the outcome variable.

  2. What is the main purpose of using a Taylor series approximation in the context of linear regression models, and why are only a few terms typically used?

  3. Describe the concept of collinearity in linear regression and name two potential negative effects it can have on model estimates.

  4. How do Ridge and Lasso regression differ in their approach to regularization, and what unique benefit does Lasso provide?

  5. Explain the bias-variance trade-off in the context of model complexity. Give an example of a model exhibiting high bias and low variance.

  6. What is the core idea behind Ordinary Least Squares (OLS) estimation, and what assumption about the error term is crucial for its derivation?

  7. How does the QR decomposition method address the numerical stability issues encountered when solving for OLS coefficients in matrix form?

  8. Briefly describe the main difference in how Random Forests and Gradient Boosting Machines (GBMs) build their ensembles of trees.

  9. What are the key advantages of Kernel Regression, and what are its primary disadvantages concerning computational efficiency and parameter tuning?

  10. In a neural network for regression, what roles do activation functions and loss functions play during the training process?

Answer Key

  1. Simple linear regression (SLR) assumes the outcome variable, conditional on the predictors, follows a Normal (Gaussian) distribution with constant variance. In contrast, Generalized Linear Models (GLMs) generalize this by allowing the outcome variable to follow any probability distribution from the exponential family (e.g., Poisson, Binomial), connected to the linear predictor via a link function.

  2. A Taylor series approximates a continuous function as a sum of simpler polynomial functions. In linear regression, this allows for the decomposition of the mean function, and typically only the first few terms are used because for smooth functions, higher-order coefficients decrease rapidly, simplifying the model while retaining continuity.

  3. Collinearity refers to a direct linear relationship or high correlation among the predictor variables in a regression model. This problem can lead to implausible coefficient signs, making interpretation difficult, and it can cause the $X'X$ matrix to become nearly or exactly singular, preventing its inversion for coefficient estimation.

  4. Ridge regression penalizes the $L_2$ norm of the weights, shrinking coefficients towards zero but rarely making them exactly zero. Lasso regression penalizes the $L_1$ norm, which has the unique benefit of performing automatic feature selection by forcing some coefficients to become exactly zero, effectively removing those predictors from the model.

  5. The bias-variance trade-off illustrates that as model complexity increases, bias generally decreases (better fit to training data), but variance tends to increase (more sensitive to specific training data). A model exhibiting high bias and low variance would be an underfitted model, such as using a simple linear model to predict house prices when the true relationship is highly non-linear and involves many factors.

  6. Ordinary Least Squares (OLS) aims to minimize the residual sum of squares, finding the coefficients that result in the smallest squared differences between observed and predicted values. A crucial assumption for its derivation is that the error term has a zero mean and is independent of the covariates, implying $E[u]=0$ and $E[u|x]=0$.

  7. Directly computing the inverse of the $X'X$ matrix can be numerically unstable, especially for ill-conditioned matrices. The QR decomposition method factorizes the design matrix X into an orthogonal matrix Q and an upper triangular matrix R, allowing the regression coefficients to be solved via back-substitution, which is a more stable numerical approach.

  8. Random Forests build an ensemble of deep, independent trees, where trees are decorrelated through bootstrap sampling and split-variable randomization at each node. Gradient Boosting Machines (GBMs), in contrast, build an ensemble of shallow, weak successive trees, with each new tree trained to correct the errors (residuals) made by the combined ensemble of all previous trees.

  9. Advantages of Kernel Regression include its flexibility to capture complex, non-linear relationships without assuming a specific functional form. Its disadvantages are that it can be computationally intensive, especially with large datasets, as it requires calculating weights for all data points for each estimate, and its results are highly sensitive to the choice of kernel function and bandwidth, necessitating careful tuning.

  10. In a neural network, activation functions introduce non-linearity to the model, allowing it to learn complex patterns and relationships that linear models cannot capture. The loss function quantifies the difference between the network’s predicted output and the actual target values, serving as the objective that the optimization algorithm (e.g., gradient descent) tries to minimize during training.
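
The short R sketches below illustrate several of the answers above. They use simulated data, and the package and function choices are assumptions made for illustration; they are not part of the assigned materials.

For answer 1, the same formula interface fits both models; the GLM adds a family and link function:

set.seed(1)
x <- rnorm(200)

# Gaussian outcome: simple linear regression
y_gaussian <- 1 + 0.5 * x + rnorm(200)
lm(y_gaussian ~ x)

# count outcome: Poisson GLM with a log link
y_count <- rpois(200, lambda = exp(0.2 + 0.5 * x))
glm(y_count ~ x, family = poisson(link = "log"))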
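
For answer 2, a smooth nonlinear mean function is often well approximated by the first few polynomial terms:

set.seed(2)
x <- runif(300, -1, 1)
y <- sin(2 * x) + rnorm(300, sd = 0.1)   # smooth, nonlinear mean function

# first-order (linear) vs. third-order polynomial approximation
fit1 <- lm(y ~ x)
fit3 <- lm(y ~ poly(x, 3))
c(linear = summary(fit1)$r.squared, cubic = summary(fit3)$r.squared)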
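
For answer 3, two nearly collinear predictors inflate the standard errors and make $X'X$ nearly singular:

set.seed(3)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)   # nearly collinear with x1
y  <- 1 + x1 + rnorm(n)

summary(lm(y ~ x1 + x2))         # note the large standard errors on x1 and x2

X <- cbind(1, x1, x2)
kappa(crossprod(X))              # huge condition number: X'X is close to singular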
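
For answer 4, glmnet (the package in this week's ridge-regression reading) fits both penalties; alpha = 0 gives ridge and alpha = 1 gives the lasso:

librarian::shelf(glmnet)

set.seed(4)
n <- 200; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)   # only the first two predictors matter

ridge <- cv.glmnet(X, y, alpha = 0)   # L2 penalty: shrinks, rarely zeroes out
lasso <- cv.glmnet(X, y, alpha = 1)   # L1 penalty: sets some coefficients to 0

coef(ridge, s = "lambda.min")
coef(lasso, s = "lambda.min")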
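
For answer 5, comparing an underfit and an overfit polynomial on held-out data makes the trade-off concrete:

set.seed(5)
n     <- 200
x     <- runif(n, -2, 2)
y     <- x^3 - x + rnorm(n)
train <- sample(n, n / 2)

rmse <- function(fit, idx) {
  sqrt(mean((y[idx] - predict(fit, newdata = data.frame(x = x[idx])))^2))
}

fit_simple  <- lm(y ~ x, subset = train)             # high bias, low variance
fit_complex <- lm(y ~ poly(x, 15), subset = train)   # low bias, high variance

c(simple = rmse(fit_simple, -train), complex = rmse(fit_complex, -train))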
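
For answer 6, the normal-equations solution $\hat{\beta} = (X'X)^{-1}X'y$ can be computed directly and compared with lm() (see also this week's reading on different ways of calculating OLS coefficients):

set.seed(6)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

X <- cbind(1, x)                                   # design matrix with intercept
beta_hat <- solve(crossprod(X), crossprod(X, y))   # solves (X'X) beta = X'y
drop(beta_hat)

coef(lm(y ~ x))                                    # should match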
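
For answer 7, the same coefficients can be obtained from the QR factorization of X without ever forming $X'X$:

set.seed(6)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
X <- cbind(1, x)

# factor X = QR, then solve R beta = Q'y by back-substitution
qr_X <- qr(X)
backsolve(qr.R(qr_X), qr.qty(qr_X, y)[1:ncol(X)])

coef(lm(y ~ x))   # lm() uses a QR-based solver internally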
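
For answer 8, a minimal sketch of the two ensemble styles, assuming the randomForest and gbm packages (a choice made purely for illustration):

librarian::shelf(randomForest, gbm)

set.seed(7)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- x1^2 + x2 + rnorm(n)
dat <- data.frame(y, x1, x2)

# random forest: deep trees grown independently on bootstrap samples,
# with a random subset of predictors considered at each split
rf <- randomForest(y ~ ., data = dat, ntree = 500)

# gradient boosting: shallow trees added sequentially, each fit to the
# residuals of the ensemble built so far
gb <- gbm(y ~ ., data = dat, distribution = "gaussian",
          n.trees = 500, interaction.depth = 2, shrinkage = 0.05)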
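
For answer 9, base R's ksmooth() shows both the flexibility and the bandwidth sensitivity of kernel regression:

set.seed(8)
x <- runif(300, 0, 10)
y <- sin(x) + rnorm(300, sd = 0.3)

# Nadaraya-Watson kernel smoother: each estimate reweights all n data points
fit_narrow <- ksmooth(x, y, kernel = "normal", bandwidth = 0.3)   # wiggly
fit_wide   <- ksmooth(x, y, kernel = "normal", bandwidth = 5)     # oversmoothed

plot(x, y, col = "grey")
lines(fit_narrow, col = "red")
lines(fit_wide, col = "blue")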
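
For answer 10, a single-hidden-layer network from the nnet package (assumed here for illustration): the hidden units use a sigmoid activation, and linout = TRUE gives a linear output unit so a squared-error loss is minimized during training.

librarian::shelf(nnet)

set.seed(9)
x <- runif(500, -3, 3)
y <- sin(x) + rnorm(500, sd = 0.2)
dat <- data.frame(x, y)

# 8 hidden units; decay adds a small L2 penalty on the weights
fit_nn <- nnet(y ~ x, data = dat, size = 8, linout = TRUE,
               decay = 0.01, maxit = 500, trace = FALSE)

plot(x, y, col = "grey")
points(x, predict(fit_nn, dat), col = "red", pch = 20)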

Essay Format Questions

  1. Discuss the relationship between Taylor series approximations and the common assumption of linearity in regression models. Explain how the concept of “smoothness” of the mean function influences the choice of complexity (number of terms/covariates) in a linear regression model, and its implications for the bias-variance trade-off.

  2. Compare and contrast Ordinary Least Squares (OLS) regression with Ridge and Lasso regression. Explain the regularization mechanisms of each penalized method, their effect on model coefficients, and when one might be preferred over the others. Include a discussion of their impact on the bias-variance trade-off.

  3. Explain the concept of a Generalized Linear Model (GLM) by detailing its three key components. Choose two specific types of GLMs (e.g., Logistic, Poisson, Gamma) and describe a scenario where each would be appropriate, including the characteristics of their response variables and their respective link functions.

  4. Analyze the evolution of tree-based regression methods from single regression trees to bagging and then to random forests. Discuss the limitations of a single regression tree and explain how bagging and random forests address these limitations. What additional mechanism does random forests introduce to further improve performance over bagging?

  5. Describe the fundamental differences in approach between Kernel Regression and Artificial Neural Networks (ANNs) for non-parametric regression. Discuss the advantages and disadvantages of each method, considering factors such as flexibility, computational intensity, data requirements, and interpretability.



Back to course schedule