# check if 'librarian' is installed and if not, install it
if (! "librarian" %in% rownames(installed.packages()) ){
install.packages("librarian")
}
# load packages if not already loaded
librarian::shelf(
  tidyverse, broom, rsample, ggdag, causaldata, halfmoon, ggokabeito,
  malcolmbarrett/causalworkshop, magrittr, ggplot2, estimatr, Formula,
  r-causal/propensity, gt, gtExtras, timetk, modeltime
)
# set the default theme for plotting
theme_set(theme_bw(base_size = 18) + theme(legend.position = "top"))

2025-midterm
SOLUTIONS
Packages
Part 1
Q1
What is the primary purpose of the ‘I’ (Integrated) component in an ARIMA model?
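For reference, the I component differences the series to remove trend and make it stationary before the AR and MA terms are fit; first-order differencing is

$$ y'_t = y_t - y_{t-1} $$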
Q2
What is the main purpose of the k-means clustering algorithm?
Q3
In the context of directed acyclic graphs (DAGs), what do confounders represent?
Q4
For the binary classifier with the confusion matrix below:
The specificity (i.e. TNR) of this binary classifier is approximately:
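For reference, specificity is the true-negative rate,

$$ \mathrm{TNR} = \frac{TN}{TN + FP} $$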
Q5
In exponential smoothing, what does the smoothing factor (alpha) primarily control?
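For reference, simple exponential smoothing updates the smoothed level as

$$ \ell_t = \alpha\, y_t + (1 - \alpha)\, \ell_{t-1} $$

so a larger $\alpha$ weights recent observations more heavily relative to the smoothed history.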
Q6
What is the purpose of an adjustment set in causal inference?
Q7
What is a common disadvantage of lazy machine learning learners like k-Nearest Neighbors (k-NN)?
Q8
How many open paths are in the DAG above?
Q9
When is accuracy a potentially misleading metric for evaluating a classification model?
Q10
What is the ‘naive’ assumption made in Naive Bayes classification?
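For reference, the naive assumption is that the features are conditionally independent given the class label:

$$ P(x_1, \ldots, x_p \mid y) = \prod_{j=1}^{p} P(x_j \mid y) $$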
Part 2
Q11
Execute the following code to create simulated observational data, where D is the treatment variable and Y is the response variable.
set.seed(8740)
n <- 800
V <- rbinom(n, 1, 0.2)       # exogenous binary variable
W <- 3*V + rnorm(n)          # V -> W
D <- V + rnorm(n)            # V -> D (treatment)
Y <- D + W^2 + 1 + rnorm(n)  # D -> Y and W -> Y (true effect of D on Y is 1)
Z <- D + Y + rnorm(n)        # D -> Z <- Y (Z is a collider)
data_obs <- tibble::tibble(V=V, W=W, D=D, Y=Y, Z=Z)

In the code below we fit several different outcome models. Compare the resulting coefficients for D. Which regressions appear to lead to unbiased estimates of the causal effect? (2 points)
# helper: extract the treatment (D) coefficient (row 2 of the tidy output)
# and its 95% CI, labelled with the model name
extract_CI <- function(m, str){
  broom::tidy(m, conf.int = TRUE) |>
    dplyr::slice(2) |>
    dplyr::select(term, estimate, conf.low, conf.high) |>
    dplyr::mutate(model = str, .before = 1)
}
# linear model of Y on D
lin_YX <- lm(Y ~ D, data=data_obs) |> extract_CI("YX")
# linear model of Y on D and V
lin_YV <- lm(Y ~ D + V, data=data_obs) |> extract_CI("YV")
# linear model of Y on D and W
lin_YW <- lm(Y ~ D + W, data=data_obs) |> extract_CI("YW")
dplyr::bind_rows(lin_YX, lin_YV, lin_YW)
# A tibble: 3 × 5
  model term  estimate conf.low conf.high
  <chr> <chr>    <dbl>    <dbl>     <dbl>
1 YX    D        2.27     1.99      2.55
2 YV    D        0.921    0.724     1.12
3 YW    D        1.30     1.10      1.51
Answer the questions below and list all valid adjustment sets for the causal structure in this data (a good first step is to sketch the causal relations between the variables; you don't need ggdag::dagify, just read them off the data-generating code above). (2 points)
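A minimal sketch of the causal structure, read directly off the simulation code above (ggdag and its dagitty backend are already available from the packages loaded at the top; this is illustrative, not required for the answer):

dag <- ggdag::dagify(
  W ~ V,     # W <- V
  D ~ V,     # D <- V
  Y ~ D + W, # Y <- D and Y <- W
  Z ~ D + Y, # Z is a collider: D -> Z <- Y
  exposure = "D", outcome = "Y"
)
# enumerate every valid adjustment set for the effect of D on Y
dagitty::adjustmentSets(dag, type = "all")

The only back-door path from D to Y is D <- V -> W -> Y, so {V}, {W}, and {V, W} are all valid adjustment sets, while Z (a collider and a descendant of Y) must not be adjusted for. Note that the Y ~ D + W regression adjusts for W only linearly although Y depends on W^2, which is why its estimate (1.30) still misses the true effect of 1 even though {W} is a valid adjustment set.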
Q12
For this question we’ll use the Spam Classification Dataset available from the UCI Machine Learning Repository. It features a collection of spam and non-spam emails represented as feature vectors, making it suitable for a logistic regression model. The data is in your data/ directory and the metadata is in the data/spambase/ directory.
We’ll use this data to create a model for detecting email spam using a boosted tree model.
spam_data <- readr::read_csv('data/spam.csv', show_col_types = FALSE) %>%
  tibble::as_tibble() %>%
  dplyr::mutate(type = forcats::as_factor(type))

(1) Split the data into test and training sets, and create a default recipe and a default model specification. Use the xgboost engine for the model, with mtry = 10 and tree_depth = 5. (1 point)
(2) Create a default workflow object with the recipe and the model specification, fit the workflow using parsnip::fit and the training data, and then generate the testing results by applying the fit to the testing data using broom::augment. (1.5 points)
(3) Evaluate the testing results by plotting the ROC curve and calculating the accuracy. (1 point)
(4) Take your fitted model and extract the underlying engine fit using (lm_fit |> workflows::extract_fit_parsnip())$fit, then compute the feature importance matrix and identify the most important feature. (1.5 points) A sketch covering all four parts follows below.
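One possible end-to-end sketch for parts (1)-(4), assuming the tidymodels packages and the xgboost engine are installed; object names such as spam_split and spam_fit are illustrative, and the probability column .pred_spam should be adjusted to match your factor's event level:

library(tidymodels)

# (1) split the data and build a default recipe and model specification
set.seed(8740)
spam_split <- rsample::initial_split(spam_data, strata = type)
spam_train <- rsample::training(spam_split)
spam_test  <- rsample::testing(spam_split)

spam_rec  <- recipes::recipe(type ~ ., data = spam_train)
spam_spec <- parsnip::boost_tree(mtry = 10, tree_depth = 5) |>
  parsnip::set_engine("xgboost") |>
  parsnip::set_mode("classification")

# (2) workflow, fit on the training data, augment the testing data
spam_wf  <- workflows::workflow(spam_rec, spam_spec)
spam_fit <- parsnip::fit(spam_wf, data = spam_train)
spam_res <- broom::augment(spam_fit, new_data = spam_test)

# (3) ROC curve and accuracy on the test set
spam_res |>
  yardstick::roc_curve(truth = type, .pred_spam) |>
  ggplot2::autoplot()
yardstick::accuracy(spam_res, truth = type, estimate = .pred_class)

# (4) feature importance from the underlying xgboost fit
# xgb.importance() returns features sorted by gain, most important first
xgb_fit <- (spam_fit |> workflows::extract_fit_parsnip())$fit
xgboost::xgb.importance(model = xgb_fit) |> head(1)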
Q13
When preprocessing data for time series models, what is the function recipes::step_lag() used for? (1.5 points) Give an example of its use in a recipe engineered for use with weekly data records. (1.5 points)
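A minimal sketch, assuming a hypothetical weekly dataset weekly_data with a date column and a numeric outcome sales: step_lag() adds columns holding earlier values of a series, so lagged observations can serve as predictors. With weekly records, lag = c(1, 4) supplies last week's and last month's values:

weekly_rec <- recipes::recipe(sales ~ date, data = weekly_data) |>
  recipes::step_lag(sales, lag = c(1, 4)) |>      # creates lag_1_sales, lag_4_sales
  recipes::step_naomit(recipes::all_predictors()) # drop the initial rows lost to lagging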
Q14
A peer-reviewed paper by researchers at Harvard Medical School, the UCLA School of Medicine, and the Department of Emergency Medicine at the University of Michigan (Robert Yeh et al., 2018) has the following abstract describing the work and its results: