Lab 7 - Causality: DAGs

BSMM 8740 Fall 2025

Author

Add your name here

Introduction

In today’s lab, you’ll practice working with

Potential Outcomes and causal estimands,
Building DAGs and extracting adjustment sets,
How to achieve exchangeability through RCT and IPW,
The FWL theorem to understand the structure of regression,
Causal worklows,
and Instrumental Variables.

Learning goals

By the end of the lab you will…

Be able to build DAGs to model causal assumptions and use the causal model to answering causal questions.
Be able to build a causal workflow for real world causal problems.

Getting started

To complete the lab, log on to your github account and then go to the class GitHub organization and find the 2025-lab-7-[your github username] repository .

Create an R project using your 2025-lab-7-[your github username] repository (remember to create a PAT, etc.) and add your answers by editing the 2025-lab-7.qmd file in your repository.
When you are done, be sure to: save your document, stage, commit and push your work.

Important

To access Github from the lab, you will need to make sure you are logged in as follows:

username: .\daladmin
password: Business507!

Remember to (create a PAT and set your git credentials)

create your PAT using usethis::create_github_token() ,
store your PAT with gitcreds::gitcreds_set() ,
set your username and email with
- usethis::use_git_config( user.name = ___, user.email = ___)

Packages

# check if 'librarian' is installed and if not, install it
if (! "librarian" %in% rownames(installed.packages()) ){
  install.packages("librarian")
}
  
# load packages if not already loaded
librarian::shelf(
  tidyverse, broom, rsample, ggdag, causaldata, halfmoon, ggokabeito, malcolmbarrett/causalworkshop
  , magrittr, ggplot2, estimatr, Formula, r-causal/propensity, gt, gtExtras, patchwork)

# set the default theme for plotting
theme_set(theme_bw(base_size = 18) + theme(legend.position = "top"))

Package installs

If you install packages outside of the ones below, remove your install.packages(.) code before submitting your solutions.

Leaving instructions that will install packages may cause your code to fail, but more importantly, it can corrupt the TA/instructor’s system.

Exercise 1: ATT, ATU, ATE

Open the spreadsheet at data/outcomes.xlsx. This spreadsheet contains potential outcomes due to treatment for 10 individuals, including their treatment status.

Fill out the treatment effect and outcome columns in the spreadsheet, and then compute the ATT, ATT, and ATU

YOUR ANSWER Q1:

random treatment:

ATT is ____
ATU is ____
ATE is ____

non-random treatment:

ATT is ____
ATU is ____
ATE is ____

Exercise 2: DAGs and open paths

Find the open paths from D (treatment) to Y (outcome) in the four DAGs below.

You can examine the DAGS to identify the open paths (this is the recommended first step) and then use the code for the DAGs (below), along with the function dagitty::paths to confirm.

YOUR ANSWER Q2:

DAG_1
- open paths from D (treatment) to Y (outcome):
DAG_2
- open paths from D (treatment) to Y (outcome):
DAG_3
- open paths from D (treatment) to Y (outcome):
DAG_4
- open paths from D (treatment) to Y (outcome):

Exercise 3: Building a DAG

You work for a company that sells a commodity to retail customers, and your management is interested in the relationship between your price and the demand for the commodity at your outlets. You have one competitor and your pricing tactic is to set your price at slightly less that your competitor’s. Your company surveys the competitor’s prices several times per day and once you know the competitor’s price, the pricing team resets your prices according to the pricing tactic. The public is well informed of both prices when they make their choice to buy.

You and your competitor buy from the wholesaler at a price that is set by the global market, and the wholesaler’s price is reset at the beginning of the each day according to the market price at the end of the day before. As the market is traded globally it reflects global demand for the commodity as well as other global and local economic shocks that you customers might be exposed to (interest rates, general business conditions, wages, etc.).

Your company has panel data (i.e time-stamped) on local economic conditions, its own sales, its own prices, competitor prices and wholesale prices, and has asked you to do an analysis of the pricing tactics to increase demand.

To confirm your understanding of the business, perhaps identify missing data, and to inform your analysis, create a DAG describing the assumed relationships between the driving factors for this problem.
What data might be missing from dataset provided by the company?

YOUR ANSWER Q3:

Finish constructing the DAG and use this code to display it

set.seed(8740)
# fill in the missing node(s)
ggdag::dagify(
  demand ~ ❓           # demand for the commodity at your business
  , price ~ ❓          # price for the commodity at your business
  , c_price ~❓         # competitor price for the commodity 
  , wholesale ~ ❓      # wholesale price for all businesses
  , economy ~ ❓        # a measure of general economic activity
  , exposure = "price"
  , outcome = "demand"
  , labels = c(
      demand = "demand",
      price = "price",
      c_price = "c_price",
      economy = "economy",
      wholesale = "wholesale"
    )
) %>%
  ggdag::ggdag(use_labels = "label", text = FALSE) +
  ggdag::theme_dag()

What data might be missing from the company data / this DAG, if any: ________________________

Exercise 4: Inverse Probability Weights (IPWs)

in class we used the function propensity::wt_ate to calculate inverse probability weights, starting from the propensity scores, as in the code below:

propensity_model <- glm(
  net ~ income + health + temperature,
  data = causalworkshop::net_data,
  family = binomial()
)

Repeat the calculation of the IPWs, using the definition of the weight as the inverse probability:

YOUR ANSWER Q4:

#
# calculate inverse probability weights
net_data_wts <- propensity_model |>
  broom::augment(newdata = causalworkshop::net_data, type.predict = "response") |>
  dplyr::mutate(ip_wts = ❓)

and show that your calculated weights are the same as those computed by propensity::wt_ate

# show that ip_wts = wts

Exercise 5: Randomized Controlled Trials

The essence of exchangeability is that the treated and untreated groups are very similar with respect to values of potential confounders. Randomization of treatment makes outcomes independent of treatment, and also makes the treated and untreated groups very similar with respect to values of potential confounders.

Show that this is the case for our mosquito net data by simulating random treatment assignment as follows:

#
# use this data - mosquito net data plus a row id numer
smpl_dat <- causalworkshop::net_data |>
  tibble::rowid_to_column()

use tidysmd::tidy_smd with smpl_dat and group=net to calculate the standardized mean differences (SMDs) for the confounders income, health and temperature.
use dplyr::slice_sample to randomly sample from smpl_dat, with proportion 0.5. Give this sample data a name.
mutate the sample to add a column with smpl = 1.
take the data not in the first sample and form a second sample (start with the original data (smpl_dat) and remove the rows that appear in the sample of step 1. This is why we added a row id. Give this second sample data a name.
mutate the second sample to add a column with smpl = 0.
bind the two samples together by rows (e.g. dplyr::bind_rows).
use tidysmd::tidy_smd with the combined samples from step 6 and group=smpl to calculate the standardized mean differences (SMDs) for the confounders income, health and temperature.

Did randomization make the treatment groups more alike with respect to income, health and temperature?

YOUR ANSWER Q5:

# show the smd table for the original data and for the randomized data
#

Exercise 6: Frisch-Waugh-Lovell Theorem

Here we’ll look at credit and default-risk data. First we’ll load the data:

#
# load data
risk_data = readr::read_csv("data/risk_data.csv", show_col_types = FALSE)

The FWL Theorem states that a multivariate linear regression can be estimated all at once or in three separate steps. For example, you can regress default on the financial variables credit_limit, wage, credit_score1, and credit_score2 as follows:

#
# regress default on financial variables in the dataset
model <- lm(default ~ credit_limit + wage +  credit_score1 + credit_score2, data = risk_data)

model |> broom::tidy(conf.int = TRUE)

# A tibble: 5 × 7
  term             estimate  std.error statistic  p.value     conf.low conf.high
  <chr>               <dbl>      <dbl>     <dbl>    <dbl>        <dbl>     <dbl>
1 (Intercept)    0.404      0.00860        46.9  0             3.87e-1   4.21e-1
2 credit_limit   0.00000306 0.00000154      1.99 4.69e- 2      4.16e-8   6.08e-6
3 wage          -0.0000882  0.00000607    -14.5  8.33e-48     -1.00e-4  -7.63e-5
4 credit_score1 -0.0000417  0.0000183      -2.28 2.28e- 2     -7.77e-5  -5.82e-6
5 credit_score2 -0.000304   0.0000152     -20.1  4.10e-89     -3.34e-4  -2.74e-4

Per FWL you can also break this down into

a de-biasing step, where you regress the treatment (credit_limit) on the financial confounders wage, credit_score1, and credit_score2 , obtaining the residuals
a de-noising step, where you regress the outcome (default) on the financial confounders, obtaining the residuals
an outcome model, where you regress the outcome residuals from step 2 on the treatment residuals of step 1.

Due to confounding, the data looks like this, with default percentage trending down by credit limit.

risk_data |> 
  dplyr::group_by(credit_limit) |> 
  # add columns for number of measurements in the group, and the mean of the group
  dplyr::mutate(size = n(), default = mean(default), ) |> 
  # pull ot the distict values
  dplyr::distinct(default,credit_limit, size) |> 
  ggplot(aes(x = credit_limit, y = default, size = size)) +
  geom_point() +
  labs(title = "Default Rate by Credit Limit", x = "credit_limit", y = "default") +
  theme_minimal()

Step 1:

Create the debiasing model, and
add the residuals to the risk_data, saving the result in risk_data_deb in a column credit_limit_res.
plot the de-biased data
regress the outcome (default) on credit_limit_res

The de-biasing step is crucial for estimating the correct causal effect, while the de-noising step is nice to have, since it reduces the variance of the coefficient estimate and narrows the confidence interval.

Step 2:

create the de-noising model,
add the residuals to the de-biased data (risk_data_deb), saving the result in risk_data_denoise in a column default_res.

Step 3:

regress the default residuals (default_res) on the credit limit residuals (credit_limit_res)

YOUR ANSWER Q6:

# create de-biasing model
debiasing_model <- lm(❓, data = risk_data)

debiasing_model |> broom::tidy(conf.int = TRUE)


# add a column with the residuals of the de-biasing model 
# (add the residuals to the mean credit limit to give a nice interpretation to a zero residual)
risk_data_deb <- risk_data |> 
  dplyr::mutate(credit_limit_res = mean(credit_limit) + ❓ )


risk_data_deb |> 
  # round the residuals prior to grouping
  dplyr::mutate(credit_limit_res = round(credit_limit_res,digits=-2)) |> 
  dplyr::group_by(credit_limit_res) |> 
  # add columns for number of measurements in the group, and the mean of the group
  dplyr::mutate(size = n(), default = mean(default), ) |> 
  # only plot the residual groups with 'large' numbers of cases
  dplyr::filter(size>30) |> 
  # pull ot the distict values
  dplyr::distinct(default,credit_limit_res, size) |> 
  ggplot(aes(x = credit_limit_res, y = default, size = size)) +
  geom_point() +
  labs(title = "Default Rate by Debiased Credit Limit", x = "credit_limit", y = "default")+
  theme_minimal()

# run regression with residuals
lm(default ~ credit_limit_res, data = risk_data_deb) |> 
    broom::tidy(conf.int = TRUE)

In this last regression, the coefficient estimated for credit_limit_res should be the same as the coefficient estimated for credit_limit in the initial regression.

Note the difference in the confidence intervals though.

# create the de-noising model
denoising_model <- 
    lm(default ~ ❓, data = risk_data_deb)


risk_data_denoise <-  risk_data_deb |> 
  # add a column with the residuals of the de-noised model 
  # (add the residuals to the mean default to give a nice interpretation to the zero residual)
  dplyr::mutate(default_res = denoising_model$residuals + mean(default))

Now regress the residuals on each other

#

lm(❓, data = risk_data_denoise) |> 
    broom::tidy(conf.int = TRUE)

How do the coefficients and confidence intervals from the FWL steps above compare to the original multivariate regression?

Exercise 7: Causal Modeling

Questions at the end

In this guided exercise, we’ll attempt to answer a causal question: does quitting smoking make you gain weight? Causal modeling has a special place in the history of smoking research: the studies that demonstrated that smoking causes lung cancer were observational. Thanks to other studies, we also know that, if you’re already a smoker, quitting smoking reduces your risk of lung cancer. However, some have observed that former smokers tend to gain weight. Is this the result of quitting smoking, or does something else explain this effect? In the book Causal Inference: What If by Hernán and Robins, the authors analyze this question using several causal inference techniques.

To answer this question, we’ll use causal inference methods to examine the relationship between quitting smoking and gaining weight. First, we’ll draw our assumptions with a causal diagram (a directed acyclic graph, or DAG), which will guide our model. Then, we’ll use a modeling approach called inverse probability weighting–one of many causal modeling techniques–to estimate the causal effect we’re interested in.

We’ll use data from NHEFS to try to estimate the causal effect of quitting smoking on weight game. NHEFS is a longitudinal, observational study that has many of the variables we’ll need. Take a look at causaldata::nhefs_codebook if you want to know more about the variables in this data set. These data are included in the {causaldata} package. We’ll use the causaldata::nhefs_complete data set, but we’ll remove people who were lost to follow-up.

#
nhefs_complete_uc <- causaldata::nhefs_complete |>
  dplyr::filter(censored == 0)
nhefs_complete_uc

# A tibble: 1,566 × 67
    seqn  qsmk death yrdth modth dadth   sbp   dbp sex     age race  income
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <fct>  <dbl>
 1   233     0     0    NA    NA    NA   175    96 0        42 1         19
 2   235     0     0    NA    NA    NA   123    80 0        36 0         18
 3   244     0     0    NA    NA    NA   115    75 1        56 1         15
 4   245     0     1    85     2    14   148    78 0        68 1         15
 5   252     0     0    NA    NA    NA   118    77 0        40 0         18
 6   257     0     0    NA    NA    NA   141    83 1        43 1         11
 7   262     0     0    NA    NA    NA   132    69 1        56 0         19
 8   266     0     0    NA    NA    NA   100    53 1        29 0         22
 9   419     0     1    84    10    13   163    79 0        51 0         18
10   420     0     1    86    10    17   184   106 0        43 0         16
# ℹ 1,556 more rows
# ℹ 55 more variables: marital <dbl>, school <dbl>, education <fct>, ht <dbl>,
#   wt71 <dbl>, wt82 <dbl>, wt82_71 <dbl>, birthplace <dbl>,
#   smokeintensity <dbl>, smkintensity82_71 <dbl>, smokeyrs <dbl>,
#   asthma <dbl>, bronch <dbl>, tb <dbl>, hf <dbl>, hbp <dbl>,
#   pepticulcer <dbl>, colitis <dbl>, hepatitis <dbl>, chroniccough <dbl>,
#   hayfever <dbl>, diabetes <dbl>, polio <dbl>, tumor <dbl>, …

Let’s look at the distribution of weight gain between the two groups.

#
nhefs_complete_uc |>
  ggplot(aes(wt82_71, fill = factor(qsmk))) + 
  geom_vline(xintercept = 0, color = "grey60", linewidth = 1) +
  geom_density(color = "white", alpha = .75, linewidth = .5) +
  ggokabeito::scale_color_okabe_ito(order = c(1, 5)) + 
  theme_minimal() +
  theme(legend.position = "bottom") + 
  labs(
    x = "change in weight (kg)",
    fill = "quit smoking (1 = yes)"
  )

There’s a difference–former smokers do seemed to have gained a bit more weight–but there’s also a lot of variation. Let’s look at the numeric summaries.

# ~2.5 kg gained for quit vs. not quit
nhefs_complete_uc |>
  dplyr::group_by(qsmk) |>
  dplyr::summarize(
    mean_weight_change = mean(wt82_71), 
    sd = sd(wt82_71),
    .groups = "drop"
  )

# A tibble: 2 × 3
   qsmk mean_weight_change    sd
  <dbl>              <dbl> <dbl>
1     0               1.98  7.45
2     1               4.53  8.75

Here, it looks like those who quit smoking gained, on average, 4.5 kg. But is there something else that could explain these results? There are many factors associated with both quitting smoking and gaining weight; could one of those factors explain away the results we’re seeing here?

To truly answer this question, we need to specify a causal diagram based on domain knowledge. For most circumstances, there is no data-driven approach that consistently identify confounders. Only our causal assumptions can help us identify them. Causal diagrams are a visual expression of those assumptions linked to rigorous mathematics that allow us to understand what we need to account for in our model.

In R, we can visualize and analyze our DAGs with the ggdag package. ggdag uses ggplot2 and ggraph to visualize diagrams and dagitty to analyze them. Let’s set up our assumptions. The dagify() function takes formulas, much like lm() and friends, to express assumptions. We have two basic causal structures: the causes of quitting smoking and the causes of gaining weight. Here, we’re assuming that the set of variables here affect both. Additionally, we’re adding qsmk as a cause of wt82_71, which is our causal question; we also identify these as our outcome and exposure. Finally, we’ll add some labels so the diagram is easier to understand. The result is a dagitty object, and we can transform it to a tidy_dagitty data set with tidy_dagitty().

# set up DAG
smk_wt_dag <- ggdag::dagify(
  # specify causes of quitting smoking and weight gain:
  qsmk ~ sex + race + age + education + 
    smokeintensity + smokeyrs + exercise + active + wt71,
  wt82_71 ~ qsmk + sex + race + age + education + 
    smokeintensity + smokeyrs + exercise + active + wt71,
  # specify causal question:
  exposure = "qsmk", 
  outcome = "wt82_71",
  coords = ggdag::time_ordered_coords(),
  # set up labels:
  # here, I'll use the same variable names as the data set, but I'll label them
  # with clearer names
  labels = c(
    # causal question
    "qsmk" = "quit\nsmoking",
    "wt82_71" = "change in\nweight",
    
    # demographics
    "age" = "age",
    "sex" = "sex",
    "race" = "race",
    "education" = "education",
    
    # health
    "wt71" = "baseline\nweight",
    "active" = "daily\nactivity\nlevel",
    "exercise" = "exercise",
    
    # smoking history
    "smokeintensity" = "smoking\nintensity",
    "smokeyrs" = "yrs of\nsmoking"
  )
) |>
  ggdag::tidy_dagitty()

smk_wt_dag

# A DAG with 11 nodes and 19 edges
#
# Exposure: qsmk
# Outcome: wt82_71
#
# A tibble: 20 × 9
   name               x     y direction to       xend  yend circular label      
   <chr>          <int> <int> <fct>     <chr>   <int> <int> <lgl>    <chr>      
 1 active             1    -4 ->        qsmk        2     0 FALSE    "daily\nac…
 2 active             1    -4 ->        wt82_71     3     0 FALSE    "daily\nac…
 3 age                1    -3 ->        qsmk        2     0 FALSE    "age"      
 4 age                1    -3 ->        wt82_71     3     0 FALSE    "age"      
 5 education          1    -2 ->        qsmk        2     0 FALSE    "education"
 6 education          1    -2 ->        wt82_71     3     0 FALSE    "education"
 7 exercise           1    -1 ->        qsmk        2     0 FALSE    "exercise" 
 8 exercise           1    -1 ->        wt82_71     3     0 FALSE    "exercise" 
 9 qsmk               2     0 ->        wt82_71     3     0 FALSE    "quit\nsmo…
10 race               1     0 ->        qsmk        2     0 FALSE    "race"     
11 race               1     0 ->        wt82_71     3     0 FALSE    "race"     
12 sex                1     1 ->        qsmk        2     0 FALSE    "sex"      
13 sex                1     1 ->        wt82_71     3     0 FALSE    "sex"      
14 smokeintensity     1     2 ->        qsmk        2     0 FALSE    "smoking\n…
15 smokeintensity     1     2 ->        wt82_71     3     0 FALSE    "smoking\n…
16 smokeyrs           1     3 ->        qsmk        2     0 FALSE    "yrs of\ns…
17 smokeyrs           1     3 ->        wt82_71     3     0 FALSE    "yrs of\ns…
18 wt71               1     4 ->        qsmk        2     0 FALSE    "baseline\…
19 wt71               1     4 ->        wt82_71     3     0 FALSE    "baseline\…
20 wt82_71            3     0 <NA>      <NA>       NA    NA FALSE    "change in…

Let’s visualize our assumptions with ggdag().

smk_wt_dag |>
  ggdag::ggdag(text = FALSE, use_labels = "label")

What do we need to control for to estimate an unbiased effect of quitting smoking on weight gain? In many DAGs, there will be many sets of variables–called adjustment sets–that will give us the right effect (assuming our DAG is correct–a big, unverifiable assumption!). ggdag_adjustment_set() can help you visualize them. Here, there’s only one adjustment set: we need to control for everything! While we’re add it, since a {ggdag} plot is just a {ggplot2} plot, let’s clean it up a bit, too.

smk_wt_dag |>
  ggdag::ggdag_adjustment_set(text = FALSE, use_labels = "label") +
  ggdag::theme_dag() +
  ggokabeito::scale_color_okabe_ito(order = c(1, 5)) + 
  ggokabeito::scale_fill_okabe_ito(order = c(1, 5))

Let’s fit a model with these variables. Note that we’ll also fit all continuous variables with squared terms, to allow them a bit of flexibility.

lm(
  wt82_71~ qsmk + sex + 
    race + age + I(age^2) + education + 
    smokeintensity + I(smokeintensity^2) + 
    smokeyrs + I(smokeyrs^2) + exercise + active + 
    wt71 + I(wt71^2), 
  data = nhefs_complete_uc
) |>
  broom::tidy(conf.int = TRUE) |>
  dplyr::filter(term == "qsmk")

# A tibble: 1 × 7
  term  estimate std.error statistic  p.value conf.low conf.high
  <chr>    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 qsmk      3.46     0.438      7.90 5.36e-15     2.60      4.32

When we adjust for the variables in our DAG, we get an estimate of about 3.5 kg–people who quit smoking gained about this amount of weight. However, we are trying to answer a specific causal question: how much weight would a person gain if the quit smoking vs. if the same person did not quit smoking? Let’s use an inverse probability weighting model to try to estimate that effect at the population level (what if everyone quit smoking vs what if no one quit smoking).

For a simple IPW model, we have two modeling steps. First, we fit a propensity score model, which predicts the probability that you received a treatment or exposure (here, that a participant quit smoking). We use this model to calculate inverse probability weights–1 / your probability of treatment. Then, in the second step, we use this weights in the outcome model, which estimates the effect of exposure on the outcome (here, the effect of quitting smoking on gaining weight).

For the propensity score model, we’ll use logistic regression (since quitting smoking is a binary variable). The outcome is quitting smoking, and the variables in the model are all those included in our adjustment set. Then, we’ll use augment() from {broom} (which calls predict() on the inside) to calculate our weights using propensity::wt_ate() and save it back into our data set.

propensity_model <- glm(
  qsmk ~ sex + 
    race + age + I(age^2) + education + 
    smokeintensity + I(smokeintensity^2) + 
    smokeyrs + I(smokeyrs^2) + exercise + active + 
    wt71 + I(wt71^2), 
  family = binomial(), 
  data = nhefs_complete_uc
)

# calculate weights (IPW)
nhefs_complete_uc <- propensity_model |>
  # predict whether quit smoking
  broom::augment(type.predict = "response", data = nhefs_complete_uc) |>
  # calculate inverse probability
  dplyr::mutate(wts = propensity::wt_ate(.fitted, qsmk))

ℹ Treating `.exposure` as binary

# select the columns needed
nhefs_complete_uc |>
  dplyr::select(qsmk, .fitted, wts)

# A tibble: 1,566 × 3
    qsmk .fitted        wts
   <dbl>   <dbl> <psw{ate}>
 1     0  0.0987   1.109530
 2     0  0.140    1.163233
 3     0  0.126    1.143679
 4     0  0.400    1.666994
 5     0  0.294    1.415590
 6     0  0.170    1.204166
 7     0  0.220    1.282511
 8     0  0.345    1.527715
 9     0  0.283    1.395508
10     0  0.265    1.360346
# ℹ 1,556 more rows

Let’s look at the distribution of the weights.

ggplot(nhefs_complete_uc, aes(wts)) +
  geom_histogram(color = "white", fill = "#E69F00", bins = 50) + 
  #  use a log scale for the x axis
  scale_x_log10() + 
  theme_minimal(base_size = 20) + 
  xlab("Weights")

It looks a little skewed, in particular there are some participants with much higher weights. There are a few techniques for dealing with this–trimming weights and stabilizing weights–but we’ll keep it simple for now and just use them as is.

The main goal here is to break the non-causal associations between quitting smoking and gaining weight–the other paths that might distort our results. In other words, if we succeed, there should be no differences in these variables between our two groups, those who quit smoking and those who didn’t. This is where randomized trials shine; you can often assume that there is no baseline differences among potential confounders between your treatment groups (of course, no study is perfect, and there’s a whole set of literature on dealing with this problem in randomized trials).

Standardized mean differences (SMD) are a simple measurement of differences that work across variable types. In general, the closer to 0 we are, the better job we have done eliminating the non-causal relationships we drew in our DAG. Note that low SMDs for everything we adjust for does not mean that there is not something else that might confound our study. Unmeasured confounders or misspecified DAGs can still distort our effects, even if our SMDs look great!

We’ll use the {halfmoon} package to calculate the SMDs, then visualize them.

#
# identify the variables
vars <- c(
  "sex", "race", "age", "education", 
  "smokeintensity", "smokeyrs", 
  "exercise", "active", "wt71"
)

# form data frame with data from smd
plot_df <- halfmoon::tidy_smd(
    nhefs_complete_uc,
    all_of(vars),
    qsmk,
    wts
)

#| plot
ggplot(
    data = plot_df,
    mapping = aes(x = abs(smd), y = variable, group = method, color = method)
) + halfmoon::geom_love()

These look pretty good! Some variables are better than others, but weighting appears to have done a much better job eliminating these differences than an unadjusted analysis.

We can also use halfmoon’s geom_mirror_histogram() to visualize the impact that the weights are having on our population.

nhefs_complete_uc |>
  dplyr::mutate(qsmk = factor(qsmk)) |>
  ggplot(aes(.fitted)) +
  halfmoon::geom_mirror_histogram(
    aes(group = qsmk),
    bins = 50
  ) +
  halfmoon::geom_mirror_histogram(
    aes(fill = qsmk, weight = wts),
    bins = 50,
    alpha = .5
  ) +
  scale_y_continuous(labels = abs) +
  labs(x = "propensity score") + 
  theme_minimal(base_size = 20)

Both groups are being upweighted so that their distributions of propensity scores are much more similar.

We could do more here to analyze our assumptions, but let’s move on to our second step: fitting the outcome model weighted by our inverse probabilities. Some researchers call these Marginal Structural Models, in part because the model is marginal; we only need to include our outcome (wt82_71) and exposure (qsmk). The other variables aren’t in the model; they are accounted for with the IPWs!

#
ipw_model <- lm(
  wt82_71 ~ qsmk, 
  data = nhefs_complete_uc, 
  weights = wts # inverse probability weights
) 

# pull the estimate
ipw_estimate <- ipw_model |>
  broom::tidy(conf.int = TRUE) |>
  dplyr::filter(term == "qsmk")

# show the estimate
ipw_estimate

# A tibble: 1 × 7
  term  estimate std.error statistic  p.value conf.low conf.high
  <chr>    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 qsmk      3.44     0.408      8.43 7.47e-17     2.64      4.24

This estimate is pretty similar to what we saw before, if a little smaller. In fact, for simple causal questions, this is often the case: adjusting for confounders directly in your regression model sometimes estimates the same effect as IPWs and other causal techniques. Causal techniques are special, though, in that the use counterfactual modeling, which allows you to deal with many circumstances, such as when you have selection bias or time-dependendent confounding. They also often have variance properties.

But we have other problem that we need to address. While we’re just using lm() to estimate our IPW model, it doesn’t properly account for the weights. That means our standard error is too small, which will artificially narrow confidence intervals and artificially shrink p-values. There are many ways to address this, including robust estimators. We’ll focus on using the bootstrap via the {rsamples} package in this workshop, but here’s one way to do it with robust standard errors:

# also see robustbase, survey, gee, and others
library(estimatr, quietly=TRUE)
ipw_model_robust <- lm_robust( 
  wt82_71 ~ qsmk, 
  data = nhefs_complete_uc, 
  weights = wts 
) 

# pull the estimate
ipw_estimate_robust <- ipw_model_robust |>
  broom::tidy(conf.int = TRUE) |>
  dplyr::filter(term == "qsmk")

# show the estimate
ipw_estimate_robust

  term estimate std.error statistic      p.value conf.low conf.high   df
1 qsmk 3.440535 0.5264638  6.535179 8.573524e-11 2.407886  4.473185 1564
  outcome
1 wt82_71

Now let’s try the bootstrap. First, we need to wrap our model in a function so we can call it many times on our bootstrapped data. A function like this might be your instinct; however, it’s not quite right.

# fit ipw model for a single bootstrap sample
fit_ipw_not_quite_rightly <- function(split, ...) {
  # get bootstrapped data sample with `rsample::analysis()`
  .df <- rsample::analysis(split)
  
  # fit ipw model
  lm(wt82_71 ~ qsmk, data = .df, weights = wts) |>
    tidy()
}

The problem is that we need to account for the entire modeling process, so we need to include the first step of our analysis – fitting the inverse probability weights.

#
fit_ipw <- function(split, ...) {
  .df <- rsample::analysis(split)
  
  # fit propensity score model
  propensity_model <- glm(
    qsmk ~ sex + 
      race + age + I(age^2) + education + 
      smokeintensity + I(smokeintensity^2) + 
      smokeyrs + I(smokeyrs^2) + exercise + active + 
      wt71 + I(wt71^2), 
    family = binomial(), 
    data = .df
  )
  
  # calculate inverse probability weights
  .df <- propensity_model |>
    broom::augment(type.predict = "response", data = .df) |>
    dplyr::mutate(wts = propensity::wt_ate(
      .fitted,
      qsmk, 
      exposure_type = "binary"
    ))
  
  # fit correctly bootstrapped ipw model
  lm(wt82_71 ~ qsmk, data = .df, weights = wts) |>
    tidy()
}

rsample makes the rest easy for us: bootstraps() resamples our data 1000 times, then we can use purrr::map() to apply our function to each resampled set (splits). rsample’s int_*() functions help us get confidence intervals for our estimate.

#
# fit ipw model to bootstrapped samples | THIS MAY TAKE SOME TIME
ipw_results <- rsample::bootstraps(causaldata::nhefs_complete, 1000, apparent = TRUE) |>
  dplyr::mutate(results = purrr::map(splits, fit_ipw))

# get t-statistic-based CIs
boot_estimate <- rsample::int_t(ipw_results, results) |>
  dplyr::filter(term == "qsmk")

# show estimate
boot_estimate

# A tibble: 1 × 6
  term  .lower .estimate .upper .alpha .method  
  <chr>  <dbl>     <dbl>  <dbl>  <dbl> <chr>    
1 qsmk    2.47      3.47   4.45   0.05 student-t

Let’s compare to our naive weighted model that just used a single estimate from lm()

#
dplyr::bind_rows(
  ipw_estimate |>
    dplyr::select(estimate, conf.low, conf.high) |>
    dplyr::mutate(type = "ols"),
  ipw_estimate_robust |>
    dplyr::select(estimate, conf.low, conf.high) |>
    dplyr::mutate(type = "robust"),
  boot_estimate |>
    dplyr::select(estimate = .estimate, conf.low = .lower, conf.high = .upper) |>
    dplyr::mutate(type = "bootstrap")
) |>
  #  calculate CI width to sort by it
  dplyr::mutate(width = conf.high - conf.low) |>
  dplyr::arrange(width) |>
  #  fix the order of the model types for the plot  
  dplyr::mutate(type = forcats::fct_inorder(type)) |>
  ggplot(aes(x = type, y = estimate, ymin = conf.low, ymax = conf.high)) + 
    geom_pointrange(color = "#0172B1", size = 1, fatten = 3) +
    coord_flip() +
    theme_minimal(base_size = 20) +
    theme(axis.title.y = element_blank())

Our bootstrapped confidence intervals are wider, which is expected; remember that they were artificially narrow in the naive OLS model!

So, we have a final estimate for our causal effect: on average, a person who quits smoking will gain 3.5 kg (95% CI 2.4 kg, 4.4 kg) versus if they had not quit smoking.

Questions:

Please enumerate the steps in the causal analysis workflow
What do you think? Is this estimate reliable? Did we do a good job addressing the assumptions we need to make for a causal effect, particularly that there is no confounding? How might you criticize this model, and what would you do differently?

YOUR ANSWER Q7:

causal workflow steps: _____
your critique of the results of the exercise: ______

Exercise 8: Causal Modeling

An e-commerce company is studying the effect of personalized email campaigns (D) on monthly purchase amount (Y). They have data on 5,000 customers, where 2,000 received personalized emails (D=1) and 3,000 received standard emails (D=0).

Observed data:

Mean monthly purchase in personalized group: E[Y|D=1] = $120
Mean monthly purchase in standard group: E[Y|D=0] = $85
The company suspects that high-value customers were more likely to receive personalized emails

Questions:

Define the following parameters using Rubin’s notation:
- Average Treatment Effect (ATE)
- Average Treatment Effect on the Treated (ATT)
- Average Treatment Effect on the Untreated (ATU)
What is the observed difference in means? What does this estimate capture?
If the true ATE = $25, what is the selection bias? Show your calculation.
Explain why we cannot directly calculate ATT and ATU from the observed data alone.
Under what assumption would the observed difference in means equal the ATE? Is this assumption likely to hold in this marketing context?

YOUR ANSWER Q8:

Define the following parameters using Rubin’s notation:
- Average Treatment Effect (ATE): _____
- Average Treatment Effect on the Treated (ATT): _____
- Average Treatment Effect on the Untreated (ATU): _____
What is the observed difference in means❓ _____
- What does this estimate capture❓ _____
If the true ATE = $25, what is the selection bias❓ _____
- Show your calculation.
Explain why we cannot directly calculate ATT and ATU from the observed data alone.
Under what assumption would the observed difference in means equal the ATE❓ Is this assumption likely to hold in this marketing context❓

Exercise 9: Instrumental Variables

The estimator for instrumental variable $Z$ is conceptually two regressions, one the “reduced form” and the other the “first stage”

$\begin{align*} Y_{i} & =\kappa+\rho Z_{i}+\nu_{i}\;\text{the "reduced form"}\\ D_{i} & =\mu+\pi Z_{i}+\eta_{i}\;\text{the "first stage"} \end{align*}$

where $\nu_1,\eta_i$ are noise terms and $\kappa,\mu$ are intercepts, and our estimate is $\beta^{IV}=\rho/\pi$ .

Show that $\beta^{IV}=\frac{\text{cov}\left(Z_i,Y_i\right)}{\text{cov}\left(Z_i,D_i\right)}$

YOUR ANSWER Q9:

Please show all your work

Exercise 10: Instrumental Variables in Action

An elasticity measures the percentage change in one variable (outcome) in response to a percentage change in another variable (treatment), typically written as: $\eta = \frac{\partial{Y}/Y}{\partial{X}/X}=\frac{d\log(Y)/dY}{d\log(X)/dx}$

Common elasticities include the price elasticity of demand, but in regressing quantity on price it’s very common that price and quantity are simultaneously determined, which leads to biased elasticity estimates (think of the DAG for this problem).

The following code generates synthetic e-commerce data with an elasticity of demand equal -1.5 on total price, which includes list price plus shipping cost.

Here you will use shipping costs (shipping_cost) as the instrument (IV), noting that it has variation (variation in shipping costs across regions and time) and shipping costs don’t directly affect demand (it affects demand only through total price).

generate realistic looking ecommerce data

# Set seed for reproducibility
set.seed(8740)

# Generate synthetic e-commerce data
generate_ecommerce_data <- function(n = 5000) {
  # Create panel structure
  regions <- c("Northeast", "Southeast", "Midwest", "West", "Southwest")
  months <- 1:24
  
  # Expand grid for panel
  data <- tidyr::expand_grid(
    region = regions,
    month = months,
    customer_id = 1:200  # 200 customers per region-month
  ) |>
    dplyr::slice_sample(n = n) |>
    dplyr::mutate(
      # Geographic factors affecting shipping
      distance_to_hub = dplyr::case_when(
        region == "Northeast" ~ 500,
        region == "Southeast" ~ 800,
        region == "Midwest" ~ 300,
        region == "West" ~ 1200,
        region == "Southwest" ~ 900
      ),
      
      # Time-varying factors
      fuel_price = 3 + 0.5 * sin(2 * pi * month / 12) + rnorm(dplyr::n(), 0, 0.2),
      seasonal_demand = 1 + 0.3 * cos(2 * pi * month / 12),
      
      # Unobserved demand shock (creates endogeneity)
      demand_shock = rnorm(dplyr::n(), 0, 0.3),
      
      # Shipping cost (our instrument)
      shipping_cost = 5 + 0.002 * distance_to_hub + 2 * fuel_price + rnorm(dplyr::n(), 0, 1),
      
      # Product price (endogenous - responds to demand)
      base_price = 50,
      price = base_price + 5 * seasonal_demand + 3 * demand_shock + 
        0.5 * shipping_cost + rnorm(dplyr::n(), 0, 2),
      
      # Total price paid by consumer
      total_price = price + shipping_cost,
      
      # Quantity demanded (true elasticity = -1.5)
      log_quantity = 4 - 1.5 * log(total_price) + 0.5 * seasonal_demand + 
        demand_shock + rnorm(dplyr::n(), 0, 0.2),
      quantity = exp(log_quantity),
      
      # Revenue
      revenue = price * quantity,
      
      # Add customer characteristics
      income = exp(rnorm(dplyr::n(), 10.5, 0.3)),
      age = sample(18:65, dplyr::n(), replace = TRUE)
    ) |>
    # Create some missing values to make realistic
    dplyr::mutate(
      price = ifelse(runif(dplyr::n()) < 0.02, NA, price),
      quantity = ifelse(quantity < 0.1, 0.1, quantity)  # No negative quantities
    ) |>
    dplyr::filter(!is.na(price))
  
  return(data)
}

# Generate the dataset
econ_data <- generate_ecommerce_data(n = 5000)

Here are some visualizations of our data. Note that the slope of demand vs total price is +ve instead of -ve.

Plots of demand vs total price and total price vs shipping costs

library(patchwork)
# Visualize price and quantity relationship
p1 <- ggplot(econ_data, aes(x = log(total_price), y = log(quantity))) +
  geom_point(alpha = 0.5, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(title = "Raw Price-Quantity Relationship",
       subtitle = "This relationship is biased due to endogeneity",
       x = "Log(Total Price)", y = "Log(Quantity)") +
  theme_minimal()

# Visualize instrument relevance
p2 <- ggplot(econ_data, aes(x = shipping_cost, y = total_price)) +
  geom_point(alpha = 0.5, color = "darkgreen") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(title = "Instrument Relevance: Shipping Cost vs Total Price",
       subtitle = "First stage relationship",
       x = "Shipping Cost", y = "Total Price") +
  theme_minimal()

p1 + p2

Complete the following steps in the analysis:

Use econ_data to estimate the demand elasticity of total price with a naive model, using log(quantity) ~ log(total_price) and covariates region and month (as factors), income (as log) and age.
Use econ_data to model the “reduced form” stage by regressing log(quantity) with shipping_cost.
Use econ_data to model the “first stage” by regressing log(total_price) on shipping_cost

YOUR ANSWER Q10:

(STEP 1) Use econ_data to estimate the demand elasticity of total price with a naive model, using log(quantity) ~ log(total_price) and covariates region and month (as factors), income (as log) and age.

Use broom::tidy to report the coefficient of log(price) (i.e. the elasticity)

#
# regress log demand on log total price
naive_model <- ❓
  

# pull the coefficient of price as the naive elasticity estimate
naive_results <- ❓

(STEP 2) use econ_data to model the “reduced form” stage by regressing log(quantity) on shipping_cost.

Use broom::tidy to report the coefficient of shipping_cost (i.e. effect of shipping cost on demand)

#
# regress log demand on shipping price
reduced_form_model <- ❓
  
# pull the coefficient of shipping cost 
reduced_form_results <- ❓

(STEP 3) use econ_data to model the “first stage” by regressing log(total_price) on shipping_cost

Use broom::tidy to report the coefficient of shipping_cost (i.e. effect of shipping cost on log(total_price))

#
# regress log(total price)
first_stage_model <- 

# pull the coefficient of shipping cost in the first stage model     
first_stage_results <-

(STEP 4) Estimate the true total_price elasticity

#

# estimate true elasticity
IV_elasticity_est <-

You’re done and ready to submit your work! Save, stage, commit, and push all remaining changes. You can use the commit message “Done with Lab 7!” , and make sure you have committed and pushed all changed files to GitHub (your Git pane in RStudio should be empty) and that all documents are updated in your repo on GitHub.

Submission

I will pull (copy) everyone’s repository submissions at 5:00pm on the Sunday following class, and I will work only with these copies, so anything submitted after 5:00pm will not be graded. (don’t forget to commit and then push your work!)

Grading

Total points available: 30 points.