Causality Part 2

BSMM8740-2-R-2025F [WEEK - 8]

L.L. Odette

Recap of last week

  • Last week we introduced DAGs as a method to represent our causal assumptions and infer how we might estimate the causal effects of interest.
  • We briefly discussed regression adjustment and instrumental variables
  • We worked through a method to estimate causal effects using IPW.

This week

We will look at other methods used to estimate causal effects, along with methods used for special situations, including:

  • regression adjustment
  • doubly robust estimation
  • matching
  • difference in differences
  • fixed effects and methods for panel data
  • g-methods

Causal Effect Estimation

Inverse Probability Weighting

Last week we used IPW to create a pseudopopulation where, for every confounder level, the numbers of treated and untreated were balanced.

IPW requires us to build a model to predict the treatment, depending on the confounders (assuming we have data for all the confounders).

Inverse Probability Weighting

Repeating last week’s example with our prior process: first we calculate the IPW weights, then estimate the ATE using those weights.

ATE estimate by IPW
dat_ <- causalworkshop::net_data |> dplyr::mutate(net = as.numeric(net))

propensity_model <- glm(
  net ~ income + health + temperature
  , data = dat_, family = binomial()
)

net_data_wts <- propensity_model |>
  broom::augment(newdata = dat_, type.predict = "response") |>
  # .fitted is the value predicted by the model
  # for a given observation
  dplyr::mutate(wts = propensity::wt_ate(.fitted, .exposure = net))

net_data_wts |>
  lm(malaria_risk ~ net, data = _, weights = wts) |>
  broom::tidy(conf.int = TRUE) |> 
  gt::gt() |> 
  gtExtras::gt_theme_espn() |> 
  gt::as_raw_html()
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 42.74 0.4420 96.69 0.000e+00 41.87 43.61
net -12.54 0.6243 -20.10 5.498e-81 -13.77 -11.32

Inverse Probability Weighting

The function propensity::wt_ate calculates unstabilized weights by default. In the last lecture we looked at the distribution of weights and decided they were stable enough, because none of the weights were too big or too small.

Let’s explore the question of IPW stabilization a bit more.

Figure 1

Inverse Probability Weighting

Stabilized weights

  • The IP weights (given covariate set \(L\)) are \(W^A=1/\mathbb{E}[D|L]\) for treated units and \(1/(1-\mathbb{E}[D|L])\) for untreated units. Because these denominators can be very close to \(0\), the weights can become very large and estimates that use them can be unstable.

  • Stabilized weights are often used to address this, e.g. the function propensity::wt_ate with stabilize = TRUE multiplies the weights by the mean of the treatment, so in this case \(SW^A=\mathbb{E}[D]/\mathbb{E}[D|L]\); a minimal sketch follows below.
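As a quick illustration, here is a minimal sketch that recomputes the ATE with stabilized weights, reusing propensity_model and dat_ from the earlier chunk (net_data_stab is a name introduced here; stabilize = TRUE is the only change):

stabilized weights sketch
net_data_stab <- propensity_model |>
  broom::augment(newdata = dat_, type.predict = "response") |>
  dplyr::mutate(swts = propensity::wt_ate(.fitted, .exposure = net, stabilize = TRUE))

lm(malaria_risk ~ net, data = net_data_stab, weights = swts) |>
  broom::tidy(conf.int = TRUE)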

Inverse Probability Weighting

V-Stabilized weights

Baseline covariates (\(V\subset L\)) are also used to stabilize IP weights: \(SW^A(V)=\mathbb{E}[D|V]/\mathbb{E}[D|L]\).

  • Note that the variables \(V\) need to be included in both the numerator and the denominator (as part of \(L\))
  • V-stabilization results in IP weights that are more stable than weights stabilized without \(V\); a sketch follows below.
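Here is a minimal sketch of V-stabilized weights computed by hand for our example, reusing dat_ from the earlier chunk. Choosing health as the baseline covariate \(V\) is purely illustrative, and num_mod, den_mod and dat_v_stab are names introduced here:

V-stabilized weights sketch
# numerator model uses only V; denominator model uses the full confounder set L
num_mod <- glm(net ~ health, data = dat_, family = binomial())
den_mod <- glm(net ~ income + health + temperature, data = dat_, family = binomial())

dat_v_stab <- dat_ |>
  dplyr::mutate(
    p_num = predict(num_mod, type = "response"),
    p_den = predict(den_mod, type = "response"),
    # SW^A(V) = f(D|V) / f(D|L), evaluated at the observed treatment
    swt_v = ifelse(net == 1, p_num / p_den, (1 - p_num) / (1 - p_den))
  )

summary(dat_v_stab$swt_v)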

Inverse Probability Weighting

We know that IPW weighting should create a pseudo-population that makes the covariates more balanced by treatment. In the last lecture we checked this using histograms. Here we check this directly via the statistics of the data:

unweighted table composition
gtsummary::tbl_summary(
  net_data_wts 
  , by = net
  , include = c(net,income,health,temperature,malaria_risk)
) |>
  # add an overall column to the table
  gtsummary::add_overall(last = TRUE)
Characteristic   net = 0 (N = 1,298)   net = 1 (N = 454)     Overall (N = 1,752)
income           880 (751, 1,010)      954 (806, 1,081)      893 (765, 1,031)
health           49 (35, 61)           54 (40, 68)           50 (37, 64)
temperature      23.9 (21.0, 27.1)     23.5 (20.6, 26.9)     23.8 (20.9, 27.1)
malaria_risk     41 (32, 54)           26 (20, 32)           36 (28, 50)
Values are Median (Q1, Q3)
weighted table composition
net_svy <- survey::svydesign(
  ids = ~1,
  data = net_data_wts,
  weights = ~wts
)

gtsummary::tbl_svysummary(
  net_svy,
  by = net,
  include = c(net,income,health,temperature,malaria_risk)
) |>
  gtsummary::add_overall(last = TRUE)
Characteristic   net = 0 (N = 1,751)   net = 1 (N = 1,760)   Overall (N = 3,511)
income           892 (767, 1,025)      892 (755, 1,039)      892 (761, 1,031)
health           50 (36, 63)           49 (37, 64)           50 (37, 64)
temperature      23.9 (20.9, 27.1)     23.8 (20.8, 27.3)     23.8 (20.8, 27.2)
malaria_risk     40 (31, 53)           28 (23, 34)           33 (26, 45)
Values are Median (Q1, Q3)
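Beyond the summary tables, we can check balance directly by comparing raw and weighted covariate means by treatment group. A minimal sketch using net_data_wts from the earlier chunk (the across() call and column naming are just one way to lay this out):

raw vs weighted covariate means
net_data_wts |>
  dplyr::group_by(net) |>
  dplyr::summarise(
    dplyr::across(
      c(income, health, temperature),
      list(raw = mean, weighted = ~ weighted.mean(.x, w = wts))
    )
  )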

Inverse Probability Weighting

Next we use bootstrapping to generate proper confidence intervals for the effect, first by creating a function to generate (unstabilized) IPW effect estimates for each bootstrap split:

fit_ipw <- function(split, ...) {
  # get bootstrapped data sample with `rsample::analysis()`
  if("rsplit" %in% class(split)){
    .df <- rsample::analysis(split)
  }else if("data.frame" %in% class(split)){
    .df <- split
  }

  # fit propensity score model
  propensity_model <- glm(
    net ~ income + health + temperature,
    data = .df,
    family = binomial()
  )

  # calculate inverse probability weights
  .df <- propensity_model |>
    broom::augment(type.predict = "response", data = .df) |>
    dplyr::mutate(wts = propensity::wt_ate(.fitted, net))

  # fit correctly bootstrapped ipw model
  lm(malaria_risk ~ net, data = .df, weights = wts) |>
    broom::tidy()
}

Inverse Probability Weighting

Next we use our function to bootstrap proper confidence intervals:

# create bootstrap samples
bootstrapped_net_data <- rsample::bootstraps(
  dat_,
  times = 1000,
  # required to calculate CIs later
  apparent = TRUE
)

# create ipw and fit each bootstrap sample
results <- bootstrapped_net_data |>
  dplyr::mutate(
    ipw_fits = purrr::map(splits, fit_ipw))
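With the bootstrap fits in hand, one way to summarize them is rsample’s percentile-interval helper; a minimal sketch (we also compute quantile-based intervals directly in the comparisons below):

percentile CI from the bootstrap fits
results |>
  rsample::int_pctl(ipw_fits) |>
  dplyr::filter(term == "net")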

Regression Adjustment

IPW estimates rely on a model to predict the treatment using covariate/confounder values. We know that we can also estimate the effect of treatment by building a model to predict the outcome: regressing the outcome on the treatment and the covariate/confounder values.

regression adjustment
outcome_model <- glm(
  malaria_risk ~ net + income + health + temperature + insecticide_resistance +
    I(health^2) + I(temperature^2) + I(income^2),
  data = dat_
)

outcome_model |> broom::tidy(conf.int = TRUE)
# A tibble: 9 × 7
  term  estimate std.error statistic   p.value conf.low
  <chr>    <dbl>     <dbl>     <dbl>     <dbl>    <dbl>
1 (Int…  1.08e+2   3.55e+0    30.5   1.89e-164  1.01e+2
2 net   -1.24e+1   2.30e-1   -53.8   0         -1.28e+1
3 inco… -1.65e-1   4.58e-3   -36.1   3.48e-213 -1.74e-1
4 heal…  2.01e-1   2.82e-2     7.13  1.44e- 12  1.46e-1
5 temp…  7.69e-1   2.62e-1     2.93  3.42e-  3  2.55e-1
6 inse…  2.19e-1   6.89e-3    31.8   4.97e-175  2.05e-1
7 I(he… -6.60e-4   2.65e-4    -2.49  1.28e-  2 -1.18e-3
8 I(te…  5.01e-3   5.46e-3     0.919 3.58e-  1 -5.68e-3
9 I(in…  5.02e-5   2.50e-6    20.1   1.04e- 80  4.53e-5
# ℹ 1 more variable: conf.high <dbl>

Regression Adjustment

And we can also bootstrap the regression adjustment estimates to get confidence intervals, first by creating the estimation function, then generating bootstrapped estimates, like we did with IPWs:

fit_reg <- function(split, ...) {
  # get bootstrapped data sample with `rsample::analysis()`
  if("rsplit" %in% class(split)){
    .df <- rsample::analysis(split)
  }else if("data.frame" %in% class(split)){
    .df <- split
  }

  # fit outcome model
  glm(malaria_risk ~ net + income + health + temperature + insecticide_resistance +
      I(health^2) + I(temperature^2) + I(income^2), data = .df
    )|>
    broom::tidy()
}

both_results <- results |>
  dplyr::mutate(
    reg_fits = purrr::map(splits, fit_reg))

IPW vs Regression Adjustment

We can compare the results:

both_results_dat <- both_results |>
  dplyr::mutate(
    reg_estimate = purrr::map_dbl(
      reg_fits,
      # pull the `estimate` for net for each fit
      \(.fit) .fit |>
        dplyr::filter(term == "net") |>
        dplyr::pull(estimate)
    )
    , ipw_estimate = purrr::map_dbl(
      ipw_fits,
      # pull the `estimate` for net for each fit
      \(.fit) .fit |>
        dplyr::filter(term == "net") |>
        dplyr::pull(estimate)
    )
  ) 

IPW vs Regression Adjustment

We can plot the results:

Code
dat <- both_results_dat |> 
  dplyr::select(reg_estimate, ipw_estimate) |> 
  tidyr::pivot_longer(cols=everything(), names_to = "method", values_to = "effect estimate")

dat |>
  dplyr::mutate(method=factor(method)) |>
  ggplot(
    aes(
      `effect estimate`, after_stat(density), colour = method, fill = method
      )
  ) +
  geom_histogram(bins = 50, alpha = 0.2, position = 'identity') +
  geom_density(alpha = 0.2) +
  theme(legend.text=element_text(size=10), legend.title = element_blank())

IPW vs Regression Adjustment

We can compare the means and the confidence intervals of the regression adjustment and IPW effect estimates:

compare IPW and regression effect estimates
dat |> 
  dplyr::group_by(method) |> 
  dplyr::summarise(
  mean = mean(`effect estimate`)
  , lower_ci = quantile(`effect estimate`, 0.025)
  , upper_ci = quantile(`effect estimate`, 0.975)
) |> gt::gt() |> gtExtras::gt_theme_espn() |> gt::as_raw_html()
method mean lower_ci upper_ci
ipw_estimate -12.53 -13.29 -11.69
reg_estimate -12.37 -12.91 -11.88

Note that each method’s mean ATE estimate lies within the other method’s confidence interval.

IPW vs Regression Adjustment

There is a package (boot) that performs the bootstrapping and the CI estimation at the same time:

use the boot package for CI estimation of regression adjustment
# bootstrap (1000 times) using the fit_reg function
boot_out_reg <- boot::boot(
  data = causalworkshop::net_data |> dplyr::mutate(net = as.numeric(net))
  , R = 1000
  , sim = "ordinary"
  , statistic =
    (\(x,y){ # x is the data, y is a vector of row numbers for the bootstrap sample
      fit_reg(x[y,]) |>
        dplyr::filter(term == "net") |>
        dplyr::pull(estimate)
    })
)
# calculate the CIs
CIs <- boot_out_reg |>
  boot::boot.ci(L = boot::empinf(boot_out_reg, index=1L, type="jack"))

tibble::tibble(
  CI = c("lower", "upper")
  , normal = CIs$normal[-1]
  , basic = CIs$basic[-(1:3)]
  , percent = CIs$percent[-(1:3)]) |> 
  gt::gt() |> gtExtras::gt_theme_espn() |> gt::as_raw_html()
CI normal basic percent
lower -12.89 -12.92 -12.90
upper -11.83 -11.85 -11.83

Doubly Robust Estimation

Don’t Put All your Eggs in One Basket

  • We’ve learned how to use linear regression and propensity score weighting to estimate \(E[Y|X, D=1] - E[Y|X, D=0]\). But which one should we use, and when?

  • When in doubt, just use both! Doubly Robust Estimation combines the propensity score and outcome regression models so that you don’t have to choose only one of them.

Doubly Robust Estimation

Doubly Robust Estimation

The estimator is as follows.

\[ \hat{ATE} = \frac{1}{N}\sum \bigg( \dfrac{D_i(Y_i - \hat{\mu_1}(X_i))}{\hat{P}(X_i)} + \hat{\mu_1}(X_i) \bigg) - \frac{1}{N}\sum \bigg( \dfrac{(1-D_i)(Y_i - \hat{\mu_0}(X_i))}{1-\hat{P}(X_i)} + \hat{\mu_0}(X_i) \bigg) \]

where

  • \(\hat{P}(x)\) is an estimation of the propensity score (using logistic regression, for example),

  • \(\hat{\mu_1}(x)\) is an estimation of \(E[Y|X, D=1]\) (using linear regression, for example).

The first part of the doubly robust estimator estimates \(E[Y^1]\) and the second part estimates \(E[Y^0]\).

Doubly Robust Estimation

First let’s examine the code. Note that we use a recipe to add the (orthogonal) polynomial terms for each confounder before fitting the models.

D <- "net"
Y <- "malaria_risk"
X <- paste0(c('income', 'health', 'temperature'),c(rep('_poly_1',3),rep('_poly_2',3)))

doubly_robust_rec <- causalworkshop::net_data |> 
  dplyr::mutate(net = as.numeric(net)) |> 
  recipes::recipe(malaria_risk ~ net + income + health + temperature) |> 
  # NOTE: these are orthogonal polynomial terms (not used in regression adjustment earlier)
  recipes::step_poly(income, health, temperature) |> 
  recipes::prep() 

doubly_robust_dat <- doubly_robust_rec |> recipes::bake(new_data=NULL)
doubly_robust <- function(df, X, D, Y){
  ps <- # propensity score
    as.formula(paste(D, " ~ ", paste(X, collapse= "+"))) |>
    stats::glm( data = df, family = binomial() ) |>
    broom::augment(type.predict = "response", data = df) |>
    dplyr::pull(.fitted)
  
  lin_frml <- formula(paste(Y, " ~ ", paste(X, collapse= "+")))
  
  idx <- df[,D] |> dplyr::pull(1) == 0
  mu0 <- # mean response D == 0
    lm(lin_frml, data = df[idx,]) |> 
    broom::augment(type.predict = "response", newdata = df[,X]) |>
    dplyr::pull(.fitted)
  
  idx <- df[,D] |> dplyr::pull(1) == 1
  mu1 <- # mean response D == 1
    lm(lin_frml, data = df[idx,]) |>  
    broom::augment(type.predict = "response", newdata = df[,X]) |> 
    dplyr::pull(.fitted)
  
  # convert treatment factor to integer | recast as vectors
  d <- df[,D] |> dplyr::pull(1) |> as.character() |> as.numeric()
  y <- df[,Y] |> dplyr::pull(1)
  
  mean( d*(y - mu1)/ps + mu1 ) -
    mean(( 1-d)*(y - mu0)/(1-ps) + mu0 )
}
doubly_robust_dat |> 
  doubly_robust(X, D, Y)
[1] -12.9

Doubly Robust Estimation

Once again, we can use bootstrap to construct confidence intervals.

use bootstrap fold to estimate using DRE
all_results_dat <- both_results_dat |>
  dplyr::mutate(
    dre_estimate = 
      purrr::map_dbl(
        splits
        , (\(x){
            doubly_robust_rec |> 
              recipes::bake(new_data = rsample::analysis(x)) |> 
              doubly_robust(X, D, Y)
        })
      )
  )

all_results_dat
# Bootstrap sampling with apparent sample 
# A tibble: 1,001 × 7
   splits             id            ipw_fits reg_fits
   <list>             <chr>         <list>   <list>  
 1 <split [1752/658]> Bootstrap0001 <tibble> <tibble>
 2 <split [1752/614]> Bootstrap0002 <tibble> <tibble>
 3 <split [1752/657]> Bootstrap0003 <tibble> <tibble>
 4 <split [1752/630]> Bootstrap0004 <tibble> <tibble>
 5 <split [1752/636]> Bootstrap0005 <tibble> <tibble>
 6 <split [1752/647]> Bootstrap0006 <tibble> <tibble>
 7 <split [1752/666]> Bootstrap0007 <tibble> <tibble>
 8 <split [1752/626]> Bootstrap0008 <tibble> <tibble>
 9 <split [1752/679]> Bootstrap0009 <tibble> <tibble>
10 <split [1752/640]> Bootstrap0010 <tibble> <tibble>
# ℹ 991 more rows
# ℹ 3 more variables: reg_estimate <dbl>,
#   ipw_estimate <dbl>, dre_estimate <dbl>

Doubly Robust Estimation

Looking at the histogram/distributions of all the estimates:

Code
dat <- all_results_dat |> 
  dplyr::select(reg_estimate, ipw_estimate, dre_estimate) |> 
  tidyr::pivot_longer(cols=everything(), names_to = "method", values_to = "effect estimate")

dat |>
  dplyr::mutate(method=factor(method)) |>
  ggplot(
    aes(
      `effect estimate`, after_stat(density), colour = method, fill = method
      )
  ) +
  geom_histogram(bins = 50, alpha = 0.2, position = 'identity') +
  geom_density(alpha = 0.2) +
  theme(legend.text=element_text(size=10), legend.title = element_blank())

Doubly Robust Estimation

The confidence intervals, including for the DRE estimate, are:

compute CIs for IPW, regression and DRE
dat |> 
  dplyr::group_by(method) |> 
  dplyr::summarise(
  mean = mean(`effect estimate`)
  , lower_ci = quantile(`effect estimate`, 0.025)
  , upper_ci = quantile(`effect estimate`, 0.975)
) |> gt::gt() |> gtExtras::gt_theme_espn() |> gt::as_raw_html()
method mean lower_ci upper_ci
dre_estimate -12.88 -13.56 -12.26
ipw_estimate -12.53 -13.29 -11.69
reg_estimate -12.37 -12.91 -11.88

All three estimates are consistent with one another.

Doubly Robust Estimation

The doubly robust estimator is called doubly robust because it only requires one of the models, \(\hat{P}(x)\) or \(\hat{\mu}(x)\), to be correctly specified.

Look at the first part that estimates \(E[Y^1]\)

\[ \hat{E}[Y^1] = \frac{1}{N}\sum \bigg( \dfrac{D_i(Y_i - \hat{\mu_1}(X_i))}{\hat{P}(X_i)} + \hat{\mu_1}(X_i) \bigg) \]

Assume that \(\hat{\mu_1}(x)\) is correct. Then, even if the propensity score model is wrong, we don’t need to worry: if \(\hat{\mu_1}(x)\) is correct, then \(E[D_i(Y_i - \hat{\mu_1}(X_i))]=0\), because the multiplication by \(D_i\) selects only the treated, and the residuals of \(\hat{\mu_1}\) on the treated have mean zero when \(\hat{\mu_1}\) is the correct conditional mean.

This causes the whole expression to reduce to \(\hat{\mu_1}(X_i)\), which correctly estimates \(E[Y^1]\) by assumption. Similar reasoning applies to the estimator of \(E[Y^0]\).

Doubly Robust Estimation

Here we deliberately break the propensity score model by replacing the propensity score with a random uniform variable between 0.1 and 0.9 (bounded away from 0 and 1 so that extreme values don’t blow up the weights). Since this is random, there is no way it is a good propensity score model, but the doubly robust estimator still produces an estimate very close to the one obtained when the propensity score was estimated with logistic regression.

define a DRE with a bad IPW model
doubly_robust_bad_ipw <- function(df, X, D, Y){
  ps <- runif(dim(df)[1], 0.1, 0.9) # wrong propensity score 
  
  lin_frml <- formula(paste(Y, " ~ ", paste(X, collapse= "+")))
  
  idx <- df[,D] |> dplyr::pull(1) == 0
  mu0 <- # mean response D == 0
    lm(lin_frml, data = df[idx,]) |>
    broom::augment(type.predict = "response", newdata = df[,X]) |>
    dplyr::pull(.fitted)
  
  idx <- df[,D] |> dplyr::pull(1) == 1
  mu1 <- # mean response D == 1
    lm(lin_frml, data = df[idx,]) |>
    broom::augment(type.predict = "response", newdata = df[,X]) |>
    dplyr::pull(.fitted)
  
  # convert treatment factor to integer | recast as vectors
  d <- df[,D] |> dplyr::pull(1) |> as.character() |> as.numeric()
  y <- df[,Y] |> dplyr::pull(1)
  
  mean( d*(y - mu1)/ps + mu1 ) -
    mean(( 1-d)*(y - mu0)/(1-ps) + mu0 )
}

Doubly Robust Estimation

Code
all_results_dat <- all_results_dat |>
  dplyr::mutate(
    dre_bad_ipw_estimate = 
      purrr::map_dbl(
        splits
        , (\(x){
          # rsample::analysis(x) |> 
          #   # dplyr::mutate(net = as.numeric(net)) |> # not needed
          #   recipes::bake(new_data = rsample::analysis(x)) |> 
          doubly_robust_rec |> 
            recipes::bake(new_data = rsample::analysis(x)) |> 
            doubly_robust_bad_ipw(X, D, Y)
        })
      )
  )

dat <- all_results_dat |> 
  dplyr::select(reg_estimate, ipw_estimate, dre_estimate, dre_bad_ipw_estimate) |> 
  tidyr::pivot_longer(cols=everything(), names_to = "method", values_to = "effect estimate")

dat |>
  dplyr::mutate(method=factor(method)) |>
  ggplot(
    aes(
      `effect estimate`, after_stat(density), colour = method, fill = method
      )
  ) +
  geom_histogram(bins = 50, alpha = 0.2, position = 'identity') +
  geom_density(alpha = 0.2) +
  theme(legend.text=element_text(size=10), legend.title = element_blank())

Doubly Robust Estimation

summarize the models so far.
dat |> 
  dplyr::group_by(method) |> 
  dplyr::summarise(
  mean = mean(`effect estimate`)
  , lower_ci = quantile(`effect estimate`, 0.025)
  , upper_ci = quantile(`effect estimate`, 0.975)
) |> gt::gt() |> gtExtras::gt_theme_espn() |> gt::as_raw_html()
method mean lower_ci upper_ci
dre_bad_ipw_estimate -12.90 -13.63 -12.15
dre_estimate -12.88 -13.56 -12.26
ipw_estimate -12.53 -13.29 -11.69
reg_estimate -12.37 -12.91 -11.88

Doubly Robust Estimation

Messing up the propensity score changes the ATE estimate, but not by much.

Now let’s again take a good look at the first part of the estimator, rearranging some terms:

\[ \begin{align*} \hat{E}[Y^{1}] & =\frac{1}{N}\sum\bigg(\dfrac{D_{i}(Y_{i}-\hat{\mu_{1}}(X_{i}))}{\hat{P}(X_{i})}+\hat{\mu_{1}}(X_{i})\bigg)\\ & =\frac{1}{N}\sum\bigg(\dfrac{D_{i}Y_{i}}{\hat{P}(X_{i})}-\dfrac{D_{i}\hat{\mu_{1}}(X_{i})}{\hat{P}(X_{i})}+\hat{\mu_{1}}(X_{i})\bigg)\\ & =\frac{1}{N}\sum\bigg(\dfrac{D_{i}Y_{i}}{\hat{P}(X_{i})}-\bigg(\dfrac{D_{i}}{\hat{P}(X_{i})}-1\bigg)\hat{\mu_{1}}(X_{i})\bigg)\\ & =\frac{1}{N}\sum\bigg(\dfrac{D_{i}Y_{i}}{\hat{P}(X_{i})}-\bigg(\dfrac{D_{i}-\hat{P}(X_{i})}{\hat{P}(X_{i})}\bigg)\hat{\mu_{1}}(X_{i})\bigg) \end{align*} \]

Doubly Robust Estimation

Assume that the propensity score \(\hat{P}(X_i)\) is correctly specified. In this case, \(E[D_i - \hat{P}(X_i)]=0\), which wipes out the part dependent on \(\hat{\mu_1}(X_i)\). This reduces the doubly robust estimator to the propensity score weighting estimator \(\frac{D_iY_i}{\hat{P}(X_i)}\), which is correct by assumption.

So, even if the \(\hat{\mu_1}(X_i)\) is wrong, the estimator will still be correct, provided that the propensity score is correctly specified.

Doubly Robust Estimation

define a DRE with a bad regression model
doubly_robust_bad_reg <- function(df, X, D, Y){
  ps <- # propensity score
    as.formula(paste(D, " ~ ", paste(X, collapse= "+"))) |>
    stats::glm( data = df, family = binomial() ) |>
    broom::augment(type.predict = "response", data = df) |>
    dplyr::pull(.fitted)
  
  mu0 <- rnorm(dim(df)[1], 0, 1) # wrong mean response D == 0
  mu1 <- rnorm(dim(df)[1], 0, 1) # wrong mean response D == 1
  
  # convert treatment factor to integer | recast as vectors
  d <- df[,D] |> dplyr::pull(1) |> as.character() |> as.numeric()
  y <- df[,Y] |> dplyr::pull(1)
  
  mean( d*(y - mu1)/ps + mu1 ) -
    mean(( 1-d)*(y - mu0)/(1-ps) + mu0 )
}

Doubly Robust Estimation

Code
all_results_dat <- all_results_dat |>
  dplyr::mutate(
    dre_bad_reg_estimate = 
      purrr::map_dbl(
        splits
        , (\(x){
          # rsample::analysis(x) |> 
          #   # dplyr::mutate(net = as.numeric(net)) |> # not needed
          #   recipes::bake(new_data = rsample::analysis(x)) |> 
          doubly_robust_rec |> 
            recipes::bake(new_data = rsample::analysis(x)) |> 
            doubly_robust_bad_reg(X, D, Y)
        })
      )
  )

dat <- all_results_dat |> 
  dplyr::select(
    reg_estimate, ipw_estimate, dre_estimate, dre_bad_ipw_estimate, dre_bad_reg_estimate
  ) |> 
  tidyr::pivot_longer(cols=everything(), names_to = "method", values_to = "effect estimate")

dat |>
  dplyr::mutate(method=factor(method)) |>
  ggplot(
    aes(
      `effect estimate`, after_stat(density), colour = method, fill = method
      )
  ) +
  geom_histogram(bins = 50, alpha = 0.2, position = 'identity') +
  geom_density(alpha = 0.2) +
  theme(legend.text=element_text(size=10), legend.title = element_blank())

Doubly Robust Estimation

Once again, we can use bootstrap and see that the variance is just slightly higher.

compare the results so far
dat |> 
  dplyr::group_by(method) |> 
  dplyr::summarise(
  mean = mean(`effect estimate`)
  , lower_ci = quantile(`effect estimate`, 0.025)
  , upper_ci = quantile(`effect estimate`, 0.975)
) |> gt::gt() |> gtExtras::gt_theme_espn() |> gt::as_raw_html()
method mean lower_ci upper_ci
dre_bad_ipw_estimate -12.90 -13.63 -12.15
dre_bad_reg_estimate -12.73 -13.49 -11.97
dre_estimate -12.88 -13.56 -12.26
ipw_estimate -12.53 -13.29 -11.69
reg_estimate -12.37 -12.91 -11.88

Doubly Robust Estimation

Once more, messing up the conditional mean model alone yields only a slightly different ATE estimate. The magic of doubly robust estimation happens because in causal inference there are two ways to remove bias from our causal estimates: model the treatment mechanism or model the outcome mechanism. If either of these models is correct, you are good to go.

One caveat is that, in practice, it’s very hard to model either of these precisely. More often, neither the propensity score model nor the outcome model is 100% correct; they are both wrong, but in different ways. When this happens, it is not exactly settled [1] [2] [3] whether it’s better to use a single model or doubly robust estimation. At least doubly robust estimation gives you two chances of being correct.

Finite Sample Bias

Finite Sample Bias

We know that not accounting for confounders or blocking open backdoor paths can bias our causal estimates, but it turns out that even after accounting for all confounders, we may still get a biased estimate with finite samples. Many of the properties we tout in statistics rely on large samples—how “large” is defined can be opaque. Let’s look at a quick simulation. Here, we have an exposure/treatment, \(X\), an outcome, \(Y\), and one confounder, \(Z\). We will simulate \(Y\), which is only dependent on \(Z\) (so the true treatment effect is 0), and \(X\), which also depends on \(Z\).

\[ \begin{align*} Z &\sim \mathscr{N}(0,1)\\ X & = \mathrm{ifelse}(0.5+Z+\epsilon_X>0,1,0), \quad \epsilon_X \sim \mathscr{N}(0,1)\\ Y & = Z + \mathscr{N}(0,1) \end{align*} \]

Finite Sample Bias

Sampling code

set.seed(928)
n <- 100
finite_sample <- tibble::tibble(
  # z is normally distributed with a mean: 0 and sd: 1
  z = rnorm(n),
  # x is defined from a probit selection model with normally distributed errors
  x = dplyr::case_when(
    0.5 + z + rnorm(n) > 0 ~ 1,
    TRUE ~ 0
  ),
  # y is continuous, dependent only on z with normally distributed errors
  y = z + rnorm(n)
)

Finite Sample Bias

If we fit a propensity score model using the one confounder \(Z\) and calculate the weighted estimator, we should get an unbiased result (which in this case would be \(0\)).

fit the propensity score model with finite samples
## fit the propensity score model
finite_sample_wts <- glm(
  x ~ z,
  data = finite_sample,
  family = binomial("probit")
) |>
  broom::augment(newdata = finite_sample, type.predict = "response") |>
  dplyr::mutate(wts = propensity::wt_ate(.fitted, x))

finite_sample_wts |>
  dplyr::summarize(
    effect = sum(y * x * wts) / sum(x * wts) -
      sum(y * (1 - x) * wts) / sum((1 - x) * wts)
  )
# A tibble: 1 × 1
  effect
   <dbl>
1  0.197

Finite Sample Bias

Our estimated effect is pretty far from 0, although it’s hard to know whether this is really bias or something we are just seeing by chance in this particular simulated sample.

To explore the potential for finite sample bias, we can rerun this simulation many times at different sample sizes:

fit the propensity score model with different number of finite samples
sim <- function(n) {
  ## create a simulated dataset
  finite_sample <- tibble::tibble(
    z = rnorm(n),
    x = dplyr::case_when(
      0.5 + z + rnorm(n) > 0 ~ 1,
      TRUE ~ 0
    ),
    y = z + rnorm(n)
  )
  finite_sample_wts <- glm(
    x ~ z,
    data = finite_sample,
    family = binomial("probit")
  ) |>
    broom::augment(newdata = finite_sample, type.predict = "response") |>
    dplyr::mutate(wts = propensity::wt_ate(.fitted, x))
  bias <- finite_sample_wts |>
    dplyr::summarize(
      effect = sum(y * x * wts) / sum(x * wts) -
        sum(y * (1 - x) * wts) / sum((1 - x) * wts)
    ) |>
    dplyr::pull()
  tibble::tibble(
    n = n,
    bias = bias
  )
}

## Examine 6 different sample sizes, simulate each 1000 times
set.seed(1)
finite_sample_sims <- purrr::map_df(
  rep(
    c(50, 100, 500, 1000, 5000, 10000),
    each = 1000
  ),
  sim
)

bias <- finite_sample_sims |>
  dplyr::group_by(n) |>
  dplyr::summarise(bias = mean(bias))

Finite Sample Bias

Plotting the results:

Finite sample bias present with ATE weights created using correctly specified propensity score model, varying the sample size from n = 50 to n = 10,000

Finite Sample Bias

This demonstrates finite sample bias. Notice that even when the sample size is quite large (5,000) we still see some bias away from the “true” effect of 0. It isn’t until a sample size larger than 10,000 that we see this bias disappear.

Estimands that utilize weights that are unbounded (i.e. that theoretically can be infinitely large) are more likely to suffer from finite sample bias. The likelihood of falling into finite sample bias depends on:

  1. the estimand you have chosen (i.e. are the weights bounded?)
  2. the distribution of the covariates in the exposed and unexposed groups (i.e. is there good overlap? regions of poor overlap, where positivity is nearly violated, are where the weights can become very large)
  3. the sample size.

Matching

Matching estimator

When some confounder \(X\) makes the treated and untreated not directly comparable, we can make them comparable by matching each treated unit with a similar untreated unit: finding an untreated twin for every treated unit. With such comparisons, treated and untreated become comparable again.

As an example, let’s suppose we are trying to estimate the effect of a trainee program on earnings. Here is what the trainee data look like:

read trainees data & filter trainees
data_trainees <- readr::read_csv("data/trainees.csv", show_col_types = FALSE)

data_trainees |> 
  dplyr::filter(trainees==1) |> 
  dplyr::slice_head(n=5) |> 
  gt::gt() |> 
  gtExtras::gt_theme_espn() |> gt::as_raw_html()
unit trainees age earnings
1 1 28 17700
2 1 34 10200
3 1 29 14400
4 1 25 20800
5 1 29 6100
filter non-trainees from trainees data
data_trainees |> 
  dplyr::filter(trainees==0) |> 
  dplyr::slice_head(n=5) |> 
  gt::gt() |> 
  gtExtras::gt_theme_espn() |> gt::as_raw_html()
unit trainees age earnings
20 0 43 20900
21 0 50 31000
22 0 30 21000
23 0 27 9300
24 0 54 41100

Matching estimator

A simple comparison in means suggests that the trainees earn less money than those that didn’t go through the program.

compare non-trainees & trainees mean earnings
data_trainees |> 
  dplyr::group_by(trainees) |> 
  dplyr::summarize(mean_effect = mean(earnings)) |> 
  dplyr::mutate(ATE = mean_effect - dplyr::lag( mean_effect) ) |> 
  gt::gt() |> 
  gtExtras::gt_theme_espn() |> gt::as_raw_html()
trainees mean_effect ATE
0 20724 NA
1 16426 -4297

However, if we look at the data, we notice that trainees are much younger than non-trainees, which suggests that age could be a confounder.

Matching estimator

We can use matching on age to try to correct that.

We will take unit 1 from the treated and pair it with unit 27, since both are 28 years old. We pair unit 2 with unit 34, unit 3 with unit 37, unit 4 with unit 35, and so on. When it comes to unit 5, we need someone aged 29 from the untreated, but that is unit 37, which is already paired.

This is not a problem, since we can use the same unit multiple times. If more than 1 unit is a match, we can choose randomly between them.

match on age
unique_on_age <- data_trainees |> 
  dplyr::filter(trainees == 0) |> 
  dplyr::distinct(age, .keep_all = TRUE)

matches <- data_trainees |> 
  dplyr::filter(trainees == 1) |> 
  dplyr::left_join(unique_on_age, by = 'age', suffix = c("_t_1", "_t_0")) |> 
  dplyr::mutate(t1_minus_t0 = earnings_t_1 - earnings_t_0) 

matches |> dplyr::slice_head(n=5) |> 
  gt::gt() |> 
  gtExtras::gt_theme_espn() |> gt::as_raw_html()
unit_t_1 trainees_t_1 age earnings_t_1 unit_t_0 trainees_t_0 earnings_t_0 t1_minus_t0
1 1 28 17700 27 0 8800 8900
2 1 34 10200 34 0 24200 -14000
3 1 29 14400 37 0 6200 8200
4 1 25 20800 35 0 23300 -2500
5 1 29 6100 37 0 6200 -100

Matching estimator

If we take the mean of the last column we get the ATE estimate while controlling for age. Notice how the estimate is now positive, compared to the simple difference in means we computed before.

mean differences after matching on age
matches |> dplyr::summarize(ATE = mean(t1_minus_t0)) |> 
  gt::gt() |> 
  gtExtras::gt_theme_espn() |> gt::as_raw_html()
ATE
2458

This is a contrived example, just to introduce the matching estimator.

Matching estimator

In practice, we usually have more than one feature and units don’t match perfectly. In this case, we have to define some measure of proximity to quantify how close units are to each other.

One common metric for this is the Euclidean norm \(||X_i - X_j||\). This distance, however, is not invariant to the scale of the features.

This means that features like age, which take values in the tens, will matter much less when computing this norm than features like income, which are on the order of thousands.

For this reason, before applying the norm, we need to scale the features so that they are on roughly the same scale.
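A toy sketch with hypothetical numbers makes the point: on the raw scale the income difference dominates, so the “nearest” untreated unit can be a poor match on age; after rescaling each feature by an assumed typical spread, the similar-aged unit becomes the nearest neighbour.

scaling and the Euclidean norm (toy example)
treated_unit <- c(age = 30, income = 1000)
untreated <- rbind(A = c(age = 31, income = 1300),
                   B = c(age = 55, income = 1050))

diffs <- sweep(untreated, 2, treated_unit)       # feature differences from the treated unit
sqrt(rowSums(diffs^2))                           # raw distances: B looks closest

scales <- c(age = 10, income = 300)              # assumed typical spread of each feature
sqrt(rowSums(sweep(diffs, 2, scales, "/")^2))    # scaled distances: A is now closest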

Matching estimator

Having defined a distance measure, we can now define the match as the nearest neighbour to the sample we wish to match.

We can write the matching estimator the following way:

\[ \hat{\tau}_{ATE} = \frac{1}{N} \sum^N_{i=1} (2D_i - 1)\big(Y_i - Y_{j(i)}\big) \]

where \(Y_{j(i)}\) is the outcome of the unit in the other treatment group whose covariates are most similar to those of unit \(i\).

We multiply by \(2D_i - 1\) so that the comparison runs both ways: treated matched with controls and controls matched with treated.

Matching estimator

To test this estimator, let’s consider a medical example.

We want to find the effect of a medication on days until recovery.

Unfortunately, this effect is confounded by severity, sex and age. We have reasons to believe that patients with more severe conditions have a higher chance of receiving the medicine.

read medical data and display sample
data_med <- readr::read_csv("data/medicine_impact_recovery.csv", show_col_types = FALSE)

data_med |> 
  dplyr::slice_head(n=5) |> 
  gt::gt() |> 
  gtExtras::gt_theme_espn() |> 
  gt::as_raw_html() 
sex age severity medication recovery
0 35.05 0.8877 1 31
1 41.58 0.8998 1 49
1 28.13 0.4863 0 38
1 36.38 0.3231 0 35
0 25.09 0.2090 0 15

Matching estimator

Again looking at the simple difference in means, \(E[Y|D=1]-E[Y|D=0]\), we estimate that the treated take, on average, 16.9 more days to recover than the untreated.

This is probably due to confounding, since we don’t expect the medicine to cause harm to the patient.

contrast recovery by treatment
data_med |> 
  dplyr::group_by(medication) |> 
  dplyr::summarize(mean_effect = mean(recovery)) |> 
  dplyr::mutate(ATE = mean_effect - dplyr::lag( mean_effect) ) |> 
  gt::gt() |> 
  gtExtras::gt_theme_espn() |> 
  gt::as_raw_html()
medication mean_effect ATE
0 21.68 NA
1 38.57 16.9

Matching estimator

To correct for confounding, we will control for covariates using matching.

First, we need to remember to scale our covariates, otherwise, features like age will have higher importance than features like severity when we compute the distance between points.

To do this, we can standardise the features.

standardize medical covariates
data_med <- data_med |> recipes::recipe(recovery ~ .) |> 
  recipes::update_role(medication, new_role = 'treatment') |> 
  recipes::step_normalize(recipes::all_predictors()) |> 
  recipes::prep() |> 
  recipes::bake(new_data=NULL)

data_med |>   
  dplyr::slice_head(n=5) |> 
  gt::gt() |> 
  gtExtras::gt_theme_espn() |> 
  gt::as_raw_html()
sex age severity medication recovery
-0.997 0.2808 1.4598 1 31
1.003 0.8654 1.5022 1 49
1.003 -0.3387 0.0578 0 38
1.003 0.3995 -0.5126 0 35
-0.997 -0.6105 -0.9111 0 15

Matching estimator

Instead of coding a matching function, we will use the K nearest neighbour algorithm from caret::knnreg.

This algorithm makes predictions by finding the nearest data point in an estimation or training set.

For matching, we will need two fitted models. One, mt0, stores the untreated points and finds matches among the untreated when asked to do so.

The other, mt1, stores the treated points and finds matches among the treated. After this fitting step, we can use these KNN models to make predictions, which will be our matches.

Matching estimator

matching on medical covariates using knn
treated   <- data_med |> dplyr::filter(medication==1)
untreated <- data_med |> dplyr::filter(medication==0)

mt0 <- # untreated knn model predicting recovery
  caret::knnreg(x = untreated |> dplyr::select(sex,age,severity), y = untreated$recovery, k=1)
mt1 <- # treated knn model predicting recovery
  caret::knnreg(x = treated |> dplyr::select(sex,age,severity), y = treated$recovery, k=1)

predicted <-
  # combine the treated and untreated matches
  c(
    # find matches for the treated looking at the untreated knn model
    treated |>
      tibble::rowid_to_column("ID") |> 
      {\(y)split(y,y$ID)}() |> # hack for native pipe 
      # split(.$ID) |>         # this version works with magrittr
      purrr::map(
        (\(x){
          x |> 
            dplyr::mutate(
              match = predict( mt0, x[1,c('sex','age','severity')] )
            )
        })
      )
    # find matches for the untreated looking at the treated knn model
    , untreated |>
        tibble::rowid_to_column("ID") |> 
        {\(y)split(y,y$ID)}() |> 
        # split(.$ID) |> 
        purrr::map(
          (\(x){
            x |> 
              dplyr::mutate(
                match = predict( mt1, x[1,c('sex','age','severity')] )
              )
          })
        )
  ) |>
  # bind the treated and untreated data 
  dplyr::bind_rows()

predicted |>   
  dplyr::slice_head(n=5) |> 
  gt::gt() |> 
  gt::fmt_number(columns = c('sex','age','severity'), decimals = 6) |>
  gtExtras::gt_theme_espn() |> 
  gt::as_raw_html()
ID sex age severity medication recovery match
1 −0.996980 0.280787 1.459800 1 31 39
2 1.002979 0.865375 1.502164 1 49 52
3 −0.996980 1.495134 1.268540 1 38 46
4 1.002979 −0.106534 0.545911 1 34 45
5 −0.996980 0.043034 1.428732 1 30 39

Matching estimator

With the matches, we can now apply the matching estimator formula:

\[ \hat{\tau}_{ATE} = \frac{1}{N} \sum^N_{i=1} (2D_i - 1)\big(Y_i - Y_{j(i)}\big) \]

estimate ATE using knn matching
predicted |> 
  dplyr::summarize(
    "ATE (est)" = mean( (2*medication - 1) * (recovery - match) )
  ) |> 
  gt::gt() |> 
  gtExtras::gt_theme_espn() |> 
  gt::as_raw_html()
ATE (est)
-0.9954

Using knn matching, our estimate suggests that the effect of the medicine is not positive anymore.

This means that, controlling for \(X\), the medicine reduces the recovery time by about 1 day, on average.

This is already a huge improvement over the biased estimate, which suggested a 16.9-day increase in recovery time.

Matching bias

It turns out the matching estimator as we’ve designed above is biased.

To see this, let’s consider the ATT estimator, instead of the ATE, since it is simpler to write.

The intuition will apply to the ATE as well.

\[ \hat{\tau}_{ATT} = \frac{1}{n_1}\sum_{i:D_i=1} (Y_i - Y_j(i)) \]

where \(n_1\) is the number of treated individuals and \(Y_j(i)\) is the untreated match of treated unit i.

Matching bias

The bias in the matching estimator arises when the matching is not exact.

To see this, consider the mean outcome for the untreated, given covariates X, \(\mu_0(X_i)=E[Y|X_i, D=0]\).

When we estimate the counterfactual of a treated unit by matching the covariates of the treated unit against the covariates of all untreated units, we may find

\[ \mu_0(X_i) \ne \mu_0(X_{j(i)}) \]

which leads to a bias in \(\hat{\tau}_{ATT}\).

Matching bias

Bias arises from matching discrepancies. Fortunately, we know how to correct it. Each observation contributes \(\big(\mu_0(X_i) - \mu_0(X_{j(i)})\big)\) to the bias, so all we need to do is subtract this quantity from each matching comparison in our estimator.

To do so, we can replace \(\mu_0(X_{j(i)})\) with an estimate of this quantity, \(\hat{\mu}_0(X_{j(i)})\), which can be obtained with a model like linear regression. This updates the ATT estimator to the following equation:

\[ \hat{\tau}_{ATT} = \frac{1}{n_1}\sum_{i:D_i=1} \big((Y_i - Y_{j(i)}) - (\hat{\mu_0}(X_i) - \hat{\mu_0}(X_{j(i)}))\big) \]

where \(\hat{\mu_0}(x)\) is some estimate of \(E[Y|X, D=0]\), e.g. a linear regression fitted only on the untreated sample.

Matching bias

bias corrected match
ols0 <- lm(recovery ~ sex + age + severity, data = untreated) 
ols1 <- lm(recovery ~ sex + age + severity, data = treated) 

# find the units that match to the treated
treated_match_index <- # RANN::nn2 does Nearest Neighbour Search
  (RANN::nn2(mt0$learn$X, treated |> dplyr::select(sex,age,severity), k=1))$nn.idx |> 
  as.vector()

# find the units that match to the untreated
untreated_match_index <- # RANN::nn2 does Nearest Neighbour Search
  (RANN::nn2(mt1$learn$X, untreated |> dplyr::select(sex,age,severity), k=1))$nn.idx |> 
  as.vector()

predicted <- 
c(
  purrr::map2(
    .x = 
      treated |> tibble::rowid_to_column("ID") |> {\(y)split(y,y$ID)}() # split(.$ID) 
    , .y = treated_match_index
    , .f = (\(x,y){
          x |> 
            dplyr::mutate(
              match = predict( mt0, x[1,c('sex','age','severity')] )
              , bias_correct =
                  predict( ols0, x[1,c('sex','age','severity')] ) -
                  predict( ols0, untreated[y,c('sex','age','severity')] )
            )
        })
  )
  , purrr::map2(
    .x = 
      untreated |> tibble::rowid_to_column("ID") |> {\(y)split(y,y$ID)}() # split(.$ID) 
    , .y = untreated_match_index
    , .f = (\(x,y){
          x |> 
            dplyr::mutate(
              match = predict( mt1, x[1,c('sex','age','severity')] )
              , bias_correct =
                  predict( ols1, x[1,c('sex','age','severity')] ) -
                  predict( ols1, treated[y,c('sex','age','severity')] )
            )
        })
  )
) |> 
  # bind the treated and untreated data 
  dplyr::bind_rows()

predicted |>   
  dplyr::slice_head(n=5) |> 
  gt::gt() |> 
  gt::fmt_number(columns = c('sex','age','severity'), decimals = 6) |>
  gtExtras::gt_theme_espn()
ID sex age severity medication recovery match bias_correct
1 −0.996980 0.280787 1.459800 1 31 39 4.404
2 1.002979 0.865375 1.502164 1 49 52 12.915
3 −0.996980 1.495134 1.268540 1 38 46 1.871
4 1.002979 −0.106534 0.545911 1 34 45 -0.497
5 −0.996980 0.043034 1.428732 1 30 39 2.610

Matching bias

Doesn’t this defeat the point of matching? If we have to run a linear regression anyway, why don’t we just use the regression instead of this complicated model?

First of all, this linear regression that we are fitting doesn’t extrapolate on the treatment dimension to get the treatment effect. Instead, its purpose is just to correct bias.

Linear regression here is local, in the sense that it doesn’t try to see how the treated would look if they resembled the untreated. It does none of that extrapolation; that is left to the matching part.

The meat of the estimator is still the matching component. The point is that OLS is secondary to this estimator.

Matching bias

The second point is that matching is a non-parametric estimator. It doesn’t assume linearity or any kind of parametric model.

As such, it is more flexible than linear regression and can work in situations where linear regression will not, namely, those where non linearity is very strong.

Matching bias

With the bias correction formula, we get the following ATE estimate.

predicted |> 
  dplyr::summarize(
    "ATE (est)" = 
      mean( (2*medication - 1) * (recovery - match - bias_correct) )) |> 
  gt::gt() |> 
  gtExtras::gt_theme_espn() |> gt::as_raw_html()
ATE (est)
-7.363

Matching CI

Of course, we also need to place a confidence interval around this measurement.

In practice, we can simply use an existing implementation of the matching estimator. Here is one from the Matching package (note that ATC, the effect on the controls, is the same as the ATU).

matching with the Matching package
require(Matching)
#  See https://www.jsekhon.com for additional documentation.
c("ATE", "ATC", "ATT") |> 
  purrr::map(
    (\(x){
      pairmatching <- Matching::Match(
        Y = data_med$recovery
        , Tr = data_med$medication
        , X = data_med |> dplyr::select('sex','age','severity')
        , estimand = x
        , BiasAdjust = TRUE
      )
      tibble::tibble(measure = x, est = pairmatching$est[1,1], se = pairmatching$se)
    })
  ) |> 
  dplyr::bind_rows() |> 
  gt::gt() |> 
  gt::fmt_number(columns = c('est','se'), decimals = 3) |>
  gt::tab_header(
    title = "Treatment Effect Estimates: Matching"
    , subtitle = "using R package Matching"
  ) |> 
  gtExtras::gt_theme_espn() |> gt::as_raw_html()
Treatment Effect Estimates: Matching
using R package Matching
measure est se
ATE −7.708 0.962
ATC −6.664 1.644
ATT −9.680 0.123

Finally, we can say with confidence that our medicine does indeed lower the time someone spends in the hospital. The ATE estimate is a little lower than our own algorithm’s, due to differences in how ties among matches are broken by our knn implementation and by the Matching package.

Matching bias again

We saw that matching is biased when the unit and its match are not so similar. But what causes them to be so different?

As it turns out, the answer is quite simple and intuitive. It is easy to find people who match on a few characteristics, like sex. But if we add more characteristics, like age, income, city of birth and so on, it becomes harder and harder to find matches. More generally, the more features we match on, the larger the distance between units and their matches will be; the small simulation below illustrates this.
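A small, purely illustrative simulation (simulated standard-normal features, not our medical data; avg_nn_dist is a name introduced here) shows that the average distance to a unit’s nearest neighbour grows as we add matching features:

nearest-neighbour distance vs number of features
set.seed(1)
avg_nn_dist <- function(p, n = 500) {
  X <- matrix(rnorm(n * p), ncol = p)
  nn <- RANN::nn2(X, X, k = 2)   # column 1 of the result is each point itself (distance 0)
  mean(nn$nn.dists[, 2])         # average distance to the nearest *other* point
}
sapply(c(1, 2, 5, 10, 20), avg_nn_dist)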

Fixed Effects

Fixed Effects

  • One problem we might have is that we can’t really control for variables if we can’t measure them.
  • And there could be variables we need to control for but we can’t measure or don’t have data for! So what can we do?
  • There is a solution, as long as these variables stay constant within some larger category.

Fixed Effects

The Solution

  • If we observe each category (entity/person/firm/country) multiple times, then we can forget about controlling for the related back-door variables we’re interested in, and
  • Just control for entity/person/firm/country identity instead!
  • This will control for EVERYTHING unique to that individual, whether we can measure it or not, as long as they are fixed within the category.

Fixed Effects

Consider data that tracks life expectancy and GDP per capita in many countries over time.

data(gapminder, package = 'gapminder')

gapminder <- gapminder |> dplyr::group_by(country) |> 
  dplyr::mutate(
    lifeExp.r = lifeExp - mean(lifeExp),
    logGDP.r = log(gdpPercap) - mean(log(gdpPercap))
  ) |> 
  dplyr::ungroup()

Then, having de-meaned within country, we can estimate the effect of GDP per capita on life expectancy while controlling for country; a minimal sketch follows.
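A minimal sketch of that estimation (m_within and m_fe are names introduced here): the slope from regressing the de-meaned variables on each other matches the slope from a country fixed-effects regression.

within vs fixed effects
# 1) regression on the within-country de-meaned variables created above
m_within <- lm(lifeExp.r ~ logGDP.r, data = gapminder)

# 2) the equivalent country fixed-effects regression
m_fe <- fixest::feols(lifeExp ~ log(gdpPercap) | country, data = gapminder)

modelsummary::msummary(list(within = m_within, FE = m_fe), gof_omit = 'AIC|BIC|F|Lik|Adj')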

Fixed Effects

  • Note that there are LOTS of things that might be back doors between GDP per capita and life expectancy
  • War, disease, political institutions, trade relationships, health of the population, economic institutions…

Fixed Effects

Code
dag <- ggdag::dagify(
  LifeEx~GDPpc+A+B+C+D+E+F+G+H,
  GDPpc~A+B+C+D+E+F+G+H,
  coords=list(
    x=c(LifeEx=4,GDPpc=2,A=1,B=2,C=3,D=4,E=1,F=2,G=3,H=4),
    y=c(LifeEx=2,GDPpc=2,A=3,B=3,C=3,D=3,E=1,F=1,G=1,H=1)
  )
)  |>  ggdag::tidy_dagitty()
ggdag::ggdag_classic(dag,node_size=20) + 
  ggdag::theme_dag_blank()

Fixed Effects

  • There’s no way we can identify the ATE of GDP per capita on life expectancy
  • The list of back doors is very long
  • And likely includes some things we can’t measure!

Fixed Effects

  • HOWEVER! If we think that these things are likely to be constant within country…
  • Then we don’t really have a big long list of back doors, we just have one: “country”

Fixed Effects

  • We can now identify our effect even if some of our back doors include variables that we can’t actually measure
  • When we do this, we’re basically comparing countries to themselves at different time periods!
  • Pretty good way to do an apples-to-apples comparison!

Fixed Effects

Fixed Effects

  • The post-fixed-effects dots are basically a bunch of “Raw Country X” pasted together.
  • Imagine taking “Raw Pakistan” and moving it to the center, then taking “Raw Britain” and moving it to the center, etc.
  • Ignoring the baseline differences between Pakistan, Britain, China, etc., in their GDP per capita and life expectancy, and just looking within each country.
  • We are ignoring all differences between countries (since that way back doors lie!) and looking only at differences within countries.
  • Fixed Effects is sometimes also referred to as the “within” estimator

Fixed Effects

Fixed Effects

  • This does assume, of course, that all those back door variables CAN be described by country
  • In other words, that these back doors operate by things that are fixed within country
  • If something is a back door and changes over time in that country, fixed effects won’t help!

Fixed Effects

  • For example, earlier we mentioned war… that’s not fixed within country! A given country is at war sometimes and not other times.

Fixed Effects

  • Of course, in this case, we could control for War as well and be good!
  • Time-varying back doors don’t mean that fixed effects doesn’t work, it just means you need to control for those time-varying variables too
  • It comes down to thinking carefully about your DAG
  • Fixed effects mainly works as a convenient way of combining together lots of different constant-within-country back doors into something that lets us identify the model even if we can’t measure them all

Fixed Effects Regression

  • We can just calculate fixed effects as we did, i.e. subtract out the group means and analyze (perhaps with regression) what’s left, or
  • We can also include dummy variables for each group/individual, which accomplishes the same thing (a quick check follows the equations below)

\[ Y = \beta_0 + \beta_1 Group_1 + \beta_2 Group_2 + \dots + \beta_N Group_N + \beta_{N+1} X + \varepsilon \] \[ Y = \beta_i + \beta_1 X + \varepsilon \]
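As a quick check of this equivalence, reusing the gapminder data prepared earlier (m_lsdv is a name introduced here), the dummy-variable regression returns the same slope on log GDP per capita as the within/fixed-effects estimates above:

dummy-variable (LSDV) check
m_lsdv <- lm(lifeExp ~ log(gdpPercap) + factor(country), data = gapminder)
coef(m_lsdv)["log(gdpPercap)"]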

Fixed Effects Regression

  • Why does that work?
  • We want to control for “group/individual” right? So just put in a control for group/individual.
  • As with any categorical predictor, we leave out a reference group.
  • But here, unlike with, say, a binary predictor, we’re rarely interested in the FE coefficients themselves. Most software works with the mean-subtraction approach (or a variant) and doesn’t even report them!

Fixed Effects Regression: Variation

  • Remember we are isolating within variation
  • If an individual has no within variation, say their treatment never changes, they basically get washed out entirely!
  • A fixed-effects regression wouldn’t represent them. And we can’t use FE to study things that are fixed over time
  • And in general if there’s not a lot of within variation, FE is going to be very noisy. Make sure there’s variation to study!

Fixed Effects in Regression: Notes

  • It’s common to cluster standard errors at the level of the fixed effects, since it seems likely that errors would be correlated over time (autocorrelated errors).
  • It’s possible to have more than one set of fixed effects. \(Y = \beta_i + \beta_j + \beta_1X + \varepsilon\)
  • But interpretation gets tricky - think through what variation in \(X\) you have at that point!

Coding up Fixed Effects

  • We can use the fixest package
  • It’s very fast, and can be easily adjusted to do FE with other regression methods like logit, or combined with instrumental variables
  • Clusters are at the first listed fixed effect by default
m1 <- fixest::feols(outcome ~ predictors | FEs, data = data)
modelsummary::msummary(m1)

FE Example: Sentencing

  • What effect do sentencing reforms have on crime?
  • One purpose of punishment for crime is to deter crime
  • If sentences are more clear and less risky, that may reduce a deterrent to crime and so increase crime
  • Marvell & Moody study this using data on reforms in US states from 1969-1989

FE Example: Sentencing

  • In our data we have multiple observations per state
state year assault robbery pop1000 sentreform assaultper1000 robberyper1000
ARIZ 86 14489 5614 3281 1 4.4160 1.7111
WISC 75 2971 3381 4570 0 0.6501 0.7398
SD 74 1008 139 680 0 1.4824 0.2044
MD 84 19364 13097 4349 0 4.4525 3.0115
ME 84 1352 305 1157 1 1.1685 0.2636
KA 81 5301 2611 2390 0 2.2180 1.0925
WASH 76 8327 4317 3691 0 2.2560 1.1696
RI 78 2197 918 957 0 2.2957 0.9592
GA 87 19438 13014 6228 0 3.1211 2.0896
WY 80 1470 208 475 0 3.0947 0.4379

Fixed Effects

  • We can see how robbery rates evolve in each state over time as states implement reform

Fixed Effects

  • You can tell that states are more or less likely to implement reform in a way that’s correlated with the level of robbery they already had
  • So something about the state is driving both the level of robberies and the decision to implement reform
  • Who knows what!
  • Our diagram has reform -> robberies and reform <- state -> robberies, which is something we can address with fixed effects.

Fixed Effects

fixed effects vs regression
sentencing_ols <- lm(robberyper1000 ~ sentreform, data = mmdata)
sentencing_fe <- fixest::feols(robberyper1000 ~ sentreform | state, data = mmdata)
modelsummary::msummary(
  list('OLS' = sentencing_ols, 'FE' = sentencing_fe)
  , stars = TRUE, output = "gt"
  , gof_omit = 'AIC|BIC|F|Lik|Adj|Pseudo'
) |> gt::as_raw_html()
             OLS        FE
(Intercept)  1.254***
             (0.036)
sentreform   0.352***   0.245**
             (0.082)    (0.076)
Num.Obs.     1000       1000
R2           0.018      0.919
R2 Within               0.062
RMSE         1.02       0.29
Std.Errors              by: state
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Example

  • The OLS coefficients (1.254, 0.352) reflect the fact that different kinds of states tend to institute reform (between variation)
  • The FE coefficient (0.245) doesn’t! (within variation)
  • Looks like the deterrent effect was real! Although it’s important to consider whether there might be time-varying back doors too; we don’t account for those in this analysis
  • What things might change within state over time that would be related to robberies and to sentencing reform?

Event Studies

Event studies

  • Event studies examine how outcomes evolve around the timing of specific events or policy interventions. Event studies can serve as a standalone causal inference method by exploiting sharp temporal variation in treatment timing.
  • If treatment timing is as-good-as-random (conditional on observables), then deviations from pre-treatment trends after the event can be causally attributed to treatment.

Event studies

Regression specification

\[ Y_{it} = \alpha_i + \lambda_t + \sum_{k \neq -1} \beta_k D_{it}^k + \varepsilon_{it} \] where \(D_{it}^k = 1\) if unit \(i\) is \(k\) periods from treatment at time \(t\). You include \(q\) leads (anticipatory effects) and \(m\) lags (post-treatment effects).

This allows you to check both how dynamic the post-treatment effects were, and whether the two groups were comparable in their outcome dynamics pre-treatment.

Event studies

When Event Studies Work as Causal Inference

  • Sharp timing: Clear before/after distinction (regulatory changes, court decisions, natural disasters)
  • Exogenous timing: Treatment timing uncorrelated with potential outcomes
  • Stable trends: Outcomes would follow predictable path absent treatment
  • No anticipation: Agents don’t adjust behavior before official treatment

Event studies

Validity Tests and Diagnostics

  • Pre-trend Analysis: Examine \(β_k\) for \(k < 0\)
  • Should be statistically zero if assumptions hold
  • Gradual buildup suggests anticipation or confounding trends

Placebo Tests:

  • Estimate “fake” treatment at random times in pre-period
  • Apply treatment to similar but untreated units

Robustness Checks:

  • Vary event window length
  • Test sensitivity to functional form assumptions
  • Examine heterogeneity across subgroups

Example

event study example
set.seed(10)
# Create data with 20 groups and 10 time periods
df <- tidyr::crossing(id = 1:20, t = 1:10) |>
 # Add an event in period 6 with a one-period positive effect
 dplyr::mutate(Y = rnorm(dplyr::n()) + 1*(t == 6))

# Use i() in feols to include time dummies,
# specifying that we want to drop t = 5 as the reference
m <- fixest::feols(Y ~ i(t, ref = 5), data = df, cluster = 'id')

# Plot the results, except for the intercept, and add a line joining them 
# and a space and line for the reference group
fixest::coefplot(m, drop = '(Intercept)',
 pt.join = TRUE, ref = c('t:5' = 6), ref.line = TRUE)
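As a quick illustration of the placebo idea above, a minimal sketch reusing df from the chunk above (placebo_df and fake_event are names introduced here): we pretend the event happened at t = 3 and use only pre-period data, so the estimate should be close to zero.

placebo test sketch
placebo_df <- df |>
  dplyr::filter(t < 6) |>
  dplyr::mutate(fake_event = as.numeric(t == 3))

fixest::feols(Y ~ fake_event, data = placebo_df, cluster = 'id')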

Difference-in-Differences

Difference-in-Differences (2x2 DiD)

  • Assume panel data on \(Y_{it}\) for \(t=1,2\) and \(i = 1,...,N\)
  • Treatment timing: Some units (\(D_i=1\)) are treated in period \(2\); every other unit is untreated \((D_i=0)\)
  • Potential outcomes (POs): Observe \(Y_{it}(1) \equiv Y_{it}(0,1)\) for treated units; and \(Y_{it}(0) \equiv Y_{it}(0,0)\) for comparison.


DiD Estimation and Inference

  • The most conceptually simple estimator replaces population means with sample analogs: \[\hat{\tau}_{DiD} = (\bar{Y}_{12} - \bar{Y}_{11}) - (\bar{Y}_{02} - \bar{Y}_{01}) \] where \(\bar{Y}_{dt}\) is the sample mean for group \(d\) in period \(t\)
  • This is also known as “four averages and three subtractions”
  • Causally, we compare the treated group’s before-vs-after change with the untreated group’s before-vs-after change

DiD Estimation and Inference

  • Conveniently, \(\hat\tau_{DiD}\) is algebraically equal to the OLS coefficient \(\hat\beta\) from \[\begin{align*} Y_{it} = \alpha_i + \phi_t + D_{it} \beta + \epsilon_{it}, \end{align*}\] where \(D_{it} = D_i \times 1[t=2]\). It’s also equivalent to \(\beta\) from the first-differenced regression \(\Delta Y_{i} = \alpha + \Delta D_i \beta + u_{i}\) (a sketch of this equivalence follows below).
  • Inference: clustered standard errors are valid as number of clusters grows large.

DiD Estimation and Inference

Aggregating individuals into treated and untreated groups:

\[\begin{align*} Y_{it} = \alpha_i + \beta_1 D_i + \beta_2\text{post}_t + \delta (D_i\times\text{post}_t) + \epsilon_{it} \end{align*}\]

  • \(Y_{i1}(0)=\alpha_i\), \(Y_{i2}(0)=\alpha_i+\beta_2\) and
  • \(Y_{i1}(1)=\alpha_i + \beta_1\), \(Y_{i2}(1)=\alpha_i+\beta_1+\beta_2+\delta\)
  • with \(Y_{i2}(0)-Y_{i1}(0)=\beta_2\) and \(Y_{i2}(1)-Y_{i1}(1)=\beta_2+\delta\)
  • so \((Y_{i2}(1)-Y_{i1}(1)) - (Y_{i2}(0)-Y_{i1}(0))=\delta\equiv\hat{\tau}_{DiD}\)
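
To make the algebra above concrete, here is a minimal simulation (a sketch with made-up parameters and a true effect of 3) confirming that the four-averages-and-three-subtractions estimate equals the OLS interaction coefficient \(\delta\):

2x2 DiD: sample means vs OLS interaction
set.seed(1)
n <- 500
d22 <- tidyr::expand_grid(i = 1:n, t = 1:2) |>
  dplyr::mutate(
    D    = as.numeric(i <= n / 2),          # first half of units treated in period 2
    post = as.numeric(t == 2),
    # group-specific level, common time trend, true treatment effect = 3
    Y    = 1 + 2 * D + 1.5 * post + 3 * D * post + rnorm(dplyr::n())
  )

# (1) four averages and three subtractions
means <- d22 |>
  dplyr::group_by(D, post) |>
  dplyr::summarise(m = mean(Y), .groups = "drop")
tau_hand <- with(means,
  (m[D == 1 & post == 1] - m[D == 1 & post == 0]) -
  (m[D == 0 & post == 1] - m[D == 0 & post == 0]))

# (2) OLS coefficient on the interaction term
tau_ols <- coef(lm(Y ~ D * post, data = d22))["D:post"]

c(by_hand = tau_hand, ols = unname(tau_ols))  # both should be close to 3 and identical to each other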

DiD Causal Inference

Under what conditions does DiD have a causal interpretation?

Our estimator is \[ \hat{\tau}_{DiD}=E[Y_{i2}(1) - Y_{i1}(1)| D_i =1] - E[Y_{i2}(0) - Y_{i1}(0)| D_i =0] \]

Adding and subtracting the untreated trend of the treated group, \(E[Y_{i2}(0)-Y_{i1}(0)|D_{i}=1]\):

\[ \begin{align*} \hat{\tau}_{DiD} & =E[Y_{i2}(1)-Y_{i1}(1)|D_{i}=1]-E[Y_{i2}(0)-Y_{i1}(0)|D_{i}=0]\\ & =\underbrace{E[Y_{i2}(1)-Y_{i2}(0)|D_{i}=1]}_{\text{ATT in period 2}}-\underbrace{E[Y_{i1}(1)-Y_{i1}(0)|D_{i}=1]}_{\text{anticipation (pre-period) effect}}\\ & \quad+\underbrace{E[Y_{i2}(0)-Y_{i1}(0)|D_{i}=1]-E[Y_{i2}(0)-Y_{i1}(0)|D_{i}=0]}_{\text{parallel-trends violation}} \end{align*} \]

Causal Inference

So for our DiD estimator to estimate a causal estimand (ATT) we need these assumptions satisfied:

  1. No anticipation: \[E[Y_{i1}(1)|D_{i}=1] = E[Y_{i1}(0)|D_{i}=1]\]

  2. Parallel trends: \[0=E[Y_{i2}(0)-Y_{i1}(0)|D_{i}=1]-E[Y_{i2}(0)-Y_{i1}(0)|D_{i}=0]\]

Causal Inference

  1. No anticipation: no treatment effects before treatment begins, so make sure your baseline period is genuinely untreated (be aware of “announcement dates” vs “implementation dates”)

  2. Parallel trends: this needs to be supported by data, e.g. pre-trends or trends in similar groups

Causal Inference

\[ \begin{align*} \hat{\tau}_{DiD} & =\text{ATT}_{\text{post}} \\ & + \text{parallel-trends violation bias}\\ & - \text{ATT}_{\text{pre}}\,\text{(anticipation bias)} \end{align*} \]

DiD

The 2x2 setup is interpreted as before vs after treatment, but “before” and “after” can each span multiple time periods.

Simple 2x2 example

# A tibble: 4 × 3
# Groups:   post [2]
   post treat mean_l_homicide
  <dbl> <dbl>           <dbl>
1     0     0            1.25
2     0     1            1.79
3     1     0            1.20
4     1     1            1.80
[1] 0.06824
OLS estimation, Dep. Var.: l_homicide
Observations: 462
Standard-errors: Clustered (state) 
            Estimate Std. Error t value   Pr(>|t|)    
(Intercept)  1.25185    0.10268 12.1918 3.2179e-15 ***
post::1     -0.05540    0.03302 -1.6775 1.0105e-01    
treat::1     0.53422    0.17048  3.1337 3.1842e-03 ** 
post:treat   0.06824    0.08417  0.8107 4.2221e-01    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.532721   Adj. R2: 0.189901
OLS estimation, Dep. Var.: l_homicide
Observations: 462
Weights: popwt
Standard-errors: Clustered (treat) 
            Estimate Std. Error    t value   Pr(>|t|)
(Intercept)  1.56018  6.970e-16  2.237e+15 2.8452e-16
post::1     -0.09338  5.570e-16 -1.676e+14 3.7992e-15
treat::1     0.35026  7.090e-16  4.939e+14 1.2889e-15
post:treat   0.04565  1.095e-15  4.170e+13 1.5267e-14
               
(Intercept) ***
post::1     ***
treat::1    ***
post:treat  ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 962.9   Adj. R2: 0.134205

Longitudinal Data

  • Two types of longitudinal data:
    • Panel Data: same units tracked over time (e.g., National Longitudinal Survey of Youth 1997)
    • Repeated Cross-Sections: different units sampled at each time (e.g., Census, Current Population Survey)
  • Violations of parallel trends can arise differently across the two data types.

Longitudinal Data

We have a balanced panel when all units are observed in every period.

  • An unbalanced panel has some units missing in some periods, e.g. (a quick balance check is sketched after this list):
    • Anthony is in periods 1-3,
    • Bob is in periods 1-3,
    • Inez is in periods 1 and 3 only,
    • Dignan is in periods 1 and 2 only.
  • Missingness alone does not violate parallel trends, though it does change the parameter.
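
A quick way to check whether a panel is balanced (a minimal sketch built from the toy roster above) is to count the periods observed per unit:

checking panel balance
# Toy unbalanced panel matching the roster above
panel <- tibble::tibble(
  id = c("Anthony", "Anthony", "Anthony",
         "Bob", "Bob", "Bob",
         "Inez", "Inez",
         "Dignan", "Dignan"),
  t  = c(1, 2, 3,  1, 2, 3,  1, 3,  1, 2)
)

# Units observed in fewer periods than the maximum are the source of imbalance
panel |>
  dplyr::count(id, name = "n_periods") |>
  dplyr::mutate(balanced = n_periods == max(n_periods))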

Longitudinal Data

In the potential outcomes framework, a treatment effect is defined at the individual level, \(\delta_{it}\).

  • So if a person \(i\) is missing in period \(t\), their effect \(\delta_{it}\) does not contribute to the estimate.

  • The more heterogeneous the treatment effects, the further an unbalanced (broken) panel can shift the estimate away from the estimand you intended to target.

Repeated cross-sections

  • One of the risks of a repeated cross-section is that the composition of the sample may have changed between the pre and post periods in ways that are correlated with treatment.
  • Hong (2013) uses repeated cross-sectional data from the Consumer Expenditure Survey (CEX) containing music expenditure and internet use for a random sample of households.
  • The study exploits the emergence of Napster (the first file-sharing software widely used by Internet users) in June 1999 as a natural experiment.
  • It compares internet users and internet non-users before and after the emergence of Napster.

Repeated Cross Section Risks

  • Repeated cross-sections have a challenge panels don’t: the composition of each group can shift over time.
  • Detect this with a balance table of covariates that are highly predictive of the missing counterfactual \(E[Y(0)|D=1]\)
    • The percent of cat owners is probably irrelevant to trends in potential outcomes
    • But age and income are probably relevant for spending habits
    • We’ll discuss covariates more later, but for now just consider which characteristics are relevant to your outcome
  • Documenting covariates that cannot be affected by the treatment, as in the table on the next slide, is a way to check for compositional changes in the sample (a sketch of the computation follows this list).
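
A minimal sketch of such a balance table, assuming a hypothetical repeated cross-section cex_df with one row per household-year and columns year, internet_user (0/1), age, and income:

balance table sketch for a repeated cross-section
# cex_df is hypothetical: year, internet_user (0/1), age, income per household-year
cex_df |>
  dplyr::mutate(group = ifelse(internet_user == 1, "User", "Non-user")) |>
  dplyr::group_by(year, group) |>
  dplyr::summarise(
    mean_age    = mean(age, na.rm = TRUE),
    mean_income = mean(income, na.rm = TRUE),
    .groups = "drop"
  ) |>
  dplyr::arrange(year, group)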

Repeated Cross Section Risks

Changes Between Internet and Non-Internet Users Over Time

Characteristic            1997               1998               1999
                          User    Non-user   User    Non-user   User    Non-user
Demographics
  Age                     40.2    49.0       42.3    49.0       44.1    49.4
  Income                  $52,887 $30,459    $51,995 $26,189    $49,970 $26,649
  High school graduate    0.18    0.31       0.17    0.32       0.21    0.32
  Some college            0.37    0.28       0.35    0.27       0.34    0.27
  College grad            0.43    0.21       0.45    0.21       0.42    0.20
  Manager                 0.16    0.08       0.16    0.08       0.14    0.08

2X2 DiD & covariates

  • 2x2 relies on parallel trends (PT) and no anticipation (NA) to be causal
  • What if PT is violated and the violation is due to covariates:
    • imbalance between treated and untreated units where covariates are related to outcome dynamics
    • heterogeneous treatment effects related to covariates

2X2 DiD & covariates

As before, for our DiD estimator to be a causal estimate (ATT) we need (note: conditional on covariates X)

  1. No anticipation: \(E[Y_{1}(1)|X,D=1] = E[Y_{1}(0)|X,D=1]\)

  2. Parallel trends: \(0=E[Y_{2}(0)-Y_{1}(0)|X,D=1]-E[Y_{2}(0)-Y_{1}(0)|X,D=0]\)

2X2 DiD & covariates (reg)

Under our assumptions, we can model the outcome evolution and estimate the counterfactual untreated outcome in period 2 for the treated group using regression, and our estimator

\[\hat{\tau}_{DiD} = (\bar{Y}_{12} - \bar{Y}_{11}) - (\bar{Y}_{02} - \bar{Y}_{01}) \]

becomes

\[\hat{\tau}_{DiD} = (\bar{Y}_{12} - \bar{Y}_{11}) - (\bar{\hat{\mu}}(X)_{02} - \bar{\hat{\mu}}(X)_{01}) \]

2X2 DiD & covariates (reg)

Using regression, the \(\hat{\tau}_{DiD}\) estimate uses the group-time sample means

\[ \bar{Y}_{d,t}=\sum_{i\vert D_i=d,T_i=t}Y_{it}/n_{d,t} \]

and \(\hat{\mu}(x)_{dt}\) estimates the true, unknown

\[ m_{d,t}\equiv\mathbb{E}\left[Y_{t}\vert D=d,X=x\right] \]
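
A minimal sketch of this outcome-regression estimator, assuming a hypothetical two-period data frame dd with one row per unit and columns D (0/1 treatment), a covariate x, and the observed change dY = Y2 - Y1: fit the untreated outcome evolution on the D = 0 units, predict it for the treated, and average.

outcome-regression DiD sketch
# dd (hypothetical): one row per unit with treatment D (0/1), covariate x, change dY = Y2 - Y1
# Step 1: model the untreated outcome evolution E[dY | X, D = 0]
mu0 <- lm(dY ~ x, data = subset(dd, D == 0))

# Step 2: predict the counterfactual untreated change for every unit
dd$mu0_hat <- predict(mu0, newdata = dd)

# Step 3: ATT = mean observed change minus mean predicted untreated change, over treated units
with(subset(dd, D == 1), mean(dY) - mean(mu0_hat))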

2X2 DiD & covariates (reg)

Why not just use a fixed-effects regression with covariates?

\[ Y_{it}=\beta_1+\beta_2T_i+\beta_3D_i+\tau(T_i\times D_i)+\theta'X_i+\epsilon_{it} \]

  • assumes homogeneous treatment effects
  • trends will not be \(X\)-specific

2X2 DiD & covariates (IPW)

Consider the following expression, with propensity \(\pi=\mathbb{P}(D=1\vert X)\) and \(\rho(D)=\frac{D-\pi}{\pi\left(1-\pi\right)}\):

\[ \begin{align*} E[\rho(D)\left(Y_{t=2}-Y_{t=1}\right)|X] & =\\ E[\rho(D)\left(Y_{t=2}-Y_{t=1}\right)|X,D=1]\pi+E[\rho(D)\left(Y_{t=2}-Y_{t=1}\right)|X,D=0]\left(1-\pi\right) & =\\ E[\frac{1}{\pi}\left(Y_{t=2}-Y_{t=1}\right)|X,D=1]\pi+E[\frac{-1}{\left(1-\pi\right)}\left(Y_{t=2}-Y_{t=1}\right)|X,D=0]\left(1-\pi\right) & =\\ E[Y_{t=2}-Y_{t=1}|X,D=1]-E[Y_{t=2}-Y_{t=1}|X,D=0] & =\\ E[Y_{t=2}^{1}-Y_{t=1}^{0}|X,D=1] \end{align*} \]

2X2 DiD & covariates (IPW)

so \(E[\rho(D)\left(Y_{t=2}-Y_{t=1}\right)|X]=E[Y_{t=2}^{1}-Y_{t=1}^{0}|X,D=1]\) and then averaging over the covariates:

\[ \begin{align*} E[Y_{t=2}^{1}-Y_{t=1}^{0}|D=1] & =\int E[Y_{t=2}^{1}-Y_{t=1}^{0}|X,D=1]d\mathbb{P}(x\vert D=1)\\ & =\int E[\rho(D)\left(Y_{t=2}-Y_{t=1}\right)|X]d\mathbb{P}(x\vert D=1)\\ & =E[\rho(D)\left(Y_{t=2}-Y_{t=1}\right)\frac{\mathbb{P}(D=1\vert X)}{\mathbb{P}(D=1)}]\\ & =E[\frac{Y_{t=2}-Y_{t=1}}{\mathbb{P}(D=1)}\frac{D-\mathbb{P}(D=1\vert X)}{1-\mathbb{P}(D=1\vert X)}] \end{align*} \]
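
The last line suggests a plug-in estimator (a sketch using the same hypothetical dd as in the outcome-regression sketch): estimate the propensity score with a logistic regression and average the re-weighted outcome changes.

IPW DiD sketch
# Step 1: propensity score pi(X) = P(D = 1 | X)
ps_fit <- glm(D ~ x, data = dd, family = binomial())
pi_hat <- predict(ps_fit, type = "response")

# Step 2: sample analogue of E[ (Y2 - Y1) / P(D=1) * (D - pi(X)) / (1 - pi(X)) ]
mean( dd$dY / mean(dd$D) * (dd$D - pi_hat) / (1 - pi_hat) )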

2X2 DiD & covariates (DR)

We can also use a doubly robust estimator:

\[ \mathbb{E}\left[\left(\frac{D}{\mathbb{E}[D]}-\frac{\frac{\pi(1-D)}{1-\pi}}{\mathbb{E}\left[\frac{\pi(1-D)}{1-\pi}\right]}\right)(\Delta Y-\hat{\mu}_{0\Delta}(X))\right] \]

for \(\pi=\mathbb{P}(D=1\vert X)\) and \(\hat{\mu}_{0\Delta}(X)\) an estimate of the untreated outcome evolution \(\mathbb{E}[\Delta Y\vert X,D=0]\).
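
Combining the two previous sketches gives a hand-rolled doubly robust estimate (again a sketch using the hypothetical dd, pi_hat, and mu0_hat; in practice a dedicated package such as DRDID would be used):

doubly robust DiD sketch
# Weights: treated units weighted by 1/E[D]; untreated by the (normalized) odds pi/(1-pi)
w_treat   <- dd$D / mean(dd$D)
w_control <- (pi_hat * (1 - dd$D)) / (1 - pi_hat)
w_control <- w_control / mean(w_control)

# Sample analogue of the doubly robust expression above
mean((w_treat - w_control) * (dd$dY - dd$mu0_hat))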

2X2 DiD & Fixed Effects

The regression estimator for the 2x2 DiD (2 groups, 2 time periods) is

\[ Y_{it} = \alpha + \beta \cdot \text{Post}_t + \gamma \cdot \text{Treat}_i + \delta \cdot (\text{Post}_t \times \text{Treat}_i) + \epsilon_{it} \]

where \(\delta\) is the DiD estimator: it captures the treatment effect. This model is identical to the DiD formula:

\[ \delta = (Y_{treat, post} - Y_{treat, pre}) - (Y_{control, post} - Y_{control, pre}) \]

2X2 DiD & Fixed Effects

The fixed effect estimator is used when you have more than 2 time periods and/or many units, typically in panel data.

\[ Y_{it} = \alpha_i + \lambda_t + \delta \cdot D_{it} + \epsilon_{it} \]

It generalizes the 2x2 model by removing group and time differences via demeaning.
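
In practice this is a one-line call (a sketch assuming a hypothetical long-format panel panel_df with outcome y, treatment indicator d, and identifiers unit and time):

TWFE estimator sketch
# Two-way fixed effects: unit and time fixed effects, clustered by unit
fixest::feols(y ~ d | unit + time, data = panel_df, cluster = ~ unit)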


2X2 DiD & Fixed Effects

                   2×2 Regression DiD                 Fixed-Effects DiD
Groups             2 (treated/control)                Many units/groups
Controls           Only group/time/treatment          Time and unit fixed effects
Estimator          Coefficient on interaction term    Coefficient on treatment dummy in FE model
Parallel trends    Applies across 2 groups            Applies across time for all units
Flexibility        Limited                            Handles richer settings (e.g. unbalanced panels, staggered timing)

Event Studies & Fixed Effects

It is very common to evaluate the common trend assumption using event studies via a dynamic (TWFE) regression specification including indicators for time relative to treatment. This permits “visual inference” to determine whether the treated group’s trends appear to have deviated from the comparison group’s trends right around the time of treatment.

Event Studies & Fixed Effects

The dynamic (TWFE) regression specification is, for periods \(t\in\{-T,\ldots,T\}\) and treatment beginning at \(t=1\) (so \(t=0\) is the last pre-treatment period):

\[ Y_{it} = \alpha_i + \lambda_t + \sum_{r\ne -1}\delta_r \cdot D_{i}\cdot 1_{t=r+1} + \epsilon_{it} \]

So

\[ \delta_r=\left(\mathbb{E}[Y_{i,r+1}\vert D=1]-\mathbb{E}[Y_{i,r+1}\vert D=0]\right)-\left(\mathbb{E}[Y_{i,0}\vert D=1]-\mathbb{E}[Y_{i,0}\vert D=0]\right) \]

Event Studies & Fixed Effects

In the case where there are no treatment effects but the treated group’s outcomes trend as \(Y^1_{it}=Y^0_{it}=\gamma\cdot t + \epsilon_{it}\) with \(\mathbb{E}[\epsilon_{it}]=0\), while the comparison group has no trend (a parallel-trends violation),

\[ \delta_r=\gamma\cdot(r+1) \]

and the event study looks like:

Event plot
#Set simulation params ----
N <- 100; firstT <- -15; lastT <- 10; slope <- 0.5; seed <- 123

#Generate data ----
set.seed(seed)

# df <- purrr::cross_df(.l = list(t = seq(from = firstT, to = lastT),i = seq(from = 1, to = N)))
df <- tidyr::expand_grid(t = seq(from = firstT, to = lastT),i = seq(from = 1, to = N))

start_dates <- data.frame(i = seq(from =1 , to = N),g = c(rep(1,N/2), rep(Inf, N/2)))

df <- dplyr::left_join(df, start_dates, by = "i"); df$d <- df$g <= df$t

df$y <- slope * df$t * (df$g == 1) + rnorm(NROW(df), mean =0)

# Run TWFE event-study ----
twfe <- fixest::feols(y ~ i(t, g==1, ref=0) | i + t, data = df, cluster = "i") 

eventplotdf <- as.data.frame(twfe$coeftable[1:(lastT-firstT),])
eventplotdf$relativeTime <- c(seq(firstT,-1),seq(1,lastT))-1
eventplotdf$ub <- eventplotdf$Estimate + 1.96 * eventplotdf$`Std. Error`
eventplotdf$lb <- eventplotdf$Estimate - 1.96 * eventplotdf$`Std. Error`

ggplot(eventplotdf,
       aes(x = relativeTime,
           y = Estimate,
           ymax = ub,
           ymin = lb)) +
  geom_point() +
  geom_point(x=-1,y=0) +
  geom_errorbar(width = 0.1) +
  xlab("Relative Time") +
  ggtitle("TWFE Event-study Coefficients") +
  theme(plot.title= element_text(hjust=0.5)) +
  theme_minimal()


Heterogeneity & Differential Timing

Consider the case where we have multiple groups of units, some treated and some perhaps never treated, with heterogeneous effects (i.e. causal effects that differ by group or time) and/or differential timing (treatment starts at different times for different groups).

In this case, the TWFE regression is known not to estimate the causal effect (the ATT) in general.

Heterogeneity & Differential Timing

Along with continuous outcomes \(y_{igt}\), our data consists of (discrete) units \(i\), (discrete) times \(t\), and (discrete) groups, indexed by time first treated \(g\). Under parallel trends, outcomes follow:

\[ \begin{align} y_{igt} = \mu_g + \eta_t + \tau_{gt} D_{gt} + \varepsilon_{igt} \end{align} \]

Where units are indexed by \(i\) and:

  • \(\mu_g\) = group fixed effect: effects constant over all time, by group
  • \(\eta_t\) = time fixed effect: effects constant over all groups, by time
  • \(D_{gt}\equiv \textbf{1}_{g\le t}\): treatment indicator (note - once treated, always treated)
  • \(\tau_{gt}\) = group-time specific treatment effect: i.e. \(\mathbb{E}[Y^1_{gt}-Y^0_{gt}]\) in potential outcomes notation
  • \(\varepsilon_{igt}\) = idiosyncratic error with \(E[\varepsilon_{igt}] = 0\)

The simplest case (again):

\(2\times 2\) DiD: two groups, two periods (\(t\in\{1,2\}\)); one group (A, aka control; \(g=\infty\)) never treated, the other group (B, aka treatment; \(g=2\)) treated in the second period. In the data here we have one unit per group, but in practice there will be \(N_{gt}\) units in a group-time.

Under the assumptions:

  1. parallel trends: that the untreated potential outcomes for both groups have parallel trends.
  2. no anticipation: treatment effects are zero in all periods before treatment actually begins.

then \(\tau_{gt}\) represents a causal treatment effect (the estimand).

We have seen this case already.

The simplest case (again):

Finally, we can use the Frisch-Waugh-Lovell (FWL) Theorem to estimate the TWFE model.

Per the FWL Theorem, the TWFE model is equivalent to the residualized regression: \(y_{igt} - \hat{\mu}_g - \hat{\eta}_t = \tau \tilde{D}_{gt} + \tilde{\varepsilon}_{gt}\)

  1. starting with \(y_{igt} = \mu_g + \eta_t + \tau_{gt}D_{gt} + \varepsilon_{igt}\) regress the output \(y_{igt}\) on fixed effects (say \([\mu_g, \eta_t]\))
    • the fitted values are \(\hat{y}_{igt} = \hat{\mu}_g + \hat{\eta}_t\) and so the residuals are \(\tilde{y}_{igt} = y_{igt} - \hat{\mu}_g -\hat{\eta}_t\)
  2. regress the treatment indicator \(D_{gt}\) on fixed effects (say \([\alpha_g, \delta_t]\))
    • the fitted values are \(\hat{D}_{gt}\) and so the residuals are \(\tilde{D}_{gt} = D_{gt} -\hat{D}_{gt}\)
  3. regress the output residuals on the treatment residuals \(\tilde{y}_{igt} = \tau\tilde{D}_{gt} + \tilde{\varepsilon}_{gt}\)

The simplest case (FWL):

And we see that the calculation per FWL is the same as computed by the other two estimators. Note that the FWL estimate can be calculated directly using the residuals

\[ \beta_{FWL}=\frac{\text{cov}(\tilde{D}_{gt},\tilde{y}_{igt})}{\text{var}(\tilde{D}_{gt})} \]

This is equal to the alternative formulation using the original (un-residualized) outcomes and treatment indicator:

\[ \beta_{FWL}=\frac{\text{cov}(\tilde{D}_{gt},y_{igt})}{\text{cov}(\tilde{D}_{gt},D_{gt})} \]

In general (FWL):

When we regress outcomes on fixed effects using all observations: \[ Y_{igt} = \hat{\mu}_g + \hat{\eta}_t + \epsilon_{igt} \]

We estimate the fixed effects, which encapsulate the parallel trends assumption. The outcome residuals are: \[ \tilde{Y}_{igt} = Y_{igt} - \hat{\mu}_g - \hat{\eta}_t \]

Ideally, the estimated fixed effects \(\hat{\mu}_g + \hat{\eta}_t\) would represent the counterfactual outcome that treated units would have had without treatment (their \(Y_{igt}^0\)); as we show next, this holds only when the fixed-effect estimates are not contaminated by the treated observations.

Goodman-Bacon decomposition

The introduction of differential treatment timing makes it hard to draw a bright line between pre and post treatment periods.

To show this we use the following data:

new data - 3 groups x 10 periods
dat4_rev = data.frame(
  group_id = rep(1:3, times = 10),
  period_id = rep(1:10, each = 3)
) |>
  within({
    treated = (group_id == 2 & period_id >= 5) | (group_id == 3 & period_id >= 8)
    btrue = ifelse(treated & group_id == 3, 4, ifelse(treated & group_id == 2, 2, 0))
    outcome = group_id + 1 * period_id + btrue * treated
  }) |> 
    dplyr::mutate(
    treated = as.numeric(treated)
    , treatment_effect = 
      dplyr::case_when(
        group_id == 2 & period_id >= 5 ~ 2
        , group_id == 3 & period_id >= 8 ~ 4
        , TRUE ~ 0
      )
  )
dat4_rev |> gt::gt()
group_id period_id outcome btrue treated treatment_effect
1 1 2 0 0 0
2 1 3 0 0 0
3 1 4 0 0 0
1 2 3 0 0 0
2 2 4 0 0 0
3 2 5 0 0 0
1 3 4 0 0 0
2 3 5 0 0 0
3 3 6 0 0 0
1 4 5 0 0 0
2 4 6 0 0 0
3 4 7 0 0 0
1 5 6 0 0 0
2 5 9 2 1 2
3 5 8 0 0 0
1 6 7 0 0 0
2 6 10 2 1 2
3 6 9 0 0 0
1 7 8 0 0 0
2 7 11 2 1 2
3 7 10 0 0 0
1 8 9 0 0 0
2 8 12 2 1 2
3 8 15 4 1 4
1 9 10 0 0 0
2 9 13 2 1 2
3 9 16 4 1 4
1 10 11 0 0 0
2 10 14 2 1 2
3 10 17 4 1 4

Goodman-Bacon decomposition

new data plotted
require(ggplot2, quietly = TRUE)
#theme_set(theme_bw(base_size = 22))
ggplot(dat4_rev, aes(x = period_id, y = outcome, col = factor(group_id))) +
  geom_point() + geom_line() +
  geom_vline(xintercept = c(4.5, 7.5), lty = 2) +
  scale_x_continuous(breaks = scales::pretty_breaks()) +
  labs(x = "Time period", y = "Outcome variable", col = "ID") +
  theme_light(base_size = 18) 

Goodman-Bacon decomposition

Here we see that our simulation includes two distinct treatment periods. The first treatment occurs at period 5, where group_id=2’s trendline jumps by 2 units. The second treatment occurs at period 8, where group_id=3’s trendline jumps by 4 units. In contrast, group_id=1 remains untreated for the duration of the experiment.

It’s not immediately clear how to calculate the ATT. For example, how should the late-treated unit (group_id=3) regard the early-treated unit (group_id=2)? Can the latter be used as a control group for the former? After all, they didn’t receive treatment at the same time… but, on the other hand, group_id=2’s path was already altered by the initial treatment wave.

Goodman-Bacon decomposition

Start by estimating a simple TWFE model (three ways):

TWFE estimated with new data
# (1) using fixest::feols directly
traditional_twfe4_rev <- fixest::feols(outcome ~ treated | group_id + period_id, data = dat4_rev)

# (2) using straight FWL
treatment_on_fe4_rev <- fixest::feols(treated ~ 1 | group_id + period_id, data = dat4_rev)
dat4_rev$treated_resid <- resid(treatment_on_fe4_rev)

outcome_on_fe4_rev <- fixest::feols(outcome ~ 1 | group_id + period_id, data = dat4_rev)
dat4_rev$outcome_resid <- resid(outcome_on_fe4_rev)

# (3) calculate ATT directly
treated <- dat4_rev |>
  dplyr::filter(treated == 1)
true_ATT <- treated |>
  dplyr::pull(treatment_effect) |> mean()

tibble::tibble(
  "True ATT" = true_ATT
  , "Traditional TWFE Estimate" = coef(traditional_twfe4_rev)[1]
  , "FWL Estimated" = lm(outcome_resid ~ treated_resid, data = dat4_rev) |> 
  broom::tidy() |> dplyr::filter(term == 'treated_resid') |> dplyr::pull(estimate)
) |> 
  tidyr::pivot_longer(everything()) |> gt::gt('name')
value
True ATT 2.667
Traditional TWFE Estimate 2.909
FWL Estimated 2.909

Goodman-Bacon decomposition

So the two regression-based estimates agree to many digits, and both are wrong: neither recovers the true ATT of 2.667. What is going on?

The short answer is that each estimate comprises a weighted average of four distinct 2x2 groups (or comparisons):

  1. treated vs untreated
     (i) early treated (Te) vs untreated (U)
     (ii) late treated (Tl) vs untreated (U)

  2. differentially treated
     (i) early treated (Te) vs late control (Cl)
     (ii) late treated (Tl) vs early control (Ce)

We can visualize these four comparison sets as follows:

Goodman-Bacon decomposition

Goodman-Bacon decomposition plot
rbind(
  dat4_rev |> subset(group_id %in% c(1,2)) |> 
    transform(role = ifelse(group_id==2, "Treatment", "Control"), comp = "1.1. Early vs Untreated"),
  dat4_rev |> subset(group_id %in% c(1,3)) |> 
    transform(role = ifelse(group_id==3, "Treatment", "Control"), comp = "1.2. Late vs Untreated"),
  dat4_rev |> subset(group_id %in% c(2,3) & period_id<8) |> 
    transform(role = ifelse(group_id==2, "Treatment", "Control"), comp = "2.1. Early vs Late"),
  dat4_rev |> subset(group_id %in% c(2:3) & period_id>4) |> 
    transform(role = ifelse(group_id==3, "Treatment", "Control"), comp = "2.2. Late vs Early")
) |>
  ggplot(aes(period_id, outcome, group = group_id, col = factor(group_id), lty = role)) +
  geom_point() + geom_line() +
  facet_wrap(~comp) +
  scale_x_continuous(breaks = scales::pretty_breaks()) +
  scale_linetype_manual(values = c("Control" = 5, "Treatment" = 1)) +
  labs(x = "Time period", y = "Outcome variable", col = "ID", lty = "Role") +
  theme_light(base_size = 14) 

Goodman-Bacon decomposition

The Goodman-Bacon decomposition reveals a fundamental issue with two-way fixed effects (TWFE) difference-in-differences estimators when treatment timing varies across units:

the overall treatment effect estimate is a weighted average of all possible 2×2 DD comparisons, including problematic ones where already-treated units serve as controls for newly-treated units.
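
These comparisons and their weights can be computed directly (a sketch assuming the bacondecomp package; bacon() takes a two-sided formula plus the id and time variables, and to our understanding returns one row per 2x2 comparison with its estimate and weight):

Goodman-Bacon decomposition of the TWFE estimate
# Decompose the TWFE coefficient into its constituent 2x2 comparisons
bacon_out <- bacondecomp::bacon(
  outcome ~ treated,
  data     = dat4_rev,
  id_var   = "group_id",
  time_var = "period_id"
)
bacon_out                                    # one row per 2x2 comparison
sum(bacon_out$estimate * bacon_out$weight)   # should reproduce the TWFE estimate (~2.909)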

Goodman-Bacon decomposition

How do we explain the failure of TWFE regression to recover the correct ATT when we have differential treatment timing or heterogeneous effects? One explanation is that the treated outcomes contaminate the estimation of the fixed effects.

Looking at the group-time outcomes \(y_{gt}\) (since we have just one unit of data per group-time):

show data by group-time
dmnms <- 
  list(
    paste("group",as.character(1:3))
    , as.character(1:10)
    )
dat4_rev |> dplyr::arrange(group_id) |> 
  dplyr::pull(outcome) |> round(2) |> 
  matrix(nrow = 3,byrow =T, dimnames = dmnms) |> 
  as.data.frame() |> 
  knitr::kable("html",digits=2,table.attr = "style='width:40%;'") |> 
  kableExtra::column_spec(column = 1, width = "5em") |> 
  kableExtra::column_spec(column = 2:11, width = "3em") |> 
  kableExtra::kable_styling(full_width = F) |> 
  kableExtra::footnote(footnote_as_chunk=T, general = "group 1 is never treated.")
1 2 3 4 5 6 7 8 9 10
group 1 2 3 4 5 6 7 8 9 10 11
group 2 3 4 5 6 9 10 11 12 13 14
group 3 4 5 6 7 8 9 10 15 16 17
Note: group 1 is never treated.

Outcome data by group-time

Goodman-Bacon decomposition

By definition, fixed effects are constant within a group (across time) and within a time period (across groups). By convention, the group effect of group 1 and the time effect of period 1 are normalized to zero. Regressing the outcome on the fixed effects should estimate those effects, and the residual outcomes should then be just the treatment effects; but this does not happen if we use all the outcomes in the regression:

show residuals by group-time using all outcome data
dat4_rev |> dplyr::arrange(group_id) |> 
  dplyr::pull(outcome_resid) |> round(2) |> 
  matrix(nrow = 3,byrow =T, dimnames = dmnms) |> 
  as.data.frame() |> 
  knitr::kable("html", digits=2, table.attr = "style='width:40%;'") |> 
  kableExtra::column_spec(column = 1, width = "5em") |> 
  kableExtra::column_spec(column = 2:11, width = "3em") |> 
  kableExtra::kable_styling(full_width = F) |> 
  kableExtra::footnote(footnote_as_chunk=T, general = "group 1 is never treated.")
1 2 3 4 5 6 7 8 9 10
group 1 0.8 0.8 0.8 0.8 0.13 0.13 0.13 -1.2 -1.2 -1.2
group 2 -0.4 -0.4 -0.4 -0.4 0.93 0.93 0.93 -0.4 -0.4 -0.4
group 3 -0.4 -0.4 -0.4 -0.4 -1.07 -1.07 -1.07 1.6 1.6 1.6
Note: group 1 is never treated.

Residual outcome data by group-time, fixed effects estimated using all outcomes

Goodman-Bacon decomposition

However, if we use only the non-treated data to estimate the fixed effects, we get what we expect under the parallel trends assumption: after removing the fixed effects, all that remains are the treatment effects (easy to see with only one unit per group-time).

show residuals by group-time using only non-treated outcome data
# Stage 1: Estimate fixed effects using ONLY UNTREATED observations
untreated_data4_rev <- dat4_rev[dat4_rev$treated == 0, ]
stage1 <- fixest::feols(outcome ~ 1 | group_id + period_id, data = untreated_data4_rev)
# compute residuals
dat4_rev$outcome_resid_gardner <- 
  dat4_rev$outcome - predict(stage1, newdata = dat4_rev)

dat4_rev |> dplyr::arrange(group_id) |> 
  dplyr::pull(outcome_resid_gardner) |> round(2) |> 
  matrix(nrow = 3,byrow =T, dimnames = dmnms) |> 
  as.data.frame() |> 
  knitr::kable("html",digits=2,table.attr = "style='width:40%;'") |> 
  kableExtra::column_spec(column = 1, width = "5em") |> 
  kableExtra::column_spec(column = 2:11, width = "3em") |> 
  kableExtra::kable_styling(full_width = F) |> 
  kableExtra::footnote(footnote_as_chunk=T, general = "group 1 is never treated.")
1 2 3 4 5 6 7 8 9 10
group 1 0 0 0 0 0 0 0 0 0 0
group 2 0 0 0 0 2 2 2 2 2 2
group 3 0 0 0 0 0 0 0 4 4 4
Note: group 1 is never treated.

Residual outcome data by group-time, fixed effects estimated using only non-treated outcomes

Two-stage DiD

We can now regress these residuals against the treatment indicator and estimate the ATT:

stage 2: regress residualized outcome on treatment
# Stage 2: Regress residualized outcome on treatment
stage2 <- fixest::feols(outcome_resid_gardner ~ treated, data = dat4_rev)

cat("Gardner Two-Stage Estimate:", round(coef(stage2)[2], 6), "\n")
Gardner Two-Stage Estimate: 2.667 

Two-stage DiD

This is the ATT as constructed in this example.

This two-stage DiD calculation is implemented in the did2s package, which we call as follows:

estimate treatment effects with 2sdid
# output: asis
res <- did2s::did2s(data = dat4_rev
  , yname = 'outcome', treatment = 'treated'
  , first_stage = ~ 0 | group_id + period_id
  , second_stage = ~i(treated, ref=FALSE)
  , cluster_var = 'group_id', verbose = FALSE
)
# fixest::etable(
#   res, fitstat=c('n'), markdown = "images/" #tex = FALSE
# )
as.list(res) |> broom::tidy() |> gt::gt()
term estimate std.error statistic p.value
treated::1 2.667 0.6285 4.243 0.0002063

Two-stage DiD

and did2s will also prepare an event study plot:

event study using 2sdid
out <- did2s::event_study(
    data = dat4_rev |> 
      dplyr::mutate(
        g = dplyr::case_when(
          group_id == 1 ~ 0    # never treated
          , group_id == 2 ~ 5  # group 2 first treated in period 5
          , group_id == 3 ~ 8  # group 3 first treated in period 8
        )
      )
  , yname = 'outcome'
  , idname = 'group_id'
  , gname = 'g'
  , tname = 'period_id'
  , estimator = "did2s"
)
did2s::plot_event_study(out) + 
  theme_light(base_size = 18) + 
  theme(axis.title = element_text(size = 14, face = "bold"), legend.position="none")

More

Recap

We looked in some detail at other methods used to estimate causal effects, along with methods used for special situations, including:

  • regression adjustment
  • doubly robust estimation
  • matching
  • difference in differences
  • fixed effects and methods for panel data