Potential outcomes selection bias

I have an issue with the derivation of the selection bias from the Rubin potential outcomes framework.

I am looking at this slide, which shows how to go from a difference in means between treated and controls to the ATT plus selection bias.

However, I don't get how to go from Step 2 to Step 3.

Why is consistency invoked here?

And how do you go from a subtraction in Step 1 to an addition in Step 2?

It's really important to distinguish the observed outcome, $$Y_i$$, and the potential outcomes, $$(Y_i(1), Y_i(0))$$.

The observed outcomes are, well, simply the outcomes you observed for each subject.

The potential outcomes are the outcomes you would observe if a subject were given a certain treatment. If a treatment has two versions (say $$0$$ and $$1$$ for control and experimental) then each subject has two potential outcomes, $$Y_i(1)$$ and $$Y_i(0)$$, but only one observed outcome. These potential outcomes are what give the treatment effect a causal interpretation.

Consistency means that the observed outcome of a subject is the potential outcome associated with the treatment the subject actually received. That is,

• If the subject $$i$$ received treatment $$1$$ then $$Y_i = Y_i(1)$$
• If the subject $$i$$ received treatment $$0$$ then $$Y_i = Y_i(0)$$

Or, said differently, if $$D_i$$ is the treatment indicator then by consistency $$D_i = 1 \Rightarrow Y_i = Y_i(1)$$ which gives $$E(Y_i \mid D_i = 1) = E(Y_i(1) \mid D_i = 1)$$

Likewise, for $$D_i = 0$$ we have

$$E(Y_i \mid D_i = 0) = E(Y_i(0) \mid D_i = 0)$$

Consistency is invoked here because the first line uses the observed outcome while the second uses the potential outcomes (which carry a "causal" interpretation).
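As a toy illustration of consistency (hypothetical numbers, base R): each subject carries both potential outcomes, but only the one matching the received treatment is ever observed.

```r
# Hypothetical toy data: both potential outcomes per subject,
# plus the treatment actually received.
po <- data.frame(
  i  = 1:4,
  y1 = c(7, 5, 6, 8),  # potential outcome Y_i(1)
  y0 = c(3, 4, 6, 2),  # potential outcome Y_i(0)
  d  = c(1, 0, 1, 0)   # treatment actually received
)
# Consistency: the observed outcome equals the potential outcome
# for the treatment received.
po$y_obs <- ifelse(po$d == 1, po$y1, po$y0)
po$y_obs
# 7 4 6 2
```

The other potential outcome of each row (e.g. $$Y_i(0)$$ for a treated subject) remains counterfactual: it exists in the notation but never in the data.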

Consistency may seem "obvious", but I think there are experiments for which it may not hold.

Step 2 is simply Step 1 plus the term $$E(Y_i(0) \mid D_i = 1) - E(Y_i(0) \mid D_i = 1)$$, which is $$0$$. Introducing this term lets us decompose the difference in observed outcomes between the two treatment arms, $$A = E(Y_i \mid D_i= 1) - E(Y_i \mid D_i = 0)$$,

as $$B = E(Y_i(1) - Y_i(0) \mid D_i = 1)$$ plus $$C = E(Y_i(0) \mid D_i = 1) - E(Y_i(0) \mid D_i = 0)$$. The term "B" is the "average treatment effect on the treated" (ATT). It's an average because of the expectation, of the treatment effect because of $$Y_i(1) - Y_i(0)$$, and "on the treated" because we take this expectation conditional on $$D_i = 1$$, that is, for the subjects on the experimental arm.
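Spelling out all the steps in one chain (consistency gives the first equality, the add-and-subtract trick the second):

$$\begin{aligned} E(Y_i \mid D_i = 1) - E(Y_i \mid D_i = 0) &= E(Y_i(1) \mid D_i = 1) - E(Y_i(0) \mid D_i = 0) \\ &= E(Y_i(1) \mid D_i = 1) - E(Y_i(0) \mid D_i = 1) \\ &\quad + E(Y_i(0) \mid D_i = 1) - E(Y_i(0) \mid D_i = 0) \\ &= \underbrace{E(Y_i(1) - Y_i(0) \mid D_i = 1)}_{B \text{ (ATT)}} + \underbrace{E(Y_i(0) \mid D_i = 1) - E(Y_i(0) \mid D_i = 0)}_{C \text{ (selection bias)}} \end{aligned}$$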

The term "C" is labelled as "selection bias" which means that subjects on the experimental arm and on the control arm may be different due to selection.

If there were no selection bias (as with a randomized treatment), both groups would be, on average, alike. You could have given the experimental treatment to either group and this would not have changed (again, on average) the results: the groups are said to be exchangeable.

This translates as "the potential outcomes are, on average, the same in the two groups":

$$\begin{gather*} E(Y_i(1) \mid D_i = 1) = E(Y_i(1) \mid D_i = 0) \\ E(Y_i(0) \mid D_i = 1) = E(Y_i(0) \mid D_i = 0) \end{gather*}$$

If there is no selection then $$E(Y_i(0) \mid D_i = 1) - E(Y_i(0) \mid D_i = 0) = 0$$ and the observed difference between both groups (A) is equal to the ATT (B).
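A quick way to see this numerically (a hypothetical base-R sketch, not from the original slide): randomize $$D$$, and the selection-bias term collapses to approximately zero in a large sample, so the naive comparison recovers the true effect.

```r
# Hypothetical simulation: under randomization, D is independent of the
# potential outcomes, so E[Y(0) | D = 1] - E[Y(0) | D = 0] is ~ 0.
set.seed(42)
n  <- 1e5
u  <- rnorm(n, 1, 1)          # would-be confounder
y0 <- u + rnorm(n)            # potential outcome Y(0)
y1 <- 2 + u + rnorm(n)        # potential outcome Y(1); true effect = 2

d_rand <- rbinom(n, 1, 0.5)   # randomized treatment: ignores u entirely

# selection-bias term C
sel_bias <- mean(y0[d_rand == 1]) - mean(y0[d_rand == 0])

# naive comparison of observed outcomes
y_obs <- ifelse(d_rand == 1, y1, y0)
naive <- mean(y_obs[d_rand == 1]) - mean(y_obs[d_rand == 0])

round(sel_bias, 2)  # close to 0
round(naive, 2)     # close to 2, the true effect
```

With randomization the two groups are exchangeable by construction, so ATT, ATC, and the naive comparison all estimate the same quantity.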

• I see. I just missed the algebraic trick. So, conversely we could do the trick with adding $E[Y(1) | D = 0] - E[Y(1) | D = 0]$, which would give $E[Y(1) - Y(0) | D = 0] + E[Y(1) | D = 1] - E[Y(1) | D = 0]$, basically the ATC + selection bias. Thanks a lot!
– giac
Nov 29, 2021 at 15:31

Just to build on @periwinkle's answer, I want to emphasise that this equation is solved by an algebraic trick: adding and subtracting $$E[Y(0) | D = 1]$$, i.e. adding zero. Professors do not make this clear enough. You see this often in math, actually: the use of "tricks" to solve equations (for instance, adding $$0$$ or multiplying by a clever form of $$1$$).

What is remarkable and not often talked about is how the Rubin framework has allowed this "discovery" (naive group average = ATT + selection).

The Rubin framework is only one framework among others (like Pearl's) for understanding causality. But Rubin's genius was to use Neyman's notation $$Y(1), Y(0)$$ and the idea of potential outcomes to derive this result.

It is amazing that by conceiving of causality formally with this notation, like $$Y(1) \mid D = 1$$, you are able to derive such an insight.

As said in the comment, the trick can be used the other way around by adding and subtracting $$E[Y(1) \mid D = 0]$$.

$$\begin{split} &= E[Y(1) \mid D = 1] - E[Y(0) \mid D = 0] \\ &= E[Y(1) \mid D = 1] + \color{red}{E[Y(1) \mid D = 0] - E[Y(1) \mid D = 0]} - E[Y(0) \mid D = 0] \\ &= E[\color{red}{Y(1)} - Y(0) \mid D = 0] + E[Y(1) \mid D = 1] - \color{red}{E[Y(1) \mid D = 0]} \\ &= \text{ATC} + \text{selection bias} \end{split}$$

It is worth appreciating this theoretical result with a simulation. I made a simple one in R.

library(tidyverse)

# U affects both treatment and outcome (Y).

# simulation #
set.seed(123)
n = 1000
u = rnorm(n, 1, 1) # generate latent variable affecting treatment selection
d = rbinom(n, 1, prob = plogis(u)) # treatment
e = rnorm(n, 0, 1)

# homogeneous causal effect #
causal_effect = 2

# Y(0)
y0 = u + e
# Y(1)
y1 = causal_effect + u + e
# Observed Y
yobs = d*causal_effect + u + e
#
df = data.frame(u, d, y0, y1, yobs)
#

# individual causal effect #
df$y1y0 = df$y1 - df$y0
# grand mean (ATE) #
mean(df$y1y0)

# ATT / ATC #
df %>% group_by(d) %>% summarise(mean(y1 - y0))

# naive group comparison #
naive_group_comparison = df %>% group_by(d) %>% summarise(m = mean(yobs)) %>% summarise(m[d == 1] - m[d == 0])

naive_group_comparison

# ATT: E[Y(1) - Y(0) | D = 1] #
att = mean(df$y1[df$d == 1] - df$y0[df$d == 1])
# ATC: E[Y(1) - Y(0) | D = 0] #
atc = mean(df$y1[df$d == 0] - df$y0[df$d == 0])

# unobserved (counterfactual) quantities #
ey0_d1 = mean(df$y0[df$d == 1])  # E[Y(0) | D = 1]
ey1_d0 = mean(df$y1[df$d == 0])  # E[Y(1) | D = 0]

# observed quantities #
ey1_d1 = mean(df$y1[df$d == 1])  # E[Y(1) | D = 1]
ey0_d0 = mean(df$y0[df$d == 0])  # E[Y(0) | D = 0]

# selection bias in the ATT decomposition #
selection_bias_1 = ey0_d1 - ey0_d0
# ATT + selection bias 1 recovers naive_group_comparison #
att + selection_bias_1

# selection bias in the ATC decomposition #
selection_bias_2 = ey1_d1 - ey1_d0
# ATC + selection bias 2 also recovers naive_group_comparison #
atc + selection_bias_2