Potential outcomes selection bias - Cross Validated (question by giac, 2021-11-29, https://stats.stackexchange.com/q/554113)
<p>I have an issue with the derivation of the <strong>selection bias</strong> from the Rubin potential outcomes framework.</p>
<p>I am looking at <a href="https://www.mattblackwell.org/files/teaching/s02-experiment-handout.pdf" rel="nofollow noreferrer">these slides</a>, which show how to go from a difference in means between treated and controls to the ATT + selection bias.</p>
<p>However, I don't get how to go from Step 2 to Step 3.</p>
<p>Why is <strong>consistency</strong> invoked here?</p>
<p>And how do you go from a <strong>subtraction</strong> in Step 1 to an <strong>addition</strong> in Step 2?</p>
<p><a href="https://i.stack.imgur.com/enNoZ.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/enNoZ.png" alt="enter image description here" /></a></p>
Answer by periwinkle (2021-11-29):
<p>It's really important to distinguish the <strong>observed</strong> outcome, <span class="math-container">$Y_i$</span>, and the <strong>potential outcomes</strong>, <span class="math-container">$(Y_i(1), Y_i(0))$</span>.</p>
<p>The observed outcomes are, well, simply the outcome you observed for each subject.</p>
<p>The potential outcomes are the outcomes <em>that you would observe</em> if a patient was given a certain treatment.
If you have two versions of a treatment (say <span class="math-container">$0$</span> and <span class="math-container">$1$</span> for control and experimental treatment) then each subject has two potential outcomes, <span class="math-container">$Y_i(1)$</span> and <span class="math-container">$Y_i(0)$</span>, but <strong>only one observed outcome</strong>. The potential outcomes are what allow a causal interpretation of the treatment effect.</p>
<p>Consistency means that the observed outcome of a subject is the potential outcome associated with the treatment the subject actually received. That is,</p>
<ul>
<li>If the subject <span class="math-container">$i$</span> received treatment <span class="math-container">$1$</span> then <span class="math-container">$Y_i = Y_i(1)$</span></li>
<li>If the subject <span class="math-container">$i$</span> received treatment <span class="math-container">$0$</span> then <span class="math-container">$Y_i = Y_i(0)$</span></li>
</ul>
<p>Or, said differently, if <span class="math-container">$D_i$</span> is the treatment indicator then by consistency <span class="math-container">$D_i = 1 \Rightarrow Y_i = Y_i(1)$</span> which gives
<span class="math-container">$$
E(Y_i \mid D_i = 1) = E(Y_i(1) \mid D_i = 1)
$$</span></p>
<p>Likewise, for <span class="math-container">$D_i = 0$</span> we have</p>
<p><span class="math-container">$$
E(Y_i \mid D_i = 0) = E(Y_i(0) \mid D_i = 0)
$$</span></p>
<p>Consistency is invoked here because the first line uses the observed outcome while the second uses the potential outcomes (which carry a "causal" interpretation).</p>
<p>Consistency may seem "obvious", but there are experiments for which it may not hold.</p>
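<p>To make consistency concrete, here is a minimal R sketch (simulated data, hypothetical variable names) checking that the observed outcome coincides with the potential outcome under the treatment each subject actually received:</p>

```r
set.seed(1)
n  <- 10
y0 <- rnorm(n)          # potential outcome under control
y1 <- y0 + 2            # potential outcome under treatment
d  <- rbinom(n, 1, 0.5) # treatment actually received

# consistency: Y = D*Y(1) + (1 - D)*Y(0)
yobs <- d * y1 + (1 - d) * y0

# the observed outcome equals the potential outcome that was "switched on"
all(yobs[d == 1] == y1[d == 1])  # TRUE
all(yobs[d == 0] == y0[d == 0])  # TRUE
```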
<p>Step 2 is simply Step 1 plus the term <span class="math-container">$E(Y_i(0) \mid D_i = 1) - E(Y_i(0) \mid D_i = 1)$</span>, which is <span class="math-container">$0$</span>. Introducing this term lets us decompose the difference of <strong>observed outcomes</strong> between the two treatment arms
<span class="math-container">$$
A = E(Y_i \mid D_i= 1) - E(Y_i \mid D_i = 0)
$$</span></p>
<p>as
<span class="math-container">$$
B = E(Y_i(1) - Y_i(0) \mid D_i = 1)
$$</span>
plus
<span class="math-container">$$
C = E(Y_i(0) \mid D_i = 1) - E(Y_i(0) \mid D_i = 0)
$$</span>
The term "B" is "the average treatment effect on the treated" (ATT).
It's an average because of the expectation; "of the treatment effect" because of <span class="math-container">$Y_i(1) - Y_i(0)$</span>; and "on the treated" because the expectation is conditional on <span class="math-container">$D_i = 1$</span>, that is, taken over the subjects in the experimental arm.</p>
<p>The term "C" is labelled "selection bias": subjects in the experimental arm and in the control arm may differ systematically because of how they were selected.</p>
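<p>Written out in full, the add-and-subtract step that turns the subtraction into the sum <span class="math-container">$A = B + C$</span> is:</p>
<p><span class="math-container">\begin{align*}
A &= E(Y_i(1) \mid D_i = 1) - E(Y_i(0) \mid D_i = 0) \\
&= E(Y_i(1) \mid D_i = 1) - E(Y_i(0) \mid D_i = 1) + E(Y_i(0) \mid D_i = 1) - E(Y_i(0) \mid D_i = 0) \\
&= \underbrace{E(Y_i(1) - Y_i(0) \mid D_i = 1)}_{B} + \underbrace{E(Y_i(0) \mid D_i = 1) - E(Y_i(0) \mid D_i = 0)}_{C}
\end{align*}</span></p>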
<p>If there were no selection bias (as in a randomized experiment), both groups would be alike on average.
You could have given the experimental treatment to either group and this would not have changed the results (again, on average): the groups are said to be <strong>exchangeable</strong>.</p>
<p>This translates as "<strong>the potential outcomes are the same on average between the two groups</strong>":</p>
<p><span class="math-container">\begin{gather*}
E(Y_i(1) \mid D_i = 1) = E(Y_i(1) \mid D_i = 0) \\
E(Y_i(0) \mid D_i = 1) = E(Y_i(0) \mid D_i = 0)
\end{gather*}</span></p>
<p>If there is no selection then <span class="math-container">$E(Y_i(0) \mid D_i = 1) - E(Y_i(0) \mid D_i = 0) = 0$</span> and the observed difference between both groups (A) is equal to the ATT (B).</p>
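<p>A quick R sketch (simulated data, assuming a randomized treatment) illustrates the exchangeability claim: with randomization the selection-bias term is approximately zero, so the naive comparison recovers the ATT:</p>

```r
set.seed(42)
n  <- 1e5
u  <- rnorm(n, 1, 1)    # a covariate that affects the outcome
d  <- rbinom(n, 1, 0.5) # randomized treatment: independent of u
y0 <- u + rnorm(n)      # potential outcome under control
y1 <- y0 + 2            # potential outcome under treatment

yobs <- ifelse(d == 1, y1, y0)  # observed outcome (consistency)

naive <- mean(yobs[d == 1]) - mean(yobs[d == 0])      # A
att   <- mean(y1[d == 1] - y0[d == 1])                # B
selection_bias <- mean(y0[d == 1]) - mean(y0[d == 0]) # C

# the decomposition A = B + C holds exactly in-sample,
# and C is close to 0 because treatment was randomized
round(c(naive = naive, ATT = att, selection_bias = selection_bias), 3)
```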
Answer by giac (2021-11-30):
<p>Just to build on @periwinkle's answer, I want to emphasise that this equation is solved by an algebraic trick: adding and subtracting <span class="math-container">$E[Y(0) \mid D = 1]$</span>. Professors do not make this clear enough. You often see such "tricks" in mathematics (for instance, adding <span class="math-container">$0$</span> or multiplying by <span class="math-container">$1$</span> in a convenient form can help solve equations).</p>
<p>What is remarkable and not often talked about is how the Rubin framework has allowed this "discovery" (naive group average = ATT + selection).</p>
<p>Rubin framework is only one framework among others (like Pearl's framework) for understanding causality. But the genius of Rubin is to have used Neyman's notation <span class="math-container">$Y(1), Y(0)$</span> and the idea of potential outcomes to be able to derive this result.</p>
<p>It is amazing that by conceiving of causality formally with this notation, like <span class="math-container">$Y(1) \mid D = 1$</span>, you are able to derive such an insight.</p>
<p>As said in the comment, the trick can be used the other way around, adding and subtracting <span class="math-container">$E[Y(1) \mid D = 0]$</span>:</p>
<p><span class="math-container">\begin{equation}
\begin{split}
&\phantom{=}\; E[Y(1) \mid D = 1] - E[Y(0) \mid D = 0] \\
&= E[Y(1) \mid D = 1] \color{red}{{} + E[Y(1) \mid D = 0] - E[Y(1) \mid D = 0]} - E[Y(0) \mid D = 0] \\
&= E[\color{red}{Y(1)} - Y(0) \mid D = 0] + \big(E[Y(1) \mid D = 1] - \color{red}{E[Y(1) \mid D = 0]}\big) \\
&= ATC + \text{baseline bias}
\end{split}
\end{equation}</span></p>
<p>It is worth checking this theoretical result with a simulation. I made a simple one in R.</p>
<pre><code>library(tidyverse)

# U affects both treatment selection and the outcome Y.
set.seed(123)
n = 1000
u = rnorm(n, 1, 1)                  # latent variable affecting treatment selection
d = rbinom(n, 1, prob = plogis(u))  # treatment indicator
e = rnorm(n, 0, 1)                  # noise

# homogeneous causal effect #
causal_effect = 2
# potential outcome Y(0)
y0 = u + e
# potential outcome Y(1)
y1 = causal_effect + u + e
# observed Y (consistency: Y = D*Y(1) + (1 - D)*Y(0))
yobs = d*causal_effect + u + e

df = data.frame(u, d, y0, y1, yobs)

# individual causal effect #
df$y1y0 = df$y1 - df$y0
# grand mean (ATE) #
mean(df$y1y0)
# ATT / ATC #
df %>% group_by(d) %>% summarise(mean(y1 - y0))

# naive group comparison #
naive_group_comparison = df %>%
  group_by(d) %>%
  summarise(m = mean(yobs)) %>%
  summarise(m[d == 1] - m[d == 0])
naive_group_comparison

# ATT: E[Y(1) - Y(0) | D = 1]
`Y(1)-Y(0) | D=1` = df[df$d == 1, ] %>% summarise(mean(y1 - y0))
# ATC: E[Y(1) - Y(0) | D = 0]
`Y(1)-Y(0) | D=0` = df[df$d == 0, ] %>% summarise(mean(y1 - y0))

# unobserved (counterfactual) quantities #
`Y(0) | D=1` = df[df$d == 1, ] %>% summarise(mean(y0))
`Y(1) | D=0` = df[df$d == 0, ] %>% summarise(mean(y1))
# observed quantities #
`Y(1) | D=1` = df[df$d == 1, ] %>% summarise(mean(y1))
`Y(0) | D=0` = df[df$d == 0, ] %>% summarise(mean(y0))

# selection bias in the ATT decomposition #
selection_bias_1 = `Y(0) | D=1` - `Y(0) | D=0`
# ATT + selection bias 1 reproduces naive_group_comparison #
`Y(1)-Y(0) | D=1` + selection_bias_1

# baseline bias in the ATC decomposition #
selection_bias_2 = `Y(1) | D=1` - `Y(1) | D=0`
# ATC + baseline bias also reproduces naive_group_comparison #
`Y(1)-Y(0) | D=0` + selection_bias_2
</code></pre>