It's really important to distinguish the **observed** outcome, $Y_i$, from the **potential outcomes**, $(Y_i(1), Y_i(0))$.

The observed outcomes are, well, simply the outcome you observed for each subject.

The potential outcomes are the outcomes *that you would observe* if a subject were given a certain treatment.
If you have two versions of a treatment (say $0$ and $1$ for control and experimental treatment) then each subject has two potential outcomes, $Y_i(1)$ and $Y_i(0)$, but **only one observed outcome**. These potential outcomes matter because they are what gives the treatment effect a causal interpretation.

Consistency means that the observed outcome of a subject is the potential outcome associated with the treatment the subject actually received. That is,

- If the subject $i$ received treatment $1$ then $Y_i = Y_i(1)$
- If the subject $i$ received treatment $0$ then $Y_i = Y_i(0)$

Or, said differently, if $D_i$ is the treatment indicator then by consistency $D_i = 1 \Rightarrow Y_i = Y_i(1)$ which gives
$$
E(Y_i \mid D_i = 1) = E(Y_i(1) \mid D_i = 1)
$$

Likewise, for $D_i = 0$ we have

$$
E(Y_i \mid D_i = 0) = E(Y_i(0) \mid D_i = 0)
$$

Consistency is invoked here because the first line uses the observed outcome while the second uses the potential outcomes (which carry a "causal" interpretation).

Consistency may seem "obvious", but I think there are experiments for which it may not hold.
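
To make the consistency relation concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the normal distribution of $Y_i(0)$, the constant effect of $2$, and the variable names `y0`, `y1`, `d`, `y`. Consistency holds by construction, because the observed outcome is built by picking the potential outcome of the treatment received.

```python
import random
from statistics import mean

random.seed(0)
n = 1000

# Hypothetical potential outcomes: every subject has both of them,
# even though only one will ever be observed.
y0 = [random.gauss(0.0, 1.0) for _ in range(n)]   # Y_i(0)
y1 = [v + 2.0 for v in y0]                        # Y_i(1), assumed effect of 2
d = [random.randint(0, 1) for _ in range(n)]      # treatment indicator D_i

# Consistency: the observed outcome is the potential outcome of the
# treatment the subject actually received.
y = [y1[i] if d[i] == 1 else y0[i] for i in range(n)]

# E(Y | D = 1) against E(Y(1) | D = 1), and the same for D = 0.
obs_treated = mean(y[i] for i in range(n) if d[i] == 1)
pot_treated = mean(y1[i] for i in range(n) if d[i] == 1)
obs_control = mean(y[i] for i in range(n) if d[i] == 0)
pot_control = mean(y0[i] for i in range(n) if d[i] == 0)

print(obs_treated == pot_treated, obs_control == pot_control)
```

Both comparisons hold exactly, since within each arm the observed values and the corresponding potential outcomes are the very same numbers. Note that `y0` for the treated and `y1` for the controls exist in the simulation but would never be seen in real data.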

Step 2 is simply step 1 plus the term $E(Y_i(0) \mid D_i = 1) - E(Y_i(0) \mid D_i = 1)$, which is $0$. Introducing this term makes it possible to decompose the difference in **observed outcomes** between the two treatment arms
$$
A = E(Y_i \mid D_i= 1) - E(Y_i \mid D_i = 0)
$$

as
$$
B = E(Y_i(1) - Y_i(0) \mid D_i = 1)
$$
plus
$$
C = E(Y_i(0) \mid D_i = 1) - E(Y_i(0) \mid D_i = 0)
$$
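
Written as a single chain, using only consistency and the added zero term, the decomposition reads:

$$
\begin{aligned}
A &= E(Y_i \mid D_i = 1) - E(Y_i \mid D_i = 0) \\
  &= E(Y_i(1) \mid D_i = 1) - E(Y_i(0) \mid D_i = 0) \\
  &= E(Y_i(1) \mid D_i = 1) - E(Y_i(0) \mid D_i = 1) + E(Y_i(0) \mid D_i = 1) - E(Y_i(0) \mid D_i = 0) \\
  &= E(Y_i(1) - Y_i(0) \mid D_i = 1) + E(Y_i(0) \mid D_i = 1) - E(Y_i(0) \mid D_i = 0) \\
  &= B + C
\end{aligned}
$$

The second line is consistency, the third adds and subtracts $E(Y_i(0) \mid D_i = 1)$, and the fourth uses linearity of the conditional expectation.
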
The term "B" is "the average treatment effect on the treated" (ATT).
It's an average because of the expectation, of the treatment effect because of $Y_i(1) - Y_i(0)$, and "on the treated" because we take this expectation conditional on $D_i = 1$, that is, for the subjects in the experimental arm.

The term "C" is labelled "selection bias": subjects in the experimental arm and in the control arm may differ because of how they were selected into treatment, so their outcomes under control, $Y_i(0)$, may differ on average.
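
A small simulation can illustrate the three terms. This is only a sketch under made-up assumptions: $Y_i(0)$ is standard normal, the treatment effect is a constant $2$, and subjects with a high $Y_i(0)$ are more likely to end up treated (the selection mechanism). None of these choices come from a real study.

```python
import random
from statistics import mean

random.seed(1)
n = 20_000

y0 = [random.gauss(0.0, 1.0) for _ in range(n)]   # Y_i(0)
y1 = [v + 2.0 for v in y0]                        # Y_i(1), assumed effect of 2

# Assumed selection mechanism: a subject is treated when Y_i(0) plus
# some noise is positive, so high-Y_i(0) subjects are over-represented
# among the treated.
d = [1 if v + random.gauss(0.0, 1.0) > 0 else 0 for v in y0]

# Observed outcome, by consistency.
y = [y1[i] if d[i] == 1 else y0[i] for i in range(n)]

treated = [i for i in range(n) if d[i] == 1]
control = [i for i in range(n) if d[i] == 0]

# A: difference in observed outcomes between the two arms.
A = mean(y[i] for i in treated) - mean(y[i] for i in control)
# B: average treatment effect on the treated (ATT).
B = mean(y1[i] - y0[i] for i in treated)
# C: selection bias, the gap in Y_i(0) between the two arms.
C = mean(y0[i] for i in treated) - mean(y0[i] for i in control)

print(f"A = {A:.3f}, B = {B:.3f}, C = {C:.3f}, B + C = {B + C:.3f}")
```

Under these assumptions B stays at the assumed effect of $2$ while C is clearly positive, so the naive comparison A overstates the effect. B and C can only be computed here because the simulation knows both potential outcomes of every subject.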

If there were no selection bias (as in a randomized trial) both groups would be, on average, alike.
You could have given the experimental treatment to either group and, again on average, the results would not have changed: both groups are said to be **exchangeable**.

This translates as "**the potential outcomes are the same on average in the two groups**":

\begin{gather*}
E(Y_i(1) \mid D_i = 1) = E(Y_i(1) \mid D_i = 0) \\
E(Y_i(0) \mid D_i = 1) = E(Y_i(0) \mid D_i = 0)
\end{gather*}

If there is no selection bias then $E(Y_i(0) \mid D_i = 1) - E(Y_i(0) \mid D_i = 0) = 0$ and the observed difference between both groups (A) is equal to the ATT (B).
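
As a rough check of this last point, here is a sketch in which treatment is assigned by a fair coin flip, independently of the potential outcomes. The distribution of $Y_i(0)$, the constant effect of $2$, and all variable names are illustrative assumptions.

```python
import random
from statistics import mean

random.seed(2)
n = 20_000

y0 = [random.gauss(0.0, 1.0) for _ in range(n)]   # Y_i(0)
y1 = [v + 2.0 for v in y0]                        # Y_i(1), assumed effect of 2

# Randomization: D_i is a coin flip that ignores the potential outcomes.
d = [random.randint(0, 1) for _ in range(n)]
y = [y1[i] if d[i] == 1 else y0[i] for i in range(n)]

treated = [i for i in range(n) if d[i] == 1]
control = [i for i in range(n) if d[i] == 0]

# Exchangeability: each potential outcome has (almost) the same mean
# in both arms. gap_y0 is exactly the selection-bias term C.
gap_y1 = mean(y1[i] for i in treated) - mean(y1[i] for i in control)
gap_y0 = mean(y0[i] for i in treated) - mean(y0[i] for i in control)

A = mean(y[i] for i in treated) - mean(y[i] for i in control)
B = mean(y1[i] - y0[i] for i in treated)

print(f"gap_y1 = {gap_y1:.3f}, gap_y0 = {gap_y0:.3f}")
print(f"A = {A:.3f}, B = {B:.3f}")
```

The two gaps are not exactly zero in a finite sample, only close to it, which is what "on average" means here. As a result A and B agree up to sampling noise.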