
Conditional Independence

1. Definition

Definition 1.1    Let \(X, Y\) be r.v.s. defined on \(\mathcal{X} \times \mathcal{Y}\). We denote the joint density \(p(x, y)\), and call \[p(y)=\int_\mathcal{X} p(x, y)\text{d}x\] the marginal density of \(Y\). The conditional density of \(X\) given \(Y\) is defined as any function \(p(x \mid y)\) s.t. \[p(x, y)=p(y)p(x \mid y).\]

If \(p(y)>0\) then the solution is unique and given by the familiar expression \[p(x \mid y)=\frac{p(x, y)}{p(y)}.\]
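
To make the definition concrete, here is a minimal discrete sketch in Python with NumPy (the joint table is made up for illustration): it computes the marginal and conditional densities and checks the defining identity \(p(x, y)=p(y)p(x \mid y)\).

```python
import numpy as np

# Hypothetical 2x3 joint table p(x, y): rows index x, columns index y.
p_xy = np.array([[0.10, 0.20, 0.05],
                 [0.30, 0.15, 0.20]])

p_y = p_xy.sum(axis=0)          # marginal p(y), summing out x
p_x_given_y = p_xy / p_y        # conditional p(x | y), defined where p(y) > 0

# The defining identity p(x, y) = p(y) p(x | y) holds entrywise.
assert np.allclose(p_xy, p_y * p_x_given_y)
print(p_x_given_y)
```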

Definition 1.2    Let \(X, Y\) be r.v.s. defined on \(\mathcal{X} \times \mathcal{Y}\). \(X\) and \(Y\) are marginally independent if \[p(x \mid y)=p(x), \quad \forall x \in \mathcal{X}, y \in \mathcal{Y}, p(y)>0,\] which is equivalent to \[p(x, y)=p(x)p(y).\]

Definition 1.3    Let \(X, Y, Z\) be r.v.s. defined on a Cartesian product space and the joint density be \(p(x, y, z)\). We say that \(X\) and \(Y\) are conditionally independent given \(Z\) if \[p(x \mid y, z)=p(x \mid z), \quad \forall x \in \mathcal{X}, y \in \mathcal{Y}, z \in \mathcal{Z}\ \text{s.t.}\ p(y, z)>0,\] denoted \(X \perp\!\!\!\!\perp Y \mid Z\).

If \(Z\) is degenerate, i.e., there is some \(z\) s.t. \(P(Z=z)=1\), then the definition above is the same as saying that \(X \perp\!\!\!\!\perp Y\).

Example 1.1    Let \(X_1, \ldots, X_k\) be a Markov chain. Then \(X_k\) is independent of \(X_1, \ldots, X_{k-2}\) conditional on \(X_{k-1}\): \[P(X_k=x_k \mid X_{k-1}=x_{k-1}, \ldots, X_1=x_1)=P(X_k=x_k \mid X_{k-1}=x_{k-1})\] for all \(x_k, x_{k-1}, \ldots, x_1\), i.e., \(X_k \perp\!\!\!\!\perp X_1, \ldots, X_{k-2} \mid X_{k-1}\). This is the Markov property, also known as the memoryless property.
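
The Markov property can be checked exactly on a small chain. The sketch below uses a hypothetical two-state chain with made-up initial law and transition matrix: it builds the joint law of \((X_1, X_2, X_3)\) and verifies that \(p(x_3 \mid x_2, x_1)\) depends only on \(x_2\).

```python
import numpy as np

# A hypothetical two-state Markov chain: initial law pi, transition matrix T.
pi = np.array([0.6, 0.4])
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Joint p(x1, x2, x3) = pi[x1] T[x1, x2] T[x2, x3].
joint = pi[:, None, None] * T[:, :, None] * T[None, :, :]

for x1 in range(2):
    for x2 in range(2):
        p_x3_given_x12 = joint[x1, x2] / joint[x1, x2].sum()
        # Markov property: conditioning on (x1, x2) matches conditioning on x2 alone.
        assert np.allclose(p_x3_given_x12, T[x2])
print("X3 is independent of X1 given X2")
```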

Example 1.2    Suppose \(\mathbf{X}_V=(X_1, \ldots, X_p)^\top\) follows a multivariate Gaussian distribution with density \[f(\mathbf{x}_V; \mu, \Sigma)=\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\left[-\frac{1}{2}(\mathbf{x}_V-\mu)^\top\Sigma^{-1}(\mathbf{x}_V-\mu)\right].\] Then \[X_p \mid X_1=x_1, \ldots, X_{p-1}=x_{p-1} \sim \mathcal{N}(\mu_p+\Sigma_{p, -p}\Sigma_{-p, -p}^{-1}(\mathbf{x}_{-p}-\mu_{-p}), \sigma_{pp \cdot 1 \cdots p-1}),\] where \(\sigma_{aa \cdot B}=\sigma_{aa}-\Sigma_{aB}\Sigma_{BB}^{-1}\Sigma_{Ba}\) is the Schur complement of entry \((a, a)\) with respect to the block \(B\) in \(\Sigma\).
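
As a hedged illustration, the following NumPy sketch evaluates the conditional mean and the Schur-complement variance from Example 1.2 for a made-up \(\Sigma\) and \(\mu\); all numbers are arbitrary, and the code simply plugs into the stated formulas.

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical 3-dimensional Gaussian; A @ A.T + 3 I makes Sigma symmetric PD.
A = rng.standard_normal((3, 3))
Sigma = A @ A.T + 3 * np.eye(3)
mu = np.array([1.0, -2.0, 0.5])

x_rest = np.array([0.2, -1.5])           # observed values of (X1, X2)

# Conditional of X3 given (X1, X2): the formulas from Example 1.2 with p = 3.
S_pr = Sigma[2, :2]                      # Sigma_{p,-p}, cross-covariances
S_rr = Sigma[:2, :2]                     # Sigma_{-p,-p}
cond_mean = mu[2] + S_pr @ np.linalg.solve(S_rr, x_rest - mu[:2])
cond_var = Sigma[2, 2] - S_pr @ np.linalg.solve(S_rr, S_pr)
print(cond_mean, cond_var)
```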

Theorem 1.1    Let \(X, Y, Z\) be r.v.s. defined on a Cartesian product space. The following are equivalent:

    (1) \(p(x \mid y, z)=p(x \mid z)\) for all \(x, y, z\) s.t. \(p(y, z)>0\);

    (2) \(p(x, y \mid z)=p(x \mid z)p(y \mid z)\) for all \(x, y, z\) s.t. \(p(z)>0\);

    (3) \(p(x, y, z)=p(y, z)p(x \mid z)\) for all \(x, y, z\) s.t. \(p(z)>0\);

    (4) \(p(z)p(x, y, z)=p(x, z)p(y, z)\) for all \(x, y, z\);

    (5) \(p(x, y, z)=f(x, z)g(y, z)\) for some functions \(f, g\) and all \(x, y, z\).

Proof. We need to show (1) \(\Rightarrow\) (2) \(\Rightarrow\) (3) \(\Rightarrow\) (1), (3) \(\Leftrightarrow\) (4), and (3) \(\Leftrightarrow\) (5).

(i) Fix \(x, y, z\) with \(p(z)>0\). If \(p(y, z)>0\), then (1) together with \(p(x, y \mid z)=p(x \mid y, z)p(y \mid z)\) gives \[p(x, y \mid z)=p(x \mid z)p(y \mid z);\] if \(p(y, z)=0\), then \(p(y \mid z)=0\) and both sides vanish. Hence (1) \(\Rightarrow\) (2).

(ii) We have \[p(x, y, z)=p(x, y \mid z)p(z)=p(x \mid z)p(y \mid z)p(z)=p(y, z)p(x \mid z)\] for all \(x, y, z\) s.t. \(p(z)>0\), i.e., (2) \(\Rightarrow\) (3).

(iii) It is straightforward to verify that (3) \(\Rightarrow\) (1), (3) \(\Leftrightarrow\) (4), and (3) \(\Rightarrow\) (5).

(iv) Suppose \(p(x, y, z)=f(x, z)g(y, z)\) for some functions \(f, g\) and all \(x, y, z\). We have \[p(y, z)=\int_\mathcal{X} p(x, y, z)\text{d}x=g(y, z)\int_\mathcal{X} f(x, z)\text{d}x=:g(y, z)\widetilde{f}(z).\] Then \[p(x, y, z)=\frac{f(x, z)p(y, z)}{\widetilde{f}(z)}.\]

Similarly, \(p(x, z)=f(x, z)\widetilde{g}(z)\) with \(\widetilde{g}(z)=\int_\mathcal{Y} g(y, z)\text{d}y\), so \(p(z)=\widetilde{f}(z)\widetilde{g}(z)\) and \(p(x \mid z)=f(x, z)/\widetilde{f}(z)\). Hence \[p(x \mid y, z)=\frac{p(x, y, z)}{p(y, z)}=\frac{f(x, z)}{\widetilde{f}(z)}=p(x \mid z),\] which shows (5) \(\Rightarrow\) (1), and therefore (5) \(\Rightarrow\) (3) by (i) and (ii).

\(\square\)
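
A quick numerical sanity check of Theorem 1.1: the sketch below constructs a discrete joint with \(X \perp\!\!\!\!\perp Y \mid Z\) by design (random made-up tables, arbitrary dimensions) and verifies conditions (4) and (5) directly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Build a joint with X ⊥ Y | Z by construction: p(x, y, z) = p(z) p(x|z) p(y|z).
nx, ny, nz = 2, 3, 2
p_z = rng.dirichlet(np.ones(nz))
p_x_z = rng.dirichlet(np.ones(nx), size=nz).T   # (nx, nz), columns sum to 1
p_y_z = rng.dirichlet(np.ones(ny), size=nz).T   # (ny, nz)
joint = np.einsum('z,xz,yz->xyz', p_z, p_x_z, p_y_z)

p_xz = joint.sum(axis=1)   # p(x, z)
p_yz = joint.sum(axis=0)   # p(y, z)

# Condition (4): p(z) p(x, y, z) = p(x, z) p(y, z) for all x, y, z.
assert np.allclose(p_z[None, None, :] * joint,
                   p_xz[:, None, :] * p_yz[None, :, :])

# Condition (5): a factorization with f(x, z) = p(x, z) and g(y, z) = p(y | z).
assert np.allclose(joint, p_xz[:, None, :] * (p_yz / p_z)[None, :, :])
print("conditions (4) and (5) hold")
```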

2. Graphoid Axioms

Theorem 2.1 (Graphoid Axioms)    Conditional independence satisfies the following properties:

    (1) Symmetry: \(X \perp\!\!\!\!\perp Y \mid Z \Rightarrow Y \perp\!\!\!\!\perp X \mid Z\);

         (\(X \perp\!\!\!\!\perp Y \Rightarrow Y \perp\!\!\!\!\perp X\))

    (2) Decomposition: \(X \perp\!\!\!\!\perp Y, W \mid Z \Rightarrow X \perp\!\!\!\!\perp Y \mid Z\);

         (\(X \perp\!\!\!\!\perp Y, W \Rightarrow X \perp\!\!\!\!\perp Y\) and \(X \perp\!\!\!\!\perp W\))

    (3) Weak union: \(X \perp\!\!\!\!\perp Y, W \mid Z \Rightarrow X \perp\!\!\!\!\perp W \mid Y, Z\);

         (\(X \perp\!\!\!\!\perp Y, W \Rightarrow X \perp\!\!\!\!\perp W \mid Y\) and \(X \perp\!\!\!\!\perp Y \mid W\))

    (4) Contraction: \(X \perp\!\!\!\!\perp Y \mid Z\) and \(X \perp\!\!\!\!\perp W \mid Y, Z \Rightarrow X \perp\!\!\!\!\perp Y, W \mid Z\);

         (\(X \perp\!\!\!\!\perp Y\) and \(X \perp\!\!\!\!\perp W \mid Y \Rightarrow X \perp\!\!\!\!\perp Y, W\))

    (5) Intersection: if \(p(x, y, z, w)>0\), then \(X \perp\!\!\!\!\perp W \mid Y, Z\) and \(X \perp\!\!\!\!\perp Y \mid W, Z \Rightarrow X \perp\!\!\!\!\perp Y, W \mid Z\).

Proof.

(i) Since \(X \perp\!\!\!\!\perp Y \mid Z\), then \(p(x \mid y, z)=p(x \mid z)\), and thus \(p(x, y, z)=p(x, z)p(y \mid z)\). Hence, \[p(y \mid x, z)=\frac{p(x, y, z)}{p(x, z)}=\frac{p(x, z)p(y \mid z)}{p(x, z)}=p(y \mid z),\] i.e., \(Y \perp\!\!\!\!\perp X \mid Z\).

(ii) Since \(X \perp\!\!\!\!\perp Y, W \mid Z\), then \(p(x \mid y, w, z)=p(x \mid z)\), and thus \(p(x, y, w \mid z)=p(x \mid z)p(y, w \mid z)\). Hence, \[p(x, y \mid z)=\int_\mathcal{W} p(x, y, w \mid z)\text{d}w=p(x \mid z)\int_\mathcal{W} p(y, w \mid z)\text{d}w=p(x \mid z)p(y \mid z),\] i.e., \(X \perp\!\!\!\!\perp Y \mid Z\).

(iii) Since \(X \perp\!\!\!\!\perp Y, W \mid Z\), then \(X \perp\!\!\!\!\perp Y \mid Z\) and thus \(p(x \mid y, z)=p(x \mid z)\). Besides, \(p(x \mid y, w, z)=p(x \mid z)\). Hence, \(p(x \mid y, w, z)=p(x \mid y, z)\), i.e., \(X \perp\!\!\!\!\perp W \mid Y, Z\).

(iv) Since \(X \perp\!\!\!\!\perp Y \mid Z\) and \(X \perp\!\!\!\!\perp W \mid Y, Z\), then \(p(x \mid y, z)=p(x \mid z)\) and \(p(x \mid w, y, z)=p(x \mid y, z)\). Hence, \(p(x \mid w, y, z)=p(x \mid z)\), i.e., \(X \perp\!\!\!\!\perp W, Y \mid Z\).

(v) Since \(X \perp\!\!\!\!\perp W \mid Y, Z\) and \(X \perp\!\!\!\!\perp Y \mid W, Z\), Theorem 1.1(5) gives \(p(x, y, w, z)=f(x, y, z)g(w, y, z)\) for some functions \(f, g\), and \(p(x, y, w, z)=\widetilde{f}(x, w, z)\widetilde{g}(y, w, z)\) for some functions \(\widetilde{f}, \widetilde{g}\). By positivity, \[f(x, y, z)=\frac{\widetilde{f}(x, w, z)\widetilde{g}(y, w, z)}{g(w, y, z)}=\frac{\widetilde{f}(x, w_0, z)\widetilde{g}(y, w_0, z)}{g(w_0, y, z)}\] for any fixed \(w_0\), since the left-hand side does not depend on \(w\). Setting \(a(x, z):=\widetilde{f}(x, w_0, z)\) and \(b(y, z):=\widetilde{g}(y, w_0, z)/g(w_0, y, z)\) gives \(f(x, y, z)=a(x, z)b(y, z)\), and thus \[p(x, y, w, z)=a(x, z)\left[b(y, z)g(w, y, z)\right],\] which factorizes as in Theorem 1.1(5) with conditioning variable \(Z\). Therefore, \(X \perp\!\!\!\!\perp Y, W \mid Z\).

\(\square\)

  • Symmetry means that if \(Y\) is irrelevant to \(X\) then \(X\) is irrelevant to \(Y\).

  • Decomposition means that if some information is irrelevant to \(X\) then part of it is also irrelevant to \(X\).

  • Weak union means that if \(Y\) and \(W\) together are irrelevant to \(X\) then learning one of them does not make the other one relevant.

  • Contraction means that if \(Y\) is irrelevant to \(X\), and if learning \(Y\) does not make \(W\) relevant to \(X\), then both \(Y\) and \(W\) are irrelevant to \(X\).
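
The axioms can likewise be verified numerically. The sketch below builds a discrete joint with \(X \perp\!\!\!\!\perp Y, W \mid Z\) by construction (random made-up tables) and checks the decomposition and weak union conclusions via the criterion of Theorem 1.1(4).

```python
import numpy as np

rng = np.random.default_rng(2)

def is_ci(j):
    """Check A ⊥ B | C on a 3-way table j[a, b, c] via p(c) p(a,b,c) = p(a,c) p(b,c)."""
    pa_c = j.sum(axis=1)        # p(a, c)
    pb_c = j.sum(axis=0)        # p(b, c)
    p_c = j.sum(axis=(0, 1))    # p(c)
    return np.allclose(p_c[None, None, :] * j,
                       pa_c[:, None, :] * pb_c[None, :, :])

# Build p(x, y, w, z) = p(z) p(x|z) p(y, w|z), so X ⊥ (Y, W) | Z by construction.
nx, ny, nw, nz = 2, 2, 3, 2
p_z = rng.dirichlet(np.ones(nz))
p_x_z = rng.dirichlet(np.ones(nx), size=nz).T                     # (nx, nz)
p_yw_z = rng.dirichlet(np.ones(ny * nw), size=nz).T.reshape(ny, nw, nz)
joint = np.einsum('z,xz,ywz->xywz', p_z, p_x_z, p_yw_z)

# Decomposition: X ⊥ Y | Z after marginalizing out W.
assert is_ci(joint.sum(axis=2))

# Weak union: X ⊥ W | (Y, Z), merging Y and Z into one conditioning axis.
assert is_ci(joint.transpose(0, 2, 1, 3).reshape(nx, nw, ny * nz))
print("decomposition and weak union hold")
```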

3. Functional Conditional Independence

Remark 3.1    Since the events \(\{Y=y\}\) and \(\{Y=y, h(Y)=h(y)\}\) are equal for any measurable function \(h\), it follows that \[p(x \mid y, z)=p(x \mid y, h(y), z).\]

Example 3.1    Suppose \(X \sim f_\theta\) for some parameter \(\theta \in \Theta\). We say that \(T=t(X)\) is a sufficient statistic for \(\theta\) if the likelihood can be written as \(L(\theta \mid X=x)=f_\theta(x)=g(t(x), \theta)h(x)\). Under a Bayesian interpretation of \(\theta\), this is equivalent to saying \(X \perp\!\!\!\!\perp \theta \mid T\).

Proof. We have \[\begin{aligned} p(\theta \mid x)&=\frac{p(x \mid \theta)p(\theta)}{\int_\Theta p(x \mid \phi)p(\phi)\text{d}\phi} \\&=\frac{g(t(x), \theta)h(x)p(\theta)}{\int_\Theta g(t(x), \phi)h(x)p(\phi)\text{d}\phi} \\&=\frac{g(t(x), \theta)p(\theta)}{\int_\Theta g(t(x), \phi)p(\phi)\text{d}\phi} \\&=p(\theta \mid t(x)). \end{aligned}\] Since \(p(\theta \mid x)=p(\theta \mid x, t(x))\) by Remark 3.1, it follows that \(p(\theta \mid x, t(x))=p(\theta \mid t(x))\), i.e., \(\theta \perp\!\!\!\!\perp X \mid T\).

\(\square\)
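
For a concrete check of Example 3.1, the sketch below uses a hypothetical discrete setup: a uniform prior on \(\theta \in \{0.3, 0.7\}\), three i.i.d. Bernoulli(\(\theta\)) observations, and \(T=\sum_i X_i\). It verifies that the posterior \(p(\theta \mid x)\) depends on \(x\) only through \(t(x)\).

```python
import numpy as np
from itertools import product

# Hypothetical discrete Bayesian model: uniform prior on theta in {0.3, 0.7},
# X = (X1, X2, X3) i.i.d. Bernoulli(theta), sufficient statistic T = X1 + X2 + X3.
thetas = np.array([0.3, 0.7])
prior = np.array([0.5, 0.5])

posterior = {}
for x in product([0, 1], repeat=3):
    lik = np.array([th ** sum(x) * (1 - th) ** (3 - sum(x)) for th in thetas])
    post = lik * prior
    posterior[x] = post / post.sum()

# Sufficiency: the posterior depends on x only through t(x) = sum(x).
for x in posterior:
    for y in posterior:
        if sum(x) == sum(y):
            assert np.allclose(posterior[x], posterior[y])
print("posterior depends on x only through t(x)")
```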

4. Supplement on the Multivariate Gaussian Distribution

As a follow-up to Example 1.2, we collect some supplementary facts about the multivariate Gaussian distribution.

4.1. Conditional Gaussian Distributions

Suppose \(\mathbf{x} \in \mathbb{R}^p\) follows \(\mathcal{N}(\mu, \Sigma)\). Partition \(\mathbf{x}\) and the corresponding mean \(\mu\) as \(\mathbf{x}=\begin{bmatrix} \mathbf{x}_a \\ \mathbf{x}_b \end{bmatrix}\) and \(\mu=\begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix}\), where \(\mathbf{x}_a, \mu_a \in \mathbb{R}^k\) and \(\mathbf{x}_b, \mu_b \in \mathbb{R}^{p-k}\). Partition the covariance matrix accordingly as \(\Sigma=\begin{bmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix}\). By the symmetry of \(\Sigma\), the blocks \(\Sigma_{aa}\) and \(\Sigma_{bb}\) are symmetric, and \(\Sigma_{ba}=\Sigma_{ab}^\top\).

Define the precision matrix \(\Lambda=\Sigma^{-1}\) and partition it as \(\begin{bmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{bmatrix}\). Since the inverse of a symmetric matrix is symmetric, \(\Lambda_{aa}\) and \(\Lambda_{bb}\) are symmetric and \(\Lambda_{ba}=\Lambda_{ab}^\top\).

Note that for a general Gaussian distribution, the quadratic form in the exponent can be expanded as \[(\mathbf{x}-\mu)^\top\Lambda(\mathbf{x}-\mu)=\mathbf{x}^\top\Lambda\mathbf{x}-2\mathbf{x}^\top\Lambda\mu+C,\] where \(C\) collects terms not involving \(\mathbf{x}\).

Now consider the conditional distribution \(p(\mathbf{x}_a \mid \mathbf{x}_b)\), focusing on the quadratic form: \[(\mathbf{x}-\mu)^\top\Lambda(\mathbf{x}-\mu)=(\mathbf{x}_a-\mu_a)^\top\Lambda_{aa}(\mathbf{x}_a-\mu_a)+2(\mathbf{x}_a-\mu_a)^\top\Lambda_{ab}(\mathbf{x}_b-\mu_b)+(\mathbf{x}_b-\mu_b)^\top\Lambda_{bb}(\mathbf{x}_b-\mu_b).\] Since \(\mathbf{x}_b\) is fixed, we may view this as a function of \(\mathbf{x}_a\) alone; it is still a quadratic form, so the conditional distribution \(p(\mathbf{x}_a \mid \mathbf{x}_b)\) is again Gaussian. Rewriting, \[\begin{aligned} (\mathbf{x}-\mu)^\top\Lambda(\mathbf{x}-\mu)&=\mathbf{x}_a^\top\Lambda_{aa}\mathbf{x}_a-2\mathbf{x}_a^\top\Lambda_{aa}\mu_a+2\mathbf{x}_a^\top\Lambda_{ab}\mathbf{x}_b-2\mathbf{x}_a^\top\Lambda_{ab}\mu_b+C \\&=\mathbf{x}_a^\top\Lambda_{aa}\mathbf{x}_a-2\mathbf{x}_a^\top(\Lambda_{aa}\mu_a-\Lambda_{ab}(\mathbf{x}_b-\mu_b))+C \\&=\mathbf{x}_a^\top\Lambda_{aa}\mathbf{x}_a-2\mathbf{x}_a^\top\Lambda_{aa}(\mu_a-\Lambda_{aa}^{-1}\Lambda_{ab}(\mathbf{x}_b-\mu_b))+C. \end{aligned}\]

Therefore \(\boxed{\mathbb{E}[\mathbf{x}_a \mid \mathbf{x}_b]=\mu_a-\Lambda_{aa}^{-1}\Lambda_{ab}(\mathbf{x}_b-\mu_b)}\) and \(\boxed{\text{Var}[\mathbf{x}_a \mid \mathbf{x}_b]=\Lambda_{aa}^{-1}}\).

Now apply the block-matrix inversion identity \[\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1}=\begin{bmatrix} M & -MBD^{-1} \\ -D^{-1}CM & D^{-1}+D^{-1}CMBD^{-1} \end{bmatrix},\] where \[M=(A-BD^{-1}C)^{-1}.\] Since \(\begin{bmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix}^{-1}=\begin{bmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{bmatrix}\), we obtain \[\Lambda_{aa}=(\Sigma_{aa}-\Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})^{-1}\] and \[\Lambda_{ab}=-(\Sigma_{aa}-\Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})^{-1}\Sigma_{ab}\Sigma_{bb}^{-1}.\]

Therefore \(\boxed{\mathbb{E}[\mathbf{x}_a \mid \mathbf{x}_b]=\mu_a+\Sigma_{ab}\Sigma_{bb}^{-1}(\mathbf{x}_b-\mu_b)}\) and \(\boxed{\text{Var}[\mathbf{x}_a \mid \mathbf{x}_b]=\Sigma_{aa}-\Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}}\).
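
The two parameterizations can be checked against each other numerically. The sketch below (random symmetric positive-definite \(\Sigma\), arbitrary \(\mu\) and \(\mathbf{x}_b\)) verifies the block-inversion identity and that the precision-matrix and covariance-matrix forms of \(\mathbb{E}[\mathbf{x}_a \mid \mathbf{x}_b]\) and \(\text{Var}[\mathbf{x}_a \mid \mathbf{x}_b]\) agree.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 4-dimensional Gaussian with a 2 + 2 block partition.
B0 = rng.standard_normal((4, 4))
Sigma = B0 @ B0.T + 4 * np.eye(4)        # symmetric positive definite
mu = rng.standard_normal(4)
Lam = np.linalg.inv(Sigma)               # precision matrix

a, b = slice(0, 2), slice(2, 4)

# Block inversion: Lambda_aa = M and Lambda_ab = -M Sigma_ab Sigma_bb^{-1},
# with M = (Sigma_aa - Sigma_ab Sigma_bb^{-1} Sigma_ba)^{-1}.
M = np.linalg.inv(Sigma[a, a] - Sigma[a, b] @ np.linalg.solve(Sigma[b, b], Sigma[b, a]))
assert np.allclose(Lam[a, a], M)
assert np.allclose(Lam[a, b], -M @ Sigma[a, b] @ np.linalg.inv(Sigma[b, b]))

# The two forms of the conditional mean and variance agree.
x_b = rng.standard_normal(2)
mean_prec = mu[a] - np.linalg.solve(Lam[a, a], Lam[a, b] @ (x_b - mu[b]))
mean_cov = mu[a] + Sigma[a, b] @ np.linalg.solve(Sigma[b, b], x_b - mu[b])
assert np.allclose(mean_prec, mean_cov)
assert np.allclose(np.linalg.inv(Lam[a, a]),
                   Sigma[a, a] - Sigma[a, b] @ np.linalg.solve(Sigma[b, b], Sigma[b, a]))
print("precision-matrix and covariance-matrix forms agree")
```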

4.2. Marginal Gaussian Distributions

\(p(\mathbf{x}_a, \mathbf{x}_b)\)为联合高斯分布,则其边缘概率分布为\[p(\mathbf{x}_a)=\int p(\mathbf{x}_a, \mathbf{x}_b)\text{d}\mathbf{x}_b\]也是一个高斯分布.

Since this involves integrating over \(\mathbf{x}_b\), we collect the terms of the quadratic form involving \(\mathbf{x}_b\) and complete the square: \[\begin{aligned} (\mathbf{x}-\mu)^\top\Lambda(\mathbf{x}-\mu)&=\mathbf{x}_b^\top\Lambda_{bb}\mathbf{x}_b-2\mathbf{x}_b^\top\Lambda_{bb}\mu_b+2\mathbf{x}_b^\top\Lambda_{ba}\mathbf{x}_a-2\mathbf{x}_b^\top\Lambda_{ba}\mu_a+C \\&=\mathbf{x}_b^\top\Lambda_{bb}\mathbf{x}_b-2\mathbf{x}_b^\top(\Lambda_{bb}\mu_b-\Lambda_{ba}(\mathbf{x}_a-\mu_a))+C \\&=\mathbf{x}_b^\top\Lambda_{bb}\mathbf{x}_b-2\mathbf{x}_b^\top\mathbf{m}+C \\&=(\mathbf{x}_b-\Lambda_{bb}^{-1}\mathbf{m})^\top\Lambda_{bb}(\mathbf{x}_b-\Lambda_{bb}^{-1}\mathbf{m})-\mathbf{m}^\top\Lambda_{bb}^{-1}\mathbf{m}+C, \end{aligned}\] where \(\mathbf{m}:=\Lambda_{bb}\mu_b-\Lambda_{ba}(\mathbf{x}_a-\mu_a)\), and \(C\) absorbs all terms not involving \(\mathbf{x}_b\) (in particular the terms in \(\mathbf{x}_a\) alone, which we restore below).

Keeping only the \(\mathbf{x}_b\)-dependent factor of the density, \[\int \exp\left[-\frac{1}{2}\left((\mathbf{x}_b-\Lambda_{bb}^{-1}\mathbf{m})^\top\Lambda_{bb}(\mathbf{x}_b-\Lambda_{bb}^{-1}\mathbf{m})-\mathbf{m}^\top\Lambda_{bb}^{-1}\mathbf{m}\right)\right]\text{d}\mathbf{x}_b \propto \exp\left[\frac{1}{2}\mathbf{m}^\top\Lambda_{bb}^{-1}\mathbf{m}\right],\] because the Gaussian integral over the completed square is a constant independent of \(\mathbf{x}_a\). In the quadratic-form convention used above, integrating out \(\mathbf{x}_b\) therefore contributes \(-\mathbf{m}^\top\Lambda_{bb}^{-1}\mathbf{m}\) to the exponent (note that \(\mathbf{m}\) depends on \(\mathbf{x}_a\)).

Together with the terms in \(\mathbf{x}_a\) absorbed into \(C\) earlier, the quadratic form for the marginal of \(\mathbf{x}_a\) is \[\begin{aligned} &\ \ \ \ \ -\mathbf{m}^\top\Lambda_{bb}^{-1}\mathbf{m}+\mathbf{x}_a^\top\Lambda_{aa}\mathbf{x}_a-2\mathbf{x}_a^\top\Lambda_{aa}\mu_a-2\mathbf{x}_a^\top\Lambda_{ab}\mu_b \\&=-\mathbf{m}^\top\Lambda_{bb}^{-1}\mathbf{m}+\mathbf{x}_a^\top\Lambda_{aa}\mathbf{x}_a-2\mathbf{x}_a^\top(\Lambda_{aa}\mu_a+\Lambda_{ab}\mu_b) \\&=-(\Lambda_{bb}\mu_b-\Lambda_{ba}(\mathbf{x}_a-\mu_a))^\top\Lambda_{bb}^{-1}(\Lambda_{bb}\mu_b-\Lambda_{ba}(\mathbf{x}_a-\mu_a))+\mathbf{x}_a^\top\Lambda_{aa}\mathbf{x}_a-2\mathbf{x}_a^\top(\Lambda_{aa}\mu_a+\Lambda_{ab}\mu_b) \\&=\mathbf{x}_a^\top(\Lambda_{aa}-\Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba})\mathbf{x}_a-2\mathbf{x}_a^\top(\Lambda_{aa}-\Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba})\mu_a+C. \end{aligned}\]

Therefore, \(\boxed{\mathbb{E}[\mathbf{x}_a]=\mu_a}\) and \(\boxed{\text{Var}[\mathbf{x}_a]=(\Lambda_{aa}-\Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba})^{-1}=\Sigma_{aa}}\).
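
Finally, a short numerical check of the Schur-complement duality used in this last step, with a made-up positive-definite \(\Sigma\): inverting the precision-matrix Schur complement recovers \(\Sigma_{aa}\).

```python
import numpy as np

rng = np.random.default_rng(5)

# Verify Var[x_a] = (Lambda_aa - Lambda_ab Lambda_bb^{-1} Lambda_ba)^{-1} = Sigma_aa.
B0 = rng.standard_normal((4, 4))
Sigma = B0 @ B0.T + 4 * np.eye(4)        # symmetric positive definite
Lam = np.linalg.inv(Sigma)

a, b = slice(0, 2), slice(2, 4)
schur = Lam[a, a] - Lam[a, b] @ np.linalg.solve(Lam[b, b], Lam[b, a])
assert np.allclose(np.linalg.inv(schur), Sigma[a, a])
print("marginal covariance equals Sigma_aa")
```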