
Undirected Graphical Model

1. Undirected Graph

Definition 1.1    An undirected graph is a pair \(\mathcal{G}=(V, E)\), where \(V\) is a finite set of vertices and \(E \subseteq \{\{i, j\}: i, j \in V, i \neq j\}\) is a set of unordered pairs of distinct vertices, called edges. If \(\{i, j\} \in E\), we say \(i\) and \(j\) are adjacent and write \(i \sim j\). The vertices adjacent to \(i\) are called the neighbors of \(i\), and the set of neighbors is called the boundary of \(i\), denoted \(\text{bd}_\mathcal{G}(i)\).

Example 1.1    The graph has \(V=\{1, 2, 3, 4\}\), and \(E=\{\{1, 2\}, \{2, 3\}, \{3, 4\}\}\).

    graph LR
    1((1)) --- 2((2)) --- 3((3)) --- 4((4))
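
As a concrete illustration, the graph of Example 1.1 can be written directly in code. Below is a minimal Python sketch; the edge representation and the helper bd are our own choices for illustration, not a library API.

    # The graph of Example 1.1: edges stored as frozensets, so {i, j} == {j, i}.
    V = {1, 2, 3, 4}
    E = {frozenset({1, 2}), frozenset({2, 3}), frozenset({3, 4})}

    def bd(i, E):
        """Boundary of vertex i: the set of vertices adjacent to i."""
        return {j for e in E if i in e for j in e if j != i}

    print(bd(2, E))  # {1, 3}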

2. Markov Property

We associate a collection of probability distributions with a graph via a Markov property: vertices correspond to random variables, and the absence of an edge corresponds to a conditional independence.

Definition 2.1    Let \(\mathcal{G}=(V, E)\) be an undirected graph, and \(p(x_V)\) be a probability distribution over r.v.s. \(X_V \in \mathcal{X}_V=\times_{v \in V} \mathcal{X}_v\). We say that \(p\) obeys the pairwise Markov property w.r.t. \(\mathcal{G}\) if whenever \(i\) and \(j\) are not adjacent, \[X_i \perp\!\!\!\!\perp X_j \mid X_{V-\{i, j\}}\ [p].\]

Example 2.1    A distribution is pairwise Markov w.r.t. \(\mathcal{G}\) iff \(X_1 \perp\!\!\!\!\perp X_4 \mid X_2, X_3\) and \(X_2 \perp\!\!\!\!\perp X_4 \mid X_1, X_3\).

    graph LR
    1((1)) --- 3((3)) --- 4((4))
    1 --- 2((2)) --- 3

If \(p>0\), then additionally \(X_1, X_2 \perp\!\!\!\!\perp X_4 \mid X_3\), by the intersection (graphoid) axiom.
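
Enumerating the pairwise Markov constraints is mechanical: there is exactly one constraint per missing edge. A small Python sketch for the graph of Example 2.1 (same hand-rolled representation as above):

    from itertools import combinations

    V = {1, 2, 3, 4}
    E = {frozenset({1, 2}), frozenset({1, 3}), frozenset({2, 3}), frozenset({3, 4})}

    # One pairwise Markov constraint per non-adjacent pair.
    for i, j in combinations(sorted(V), 2):
        if frozenset({i, j}) not in E:
            rest = sorted(V - {i, j})
            print(f"X{i} _||_ X{j} | X{rest}")
    # X1 _||_ X4 | X[2, 3]
    # X2 _||_ X4 | X[1, 3]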

Definition 2.2    A set of vertices \(C \subseteq V\) is complete if every pair of distinct vertices \(i, j \in C\) is adjacent. We say \(C\) is a clique if it is an inclusion-maximal complete set. The set of cliques of \(\mathcal{G}\) is denoted \(\mathcal{C}(\mathcal{G})\).

Example 2.2    \(\mathcal{C}(\mathcal{G})=\{\{1, 2\}, \{2, 3\}, \{3, 4\}, \{1, 4\}\}\).

    graph LR
    1((1)) --- 2((2))
    1 --- 4((4)) --- 3((3))
    2 --- 3

Example 2.3    \(\mathcal{C}(\mathcal{G})=\{\{1, 2, 3\}, \{3, 4\}\}\).

    graph LR
    1((1)) --- 3((3)) --- 4((4))
    1 --- 2((2)) --- 3
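
For graphs of this size, the cliques can be found by brute force: enumerate all complete sets and keep the inclusion-maximal ones. A Python sketch (is_complete and cliques are hypothetical helpers written for illustration):

    from itertools import combinations

    def is_complete(S, E):
        """Check that every pair of distinct vertices in S is adjacent."""
        return all(frozenset({i, j}) in E for i, j in combinations(S, 2))

    def cliques(V, E):
        """All inclusion-maximal complete sets (fine for small graphs)."""
        complete = [set(S) for r in range(1, len(V) + 1)
                    for S in combinations(sorted(V), r) if is_complete(S, E)]
        return [C for C in complete if not any(C < D for D in complete)]

    V = {1, 2, 3, 4}
    E = {frozenset({1, 2}), frozenset({1, 3}), frozenset({2, 3}), frozenset({3, 4})}
    print(cliques(V, E))  # [{3, 4}, {1, 2, 3}], matching Example 2.3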

Definition 2.3    We say a distribution \(p(x_V)\) as in definition 2.1 factorizes according to \(\mathcal{G}\) if we can write \[p(x_V)=\prod_{C \in \mathcal{C}(\mathcal{G})} \psi_C(x_C), \qquad \psi_C \geq 0.\]

Example 2.4    We say the distribution factorizes according to the graph if \(p(x_{1234})=\psi_{123}(x_{123})\psi_{34}(x_{34})\), which implies \(X_1, X_2 \perp\!\!\!\!\perp X_4 \mid X_3\).

    graph LR
    1((1)) --- 3((3)) --- 4((4))
    1 --- 2((2)) --- 3

Definition 2.4    A path is a sequence of distinct vertices in which consecutive vertices are adjacent.

Definition 2.5    Let \(A, B, S \subseteq V\). We say \(A\) and \(B\) are separated by \(S\), written \(A \perp_s B \mid S\ [\mathcal{G}]\), if every path from any \(a \in A\) to any \(b \in B\) includes some vertex in \(S\). If \(S=\varnothing\), then \(A \perp_s B\) iff \(A\) and \(B\) are in different connected components of \(\mathcal{G}\).

Example 2.5    \(\{1, 2\} \perp_s \{4\} \mid \{3\}\).

    graph LR
    1((1)) --- 3((3)) --- 4((4))
    1 --- 2((2)) --- 3

Definition 2.6    Let \(W \subseteq V\). The induced subgraph over \(W\), denoted \(\mathcal{G}_W\), has vertex set \(W\) and edge set \(E_W=\{\{i, j\} \in E: i, j \in W\}\).

Example 2.6   

    graph LR
    1((1)) --- 3((3)) --- 4((4))
    1 --- 2((2)) --- 3

\(\mathcal{G}_{\{1, 4\}}\) is

    graph TB
    1((1))
    4((4))

\(\mathcal{G}_{\{1, 2, 3\}}\) is

    graph LR
    1((1)) --- 2((2)) --- 3((3)) --- 1

\(A\) and \(B\) are separated by \(S\) (where \(S \cap A=S \cap B=\varnothing\)) iff \(A\) and \(B\) are separated by \(\varnothing\) in \(\mathcal{G}_{V-S}\).
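
This observation translates directly into an algorithm: delete \(S\) and test reachability. A Python sketch using breadth-first search (separated is a hypothetical helper):

    from collections import deque

    def separated(A, B, S, V, E):
        """Test A _|_s B | S by flood-filling from A in G_{V-S}."""
        keep = V - set(S)  # vertex set of the induced subgraph G_{V-S}
        adj = {v: set() for v in keep}
        for e in E:
            i, j = tuple(e)
            if i in keep and j in keep:
                adj[i].add(j)
                adj[j].add(i)
        seen = set(A) & keep
        queue = deque(seen)
        while queue:
            v = queue.popleft()
            for w in adj[v] - seen:
                seen.add(w)
                queue.append(w)
        return not (seen & set(B))  # separated iff no vertex of B is reached

    V = {1, 2, 3, 4}
    E = {frozenset({1, 2}), frozenset({1, 3}), frozenset({2, 3}), frozenset({3, 4})}
    print(separated({1, 2}, {4}, {3}, V, E))  # True, as in Example 2.5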

Definition 2.7    We say that \(p(x_V)\) as in definition 2.1 is globally Markov w.r.t. \(\mathcal{G}\) if, for all disjoint \(A, B, S \subseteq V\), \[A \perp_s B \mid S\ [\mathcal{G}] \Rightarrow X_A \perp\!\!\!\!\perp X_B \mid X_S\ [p].\]

Theorem 2.1    The global Markov property implies the pairwise Markov property.

Proof. Let \(i \not\sim j\) in \(\mathcal{G}\). Any path from \(i\) to \(j\) must pass through some \(k \in V-\{i, j\}\), so \[i \perp_s j \mid V-\{i, j\}\ [\mathcal{G}] \Rightarrow X_i \perp\!\!\!\!\perp X_j \mid X_{V-\{i, j\}}\ [p].\]

\(\square\)

Theorem 2.2    If \(p(x_V)\) factorizes according to \(\mathcal{G}\), then \(p(x_V)\) is globally Markov w.r.t. \(\mathcal{G}\).

Proof. Suppose \(A \perp_s B \mid S\). Let \(\widetilde{A}=\{v \in V: \{v\} \not\perp_s A \mid S\}\) (i.e., \(\widetilde{A}\) is the set of vertices that reach some vertex in \(A\) by a path avoiding \(S\)). Let \(\widetilde{B}=V-(\widetilde{A} \cup S)\), so that \(\widetilde{A} \perp_s \widetilde{B} \mid S\), \(V=\widetilde{A} \dot\cup \widetilde{B} \dot\cup S\), \(A \subseteq \widetilde{A}\), and \(B \subseteq \widetilde{B}\). Furthermore, every clique is a subset of either \(\widetilde{A} \cup S\) or \(\widetilde{B} \cup S\), since there are no edges between \(\widetilde{A}\) and \(\widetilde{B}\).

Let \(\mathcal{C}_A(\mathcal{G})=\{C \in \mathcal{C}(\mathcal{G}): C \subseteq \widetilde{A} \cup S\}\) and \(\mathcal{C}_B(\mathcal{G})=\mathcal{C}(\mathcal{G})-\mathcal{C}_A(\mathcal{G})\), then \[p(x_V)=\prod_{C \in \mathcal{C}(\mathcal{G})} \psi_C(x_C)=\prod_{C \in \mathcal{C}_A(\mathcal{G})} \psi_C(x_C) \cdot \prod_{D \in \mathcal{C}_B(\mathcal{G})} \psi_D(x_D)=f(x_{\widetilde{A}}, x_S)g(x_{\widetilde{B}}, x_S).\] Hence, \(X_{\widetilde{A}} \perp\!\!\!\!\perp X_{\widetilde{B}} \mid X_S\). Since \(A \subseteq \widetilde{A}\) and \(B \subseteq \widetilde{B}\), the decomposition graphoid axiom gives \(X_A \perp\!\!\!\!\perp X_B \mid X_S\ [p]\). Hence \(p\) is globally Markov w.r.t. \(\mathcal{G}\).

\(\square\)

Theorem 2.3 (Hammersley-Clifford Theorem)    If \(p(x_V)>0\) has the pairwise Markov property w.r.t. \(\mathcal{G}\), then \(p\) factorizes according to \(\mathcal{G}\).

Writing \(M_p(\mathcal{G})\), \(M_g(\mathcal{G})\) and \(M_f(\mathcal{G})\) for the sets of distributions that are pairwise Markov, globally Markov, and factorizing w.r.t. \(\mathcal{G}\), we have \(M_f(\mathcal{G}) \subseteq M_g(\mathcal{G}) \subseteq M_p(\mathcal{G})\). If \(p>0\), then \(p \in M_p(\mathcal{G}) \Rightarrow p \in M_f(\mathcal{G})\), so the three classes coincide.

3. Decomposability

Definition 3.1    Let \(\mathcal{G}\) be an undirected graph, with vertices \(V=A \dot\cup B \dot\cup S\). We say \((A, S, B)\) is a decomposition of \(\mathcal{G}\) if \(\mathcal{G}_S\) is complete, and \(A \perp_s B \mid S\ [\mathcal{G}]\). We say the decomposition is proper if \(A\) and \(B\) are non-empty.

Example 3.1    For \(A=\{1, 2\}\), \(B=\{4\}\) and \(S=\{3\}\), \((A, S, B)\) is a proper decomposition. For \(A=\{1\}\), \(B=\{4\}\) and \(S=\{2, 3\}\), \((A, S, B)\) is a proper decomposition.

    graph LR
    1((1)) --- 3((3)) --- 4((4))
    1 --- 2((2)) --- 3

Example 3.2    The graph below has no proper decomposition.

    graph LR
    1((1)) --- 2((2))
    1 --- 4((4)) --- 3((3))
    2 --- 3

Definition 3.2    We say that \(\mathcal{G}\) is decomposable if it is either complete, or there is a proper decomposition \((A, S, B)\) such that \(\mathcal{G}_{A \cup S}\) and \(\mathcal{G}_{B \cup S}\) are both decomposable.

Example 3.3    For \(A=\{1, 2\}\), \(B=\{4\}\) and \(S=\{3\}\), \((A, S, B)\) is a proper decomposition. Both \(\mathcal{G}_{A \cup S}\) and \(\mathcal{G}_{B \cup S}\) are complete, then \(\mathcal{G}\) is decomposable.

    graph LR
    1((1)) --- 3((3)) --- 4((4))
    1 --- 2((2)) --- 3

Example 3.4    For \(A=\{1, 2\}\), \(B=\{5, 6\}\) and \(S=\{3, 4\}\), \((A, S, B)\) is a proper decomposition. Since \(\mathcal{G}_{A \cup S}\) and \(\mathcal{G}_{B \cup S}\) are decomposable, then \(\mathcal{G}\) is decomposable.

    graph LR
    1((1)) --- 2((2))
    2 --- 3((3)) --- 6((6))
    2 --- 4((4)) --- 6
    3 --- 4
    4 --- 5((5))

Definition 3.3    Let \(\mathcal{C}\) be a collection of subsets of \(V\). We say that \(\mathcal{C}\) has the running intersection property if we can order the elements \(C_1, \ldots, C_k\) s.t. for all \(j \geq 2\), \[C_j \cap \left(\bigcup_{i<j} C_i\right)=C_j \cap C_{\sigma(j)},\] where \(\sigma(j)<j\).

Example 3.5    \(\{1, 2\}\), \(\{2, 3, 4\}\), \(\{3, 4, 6\}\) and \(\{4, 5\}\) satisfy the running intersection property.

| \(j\) | \(C_j\) | Separator set \(S_j\) | \(\sigma(j)\) |
|---|---|---|---|
| \(1\) | \(\{1, 2\}\) | \(-\) | \(-\) |
| \(2\) | \(\{2, 3, 4\}\) | \(\{2\}\) | \(1\) |
| \(3\) | \(\{3, 4, 6\}\) | \(\{3, 4\}\) | \(2\) |
| \(4\) | \(\{4, 5\}\) | \(\{4\}\) | \(2\) or \(3\) |

Example 3.6    \(\{1, 2\}\), \(\{2, 3\}\), \(\{3, 4\}\) and \(\{1, 4\}\) do not satisfy the running intersection property.

| \(j\) | \(C_j\) | Separator set \(S_j\) | \(\sigma(j)\) |
|---|---|---|---|
| \(1\) | \(\{1, 2\}\) | \(-\) | \(-\) |
| \(2\) | \(\{2, 3\}\) | \(\{2\}\) | \(1\) |
| \(3\) | \(\{3, 4\}\) | \(\{3\}\) | \(2\) |
| \(4\) | \(\{1, 4\}\) | \(\{1, 4\}\) | N/A |
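
Checking the running intersection property for a given ordering is a direct transcription of definition 3.3. The Python sketch below (rip_sigma is a hypothetical helper) returns a valid \(\sigma(j)\) for each \(j \geq 2\), or None if the property fails; it reproduces both tables above.

    def rip_sigma(Cs):
        """Return [sigma(2), ..., sigma(k)] (1-based) or None if RIP fails."""
        sigma = []
        for j in range(1, len(Cs)):
            inter = Cs[j] & set().union(*Cs[:j])  # C_j intersected with earlier sets
            k = next((i for i in range(j) if inter == Cs[j] & Cs[i]), None)
            if k is None:
                return None
            sigma.append(k + 1)
        return sigma

    print(rip_sigma([{1, 2}, {2, 3, 4}, {3, 4, 6}, {4, 5}]))  # [1, 2, 2] (Example 3.5)
    print(rip_sigma([{1, 2}, {2, 3}, {3, 4}, {1, 4}]))        # None (Example 3.6)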

Definition 3.4    Let \(\mathcal{G}\) be an undirected graph. A cycle is a sequence of distinct vertices \(v_1, v_2, \ldots, v_k\), \(k \geq 3\), s.t. \(v_1 - v_2 - \cdots - v_k\) is a path and \(v_1 \sim v_k\). A chord of a cycle is an edge between two vertices that are not adjacent on the cycle. A graph is triangulated (or chordal) if it does not contain any chordless cycle of length at least \(4\).

Theorem 3.1    Let \(\mathcal{G}\) be an undirected graph. The following are equivalent:

    (1) \(\mathcal{G}\) is decomposable;

    (2) \(\mathcal{G}\) is triangulated;

    (3) every minimal \((a, b)\)-separator is complete in \(\mathcal{G}\) (a separator \(S\) of \(a\) and \(b\) is minimal if \(a \perp_s b \mid S\) but \(a \not\perp_s b \mid T\) for every \(T \subsetneq S\));

    (4) \(\mathcal{C}(\mathcal{G})\) satisfies the running intersection property.

Proof. We need to show (1) \(\Rightarrow\) (2) \(\Rightarrow\) (3) \(\Rightarrow\) (4) \(\Rightarrow\) (1).

(i) Suppose \(\mathcal{G}\) is decomposable; we induct on \(p=|V|\). If \(\mathcal{G}\) is complete, then it is clearly triangulated, so in particular the result holds for \(p=1\). If \(\mathcal{G}\) is not complete, then by definition there exists a proper decomposition \((A, S, B)\), so that \(\mathcal{G}_{A \cup S}\) and \(\mathcal{G}_{B \cup S}\) have strictly fewer vertices than \(\mathcal{G}\) and are decomposable; by the induction hypothesis, they are triangulated. It remains to rule out chordless cycles meeting both \(A\) and \(B\): such a cycle must pass through \(S\) at two distinct vertices that are not consecutive on the cycle, and since \(S\) is complete, these two vertices give a chord. Hence, \(\mathcal{G}\) is triangulated.

(ii) Suppose \(\mathcal{G}\) is triangulated and \(S\) separates \(a\) and \(b\) minimally. If \(S\) is not complete, then there exist \(s_1, s_2 \in S\) s.t. \(s_1 \not\sim s_2\). By minimality of \(S\), there is a path \(\pi_1\) from \(a\) to \(b\) via \(s_1\), and a path \(\pi_2\) from \(a\) to \(b\) via \(s_2\), neither of which intersects any other element of \(S\). Let \(a'\) be the vertex on \(\pi_1\) closest to \(s_1\) on the same side as \(a\) that also lies on \(\pi_2\); shorten \(\pi_1\) and \(\pi_2\) to start at \(a'\), and do the same on the side of \(b\). This yields a chordless cycle of length at least \(4\) through \(s_1\) and \(s_2\), which is a contradiction. Hence \(S\) is complete.

(iii) If the graph is complete, there is nothing to prove, since \(\mathcal{C}(\mathcal{G})=\{V\}\). Let \(p=|V|\); the result holds trivially for \(p=1\).

Pick \(a \not\sim b\) with a minimal separator \(S\), which is complete by assumption. Let \(\widetilde{A}=\{v \in V: v \not\perp_s a \mid S\}\), and \(\widetilde{B}=V-(\widetilde{A} \cup S)\). Note that \(\widetilde{A} \perp_s \widetilde{B} \mid S\) and \(V=\widetilde{A} \dot\cup \widetilde{B} \dot\cup S\). Since \(a \in \widetilde{A}\) and \(b \in \widetilde{B}\), then \(\widetilde{A} \neq \varnothing\) and \(\widetilde{B} \neq \varnothing\). Hence, \((\widetilde{A}, S, \widetilde{B})\) is a proper decomposition, and \(\mathcal{G}_{\widetilde{A} \cup S}\) and \(\mathcal{G}_{\widetilde{B} \cup S}\) have strictly fewer vertices than \(\mathcal{G}\); one can check that property (3) continues to hold in each.

By the induction hypothesis, \(\mathcal{C}(\mathcal{G}_{\widetilde{A} \cup S})\) and \(\mathcal{C}(\mathcal{G}_{\widetilde{B} \cup S})\) satisfy the running intersection property. The set \(S\) is complete in both subgraphs, so there is some clique \(D\) in \(\mathcal{G}_{\widetilde{A} \cup S}\) that contains \(S\), and likewise some clique of \(\mathcal{G}_{\widetilde{B} \cup S}\) contains \(S\). In addition, each clique in \(\mathcal{G}\) is a clique in one of the two subgraphs. Hence, if we order the cliques of each subgraph to satisfy running intersection, being sure to start the ordering for \(\mathcal{G}_{\widetilde{B} \cup S}\) with a clique that contains \(S\), then the concatenated ordering satisfies running intersection for \(\mathcal{G}\).

(iv) Suppose the cliques \(C_1, \ldots, C_k\) of \(\mathcal{G}\) satisfy the running intersection property. If \(k=1\), then the graph is complete, and thus is decomposable.

Let \[H_k=\bigcup_{i<k} C_i \quad \text{and} \quad S_k=C_k \cap H_k=C_k \cap C_{\sigma(k)}.\] Since \(C_k\) and \(C_{\sigma(k)}\) are complete, \(S_k \subseteq C_k\) is complete.

Suppose there exists an edge connecting \(a \in C_k-S_k\) and \(b \in H_k-S_k\); then \(\{a, b\}\) must be a subset of some clique. But \(\{a, b\} \not\subseteq C_k\), since \(b \in H_k-S_k\) and \(S_k=C_k \cap H_k\) imply \(b \notin C_k\); and \(\{a, b\} \not\subseteq C_j\) for \(j<k\), since \(a \in C_k-S_k\) implies \(a \notin H_k\). Therefore, the edge cannot exist, and thus \(C_k-S_k \perp_s H_k-S_k \mid S_k\).

Note that \(H_k-S_k=H_k \cap C_k^C \neq \varnothing\): otherwise \(H_k \subseteq C_k\), so in particular \(C_1 \subseteq C_k\), contradicting the maximality of the clique \(C_1\).

Similarly, if \(C_k-S_k=\varnothing\), then \(C_k \subseteq S_k \subseteq C_{\sigma(k)}\); since \(C_k\) is a maximal complete set, this forces \(C_k=C_{\sigma(k)}\), which is impossible since the cliques \(C_1, \ldots, C_k\) are distinct. Hence \(C_k-S_k \neq \varnothing\).

Therefore, \((H_k-S_k, S_k, C_k-S_k)\) is a proper decomposition of \(\mathcal{G}\).

Since \(\mathcal{G}_{C_k}\) is decomposable (\(C_k\) is complete), and \(\mathcal{G}_{H_k}\) is decomposable (by the induction hypothesis, as its cliques \(C_1, \ldots, C_{k-1}\) satisfy running intersection), \(\mathcal{G}\) is decomposable by definition.

\(\square\)

Corollary 3.2    If \((A, S, B)\) is a decomposition, and \(\mathcal{G}\) is decomposable, then \(\mathcal{G}_{A \cup S}\) and \(\mathcal{G}_{B \cup S}\) are also decomposable.

Proof. Since \(\mathcal{G}\) is decomposable, it is triangulated. Induced subgraphs of a triangulated graph are triangulated, so \(\mathcal{G}_{A \cup S}\) and \(\mathcal{G}_{B \cup S}\) are triangulated, which implies \(\mathcal{G}_{A \cup S}\) and \(\mathcal{G}_{B \cup S}\) are decomposable.

\(\square\)

Corollary 3.2 reassures us that if we want to check whether a graph is decomposable, we can just go ahead and start decomposing, and we will never have to backtrack.
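
By theorem 3.1, testing decomposability amounts to testing chordality. For instance, with networkx (assuming the library is available; is_chordal is part of its chordal algorithms module):

    import networkx as nx

    G = nx.Graph([(1, 2), (1, 3), (2, 3), (3, 4)])   # graph of Example 3.3
    C4 = nx.Graph([(1, 2), (2, 3), (3, 4), (1, 4)])  # 4-cycle of Example 3.2

    print(nx.is_chordal(G))   # True: triangulated, hence decomposable
    print(nx.is_chordal(C4))  # False: not decomposable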

Definition 3.5    Let \(C_1, \ldots, C_k\) be the cliques of \(\mathcal{G}\) ordered according to running intersection. The \(i\)th separator set is \[S_i=C_i \cap \bigcup_{j<i} C_j,\] where by convention, \(S_1=\varnothing\).

Lemma 3.3    Let \(\mathcal{G}\) be a graph with a decomposition \((A, S, B)\), and \(p\) be a distribution. Then \(p\) factorizes according to \(\mathcal{G}\) iff its margins \(p(x_A, x_S)\) and \(p(x_B, x_S)\) factorize according to \(\mathcal{G}_{A \cup S}\) and \(\mathcal{G}_{B \cup S}\), and \[p(x_V)p(x_S)=p(x_A, x_S)p(x_B, x_S).\]

Proof. \((\Leftarrow)\) Suppose \(p(x_V)p(x_S)=p(x_A, x_S)p(x_B, x_S)\). Since every clique in \(\mathcal{G}_{A \cup S}\) or \(\mathcal{G}_{B \cup S}\) is complete in \(\mathcal{G}\) and hence contained in a clique of \(\mathcal{G}\), and since \(S\) is complete (so the factor \(1/p(x_S)\) can be absorbed into the potential of a clique containing \(S\)), we have \[\begin{aligned} p(x_V)&=p(x_A, x_S)p(x_B, x_S)\frac{1}{p(x_S)} \\&=\prod_{C \in \mathcal{C}(\mathcal{G}_{A \cup S})} \psi_C(x_C) \prod_{C' \in \mathcal{C}(\mathcal{G}_{B \cup S})} \psi_{C'}(x_{C'})\frac{1}{p(x_S)} \\&=\prod_{C \in \mathcal{C}(\mathcal{G})} \widetilde{\psi}_C(x_C). \end{aligned}\] Hence, \(p\) factorizes according to \(\mathcal{G}\).

\((\Rightarrow)\) Suppose \(p \in M_f(\mathcal{G})\), then \(p \in M_g(\mathcal{G})\). Hence, \[p(x_V)p(x_S)=p(x_A, x_S)p(x_B, x_S).\] Besides, \[\begin{aligned} p(x_V)&=\prod_{C \in \mathcal{C}(\mathcal{G})} \psi_C(x_C) \\&=\prod_{C \in \mathcal{C}_A(\mathcal{G})} \psi_C(x_C) \prod_{C' \in \mathcal{C}_B(\mathcal{G})} \psi_{C'}(x_{C'}), \end{aligned}\] where \(\mathcal{C}_A(\mathcal{G})\) is the set of cliques contained in \(A \cup S\), and \(\mathcal{C}_B(\mathcal{G})\) is the rest. Then \[\begin{aligned} p(x_B, x_S)&=\prod_{C' \in \mathcal{C}_B(\mathcal{G})} \psi_{C'}(x_{C'}) \int \prod_{C \in \mathcal{C}_A(\mathcal{G})} \psi_C(x_C)\,\text{d}x_A \\&=\prod_{C' \in \mathcal{C}(\mathcal{G}_{B \cup S})} \widetilde{\psi}_{C'}(x_{C'}) \cdot f(x_S) \\&=\prod_{C \in \mathcal{C}(\mathcal{G}_{B \cup S})} \widehat{\psi}_C(x_C). \end{aligned}\] Therefore, \(p(x_B, x_S)\) factorizes according to \(\mathcal{G}_{B \cup S}\). Similarly, \(p(x_A, x_S)\) factorizes according to \(\mathcal{G}_{A \cup S}\).

\(\square\)

Theorem 3.4    Let \(\mathcal{G}\) be a decomposable graph with cliques \(C_1, \ldots, C_k\). Then \(p(x_V)\) factorizes according to \(\mathcal{G}\) iff for all \(x_V\), \[p(x_V)=\prod_{i=1}^k p(x_{C_i-S_i} \mid x_{S_i})=\prod_{i=1}^k \frac{p(x_{C_i})}{p(x_{S_i})}.\] Furthermore, the quantities \(p(x_{C_i-S_i} \mid x_{S_i})\) are variation independent (i.e., they may jointly take any set of values that would be valid individually), so inference for \(p(x_V)\) can be based on separate inferences for each \(p(x_{C_i})\).

Proof. \((\Leftarrow)\) Since \(S_i \subseteq C_i\), each factor \(p(x_{C_i})/p(x_{S_i})\) is a function of \(x_{C_i}\) alone. Hence \[p(x_V)=\prod_{i=1}^k \frac{p(x_{C_i})}{p(x_{S_i})}=\prod_{C \in \mathcal{C}(\mathcal{G})} \psi_C(x_C),\] so \(p\) factorizes according to \(\mathcal{G}\).

\((\Rightarrow)\) If \(k=1\), then \(C_1=V\), i.e., \(p(x_V)=p(x_{C_1})\). For \(k>1\), let \[H_k=\left(\bigcup_{i<k} C_i\right)-S_k;\] we have already shown that \(C_k-S_k \perp_s H_k \mid S_k\), and that \((H_k, S_k, C_k-S_k)\) is a proper decomposition.

The graph \(\mathcal{G}_{H_k \cup S_k}\) has \(k-1\) cliques, and by lemma 3.3, \(p(x_{H_k}, x_{S_k})\) factorizes according to \(\mathcal{G}_{H_k \cup S_k}\). Then by the induction hypothesis \[p(x_{H_k}, x_{S_k})=\prod_{i=1}^{k-1} p(x_{C_i-S_i} \mid x_{S_i}).\] By lemma 3.3, \(p(x_V)p(x_{S_k})=p(x_{H_k}, x_{S_k})p(x_{C_k})\), so \[p(x_V)=\prod_{i=1}^{k-1} p(x_{C_i-S_i} \mid x_{S_i}) \cdot p(x_{C_k-S_k} \mid x_{S_k}).\] The variation independence follows from the fact that each \(p(x_{C_k-S_k} \mid x_{S_k})\) can take the form of any valid conditional distribution.

\(\square\)
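
A quick numeric check of theorem 3.4 with numpy: build a distribution factorizing over the cliques \(\{1, 2, 3\}\) and \(\{3, 4\}\) from arbitrary positive potentials, then verify \(p(x_V)=p(x_{123})\,p(x_{34})/p(x_3)\). This is only a sketch; the potential tables are random placeholders.

    import numpy as np

    rng = np.random.default_rng(0)
    psi1 = rng.uniform(size=(2, 2, 2, 1))  # potential on (x1, x2, x3)
    psi2 = rng.uniform(size=(1, 1, 2, 2))  # potential on (x3, x4)
    p = psi1 * psi2
    p /= p.sum()  # joint over (x1, x2, x3, x4), factorizing per the graph

    p123 = p.sum(axis=3, keepdims=True)        # p(x1, x2, x3)
    p34 = p.sum(axis=(0, 1), keepdims=True)    # p(x3, x4)
    p3 = p.sum(axis=(0, 1, 3), keepdims=True)  # p(x3)

    assert np.allclose(p, p123 * p34 / p3)  # p = p(C_1) p(C_2) / p(S_2)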

4. Multinomial Model

Theorem 3.4 is useful for statistical inference, since we only need to consider the margins of variables corresponding to cliques. Suppose we have a contingency table with counts \(n(x_V)\); the log-likelihood for a decomposable graph is \[\begin{aligned}l(p; n)&=\sum_{x_V \in \mathcal{X}_V} n(x_V) \ln p(x_V)\\&=\sum_{x_V \in \mathcal{X}_V} n(x_V)\sum_{i=1}^k \ln p(x_{C_i-S_i} \mid x_{S_i})\\&=\sum_{i=1}^k \sum_{x_{C_i} \in \mathcal{X}_{C_i}} n(x_{C_i})\ln p(x_{C_i-S_i} \mid x_{S_i}).\end{aligned}\] Using Lagrange multipliers, we can see that the likelihood is maximized by choosing \[\widehat{p}(x_{C_i-S_i} \mid x_{S_i})=\frac{n(x_{C_i})}{n(x_{S_i})} \Rightarrow \widehat{p}(x_{C_i})=\frac{n(x_{C_i})}{n}.\]
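
In code, the MLE for the decomposable graph with cliques \(\{1, 2, 3\}\) and \(\{3, 4\}\) reduces to ratios of count margins. A numpy sketch with simulated counts (the count table is a random placeholder):

    import numpy as np

    rng = np.random.default_rng(1)
    n = rng.integers(1, 20, size=(2, 2, 2, 2)).astype(float)  # counts n(x_V)
    N = n.sum()

    n123 = n.sum(axis=3, keepdims=True)
    n34 = n.sum(axis=(0, 1), keepdims=True)
    n3 = n.sum(axis=(0, 1, 3), keepdims=True)

    p_hat = (n123 / N) * (n34 / n3)  # \hat p(x_{123}) \hat p(x_4 | x_3)

    assert np.isclose(p_hat.sum(), 1.0)
    assert np.allclose(N * p_hat.sum(axis=3, keepdims=True), n123)  # margins match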

Theorem 4.1    Let \(\mathcal{G}\) be an undirected graph, and suppose we observe counts \(n(x_V)\) with total \(n\). Then the MLE under \(\mathcal{G}\) is the unique distribution factorizing according to \(\mathcal{G}\) for which \[n\widehat{p}(x_C)=n(x_C)\] for all cliques \(C \in \mathcal{C}(\mathcal{G})\).

4.1. Iterative Proportional Fitting (IPF) Algorithm

The iterative proportional fitting (IPF) algorithm, a.k.a. the iterative proportional scaling (IPS) algorithm, starts with a discrete distribution that satisfies the Markov property for the graph \(\mathcal{G}\) (usually the uniform distribution, under which everything is independent), and then iteratively rescales the joint so that each margin \(p(x_C)\) matches the required margin (moment matching), using the update step below.

Algorithm 1 Iterative Proportional Fitting (IPF) Algorithm.
    function IPF(collection of consistent margins \(q(x_{C_i})\) for sets \(C_1, \ldots, C_k\))
        set \(p(x_V)\) to uniform distribution;
        while \(\max_i \max_{x_{C_i}} |p(x_{C_i})-q(x_{C_i})|>\text{tol}\) do
            for \(i\) in \(1, \ldots, k\) do
                update \(p(x_V)\) to \(p(x_{V-C_i} \mid x_{C_i})q(x_{C_i})\);
            end for
        end while
        return distribution \(p\) with margins \(p(x_{C_i}) \approx q(x_{C_i})\)
    end function
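
Below is a compact numpy sketch of Algorithm 1. The function ipf and its encoding of a margin as a tuple of array axes are our own conventions for illustration, not a library API.

    import numpy as np

    def ipf(margins, shape, tol=1e-10, max_iter=1000):
        """margins maps a tuple of axes C to a target margin table q(x_C)."""
        p = np.ones(shape) / np.prod(shape)  # start from the uniform distribution
        for _ in range(max_iter):
            err = 0.0
            for axes, q in margins.items():
                other = tuple(i for i in range(len(shape)) if i not in axes)
                m = p.sum(axis=other, keepdims=True)  # current margin p(x_C)
                err = max(err, np.abs(m.squeeze() - q).max())
                p = p * np.expand_dims(q, other) / m  # p(x_{V-C} | x_C) q(x_C)
            if err < tol:
                break
        return p

    # Two consistent margins over binary (X1, X2, X3) for cliques {1,2}, {2,3}:
    q12 = np.array([[0.3, 0.2], [0.1, 0.4]])
    q23 = np.array([[0.25, 0.15], [0.35, 0.25]])
    p = ipf({(0, 1): q12, (1, 2): q23}, shape=(2, 2, 2))
    print(np.round(p.sum(axis=2), 4))  # ~ q12
    print(np.round(p.sum(axis=0), 4))  # ~ q23

Each update rescales \(p\) so that its \(C_i\)-margin matches \(q(x_{C_i})\) exactly, which is precisely the step \(p(x_{V-C_i} \mid x_{C_i})q(x_{C_i})\) in Algorithm 1.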

Theorem 4.2    The IPF algorithm does not decrease the likelihood at any iteration, and converges to a solution.

Proof. Pick a margin \(A\), and let \(B=V-A\). The algorithm replaces \[p^{(t)}(x_A, x_B)=p^{(t)}(x_A)p^{(t)}(x_B \mid x_A) \mapsto q(x_A)p^{(t)}(x_B \mid x_A).\] The log-likelihood is \[l(p; n)=\sum_{x_V} n(x_V)\ln p^{(t)}(x_A, x_B)=\sum_{x_V} n(x_V)\ln p^{(t)}(x_A)+\sum_{x_V} n(x_V)\ln p^{(t)}(x_B \mid x_A),\] where the first term is maximized by the update (which sets the margin \(p(x_A)\) to \(q(x_A)=n(x_A)/n\)), while the second term is unchanged.

Since the log-likelihood is concave and bounded above, and each update does not decrease it, the algorithm converges to the maximum.

\(\square\)

The algorithm is closely related to the message passing algorithm.