0%

Junction Tree and Message Passing

1. Junction Tree

Definition 1.1    A junction tree is a connected undirected graph without cycles (a tree), which has vertex \(C_i\) that consists of subset of \(V\), and satisfies the property that if \(C_i \cap C_j=S\), then every vertex on the unique path from \(C_i\) to \(C_j\) contains \(S\).

Example 1.1    Given sets \(\{1, 2\}\), \(\{2, 3, 4\}\), \(\{2, 4, 5\}\), \(\{4, 6\}\), \(\{6, 7, 8\}\), we can build a junction tree:

            graph LR
            1((1, 2)) --- 2((2, 3, 4)) --- 3((4, 6))
2 --- 4((2, 4, 5))
3 --- 5((6, 7, 8))
          

Equally, we could use a different ordering: \(\{6, 7, 8\}\), \(\{4, 6\}\), \(\{2, 4, 5\}\), \(\{1, 2\}\), \(\{2, 3, 4\}\).

            graph TB
            1((1, 2))
2((2, 3, 4))
3((4, 6))
1 & 2 & 3 --- 4((2, 4, 5))
3 --- 5((6, 7, 8))
          

Theorem 1.1    If \(\mathcal{T}\) is a junction tree, then its vertices satisfy the running intersection property. Conversely, if \(C_1, \ldots, C_k\) satisfy the running intersection property, they can be arranged into a junction tree.

Proof. Suppose \(\mathcal{T}\) is a junction tree. Pick a leaf \(C_k\) and remove it to obtain \(\mathcal{T}^{-k}\). By the induction hypothesis, the remaining sets, \(C_1, \ldots, C_{k-1}\) satisfy the running intersection property. Pick \(C_{\sigma(k)}\) to be the unique neighbor of \(C_k\) in \(\mathcal{T}\). By the junction tree property, \(C_k \cap C_j \subseteq C_{\sigma(k)}\) and \(C_k \cap C_j \subseteq C_k\) for any \(j<k\), then \[C_k \cap \left(\bigcup_{i<k} C_i\right) \subseteq C_k \cap C_{\sigma(k)}.\] Since \(\displaystyle C_{\sigma(k)} \subseteq \bigcup_{i<k} C_i\), it is obvious that \[C_k \cap \left(\bigcup_{i<k} C_i\right)=C_k \cap C_{\sigma(k)}.\]

Suppose \(C_1, \ldots, C_k\) satisfy the running intersection property, and assume \(\{C_1, \ldots, C_{k-1}\}\) is a junction tree. Let \(C_k \sim C_{\sigma(k)}\). We know \(C_k \cap C_i \subseteq C_{\sigma(k)}\) for all \(i\). If \(C_k \cap C_i=\varnothing\), then every vertex from \(C_i\) to \(C_k\) contains \(\varnothing\). If \(C_k \cap C_i=S \neq \varnothing\), we know \(C_k \cap C_i \subseteq C_k \cap C_{\sigma(k)} \subseteq C_{\sigma(k)}\). Assume \(S \not\subseteq C_j\) for some \(j\), then \(C_i \cap C_{\sigma(k)} \subseteq C_{\sigma(k)} \not\subseteq C_j\), which is a contradction. Hence \(S \subseteq C_j\) for all \(j\), and thus \(\{C_1, \ldots, C_k\}\) is a junction tree.

\(\square\)

Corollary 1.2    If \(C_1, \ldots, C_k\) satisfy the running intersection property, then they satisfy the running intersection property starting with any node \(C_j\).

Definition 1.2    We associate each node \(C\) in junction tree with a potential \(\psi_C(x_C) \geq 0\), which is a function over the variables in the corresponding set. We say that two potentials are consistent if they have the same margin over their intersection, i.e., for \(\psi_C(x_C)\) and \(\psi_D(x_D)\), \[\sum_{x_{C-D}} \psi_C(x_C)=f(x_{C \cap D})=\sum_{x_{D-C}} \psi_D(x_D),\] where \(S=C \cap D\).

Theorem 1.3    Suppose \(C_1, \ldots, C_k\) satisfy the running intersection property with separator sets \(S_2, \ldots, S_k\), and let \[p(x_V)=\prod_{i=1}^k \frac{\psi_{C_i}(x_{C_i})}{\psi_{S_i}(x_{S_i})}\] for all \(x_V \in \mathcal{X}_V\) (\(S_1=\varnothing\) and \(\psi_{S_1}=1\), by convention). Then the potentials are all consistent iff \(\psi_{C_i}(x_{C_i})=p(x_{C_i})\) and \(\psi_{S_i}(x_{S_i})=p(x_{S_i})\) for all \(i=1, \ldots, k\).

Proof. (\(\Leftarrow\)) If \(\psi_{C_i}(x_{C_i})=p(x_{C_i})\), then \[\sum_{x_{C_i-C_j}} \psi_{C_i}(x_{C_i})=\sum_{x_{C_i-C_j}} p(x_{C_i})=p(x_{S_{ij}})=\sum_{x_{C_j-C_i}} p(x_{C_j})=\sum_{x_{C_j-C_i}} \psi_{C_j}(x_{C_j})\] for any \(C_i\) and \(C_j\), i.e., the potentials are all consistent.

(\(\Rightarrow\)) If \(k=1\), there is nothing to prove. Otherwise, let \(R_k=C_k-S_k\), then \[p(x_{V-R_k})=\sum_{x_{R_k}} p(x_V)=\prod_{i=1}^{k-1} \frac{\psi_{C_i}(x_{C_i})}{\psi_{S_i}(x_{S_i})} \cdot \frac{1}{\psi_{S_k}(x_{S_k})} \sum_{x_{R_k}} \psi_{C_k}(x_{C_k}).\] Since all potentials are consistent and \(\displaystyle S_k=C_k \cap \bigcup_{i<k} C_i\), we have \[\sum_{x_{R_k}} \psi_{C_k}(x_{C_k})=\sum_{x_{C_k-S_k}} \psi_{C_k}(x_{C_k})=\sum_{x_{S_k-C_k}} \psi_{S_k}(x_{S_k})=\sum_{x_\varnothing}\psi_{S_k}(x_{S_k})=\psi_{S_k}(x_{S_k})\] and thus \[p(x_{V-R_k})=\prod_{i=1}^{k-1} \frac{\psi_{C_i}(x_{C_i})}{\psi_{S_i}(x_{S_i})},\] i.e., \(p(x_{V-R_k})\) only has \(k-1\) cliques, then \(\psi_{C_i}(x_{C_i})=p(x_{C_i})\) and \(\psi_{S_i}(x_{S_i})=p(x_{S_i})\) for all \(i<k\) by the induction hypothesis.

By the running intersection property, \(S_k=C_k \cap C_{\sigma(k)}\), and hence by consistency \[\psi_{S_k}(x_{S_k})=\sum_{x_{C_{\sigma(k)}-S_k}} \psi_{C_{\sigma(k)}}(x_{C_{\sigma(k)}})=\sum_{x_{C_{\sigma(k)}-S_k}} p(x_{C_{\sigma(k)}})=p(x_{S_k}).\]

Then \[p(x_V)=p(x_{V-R_k})\frac{\psi_{C_k}(x_{C_k})}{\psi_{S_k}(x_{S_k})}=p(x_{V-R_k})\frac{\psi_{C_k}(x_{C_k})}{p(x_{S_k})},\] and thus \[p(x_{R_k} \mid x_{V-R_k})=\frac{\psi_{C_k}(x_{C_k})}{p(x_{S_k})},\] which only depends on \(x_{C_k}\), then \[p(x_{R_k} \mid x_{V-R_k})=p(x_{R_k} \mid x_{S_k})=\frac{\psi_{C_k}(x_{C_k})}{p(x_{S_k})},\] i.e., \(\psi_{C_k}(x_{C_k})=p(x_{C_k})\).

\(\square\)

The steps to forming a junction tree:

  1. Moralize.
  2. Drop directions.
  3. Triangulate (add edges to get a decomposable graph).

2. Message Passing

Definition 2.1    Suppose we have potentials \(\psi_C\), \(\psi_D\) and \(\psi_S\), where \(S=C \cap D\). Then passing a message from \(\psi_C\) to \(\psi_D\) involves:

  • \(\displaystyle \psi_S'(x_S)=\sum_{x_{C-S}}\psi_C(x_C)\) (margin of \(\psi_C\) over the variables in \(S\));

  • \(\displaystyle \psi_D'(x_D)=\psi_D(x_D)\frac{\psi_S'(x_S)}{\psi_S(x_S)}\).

We have three important properties:

  • \(\displaystyle \prod_{i=1}^k \frac{\psi_{C_i}(x_{C_i})}{\psi_{S_i}(x_{S_i})}\) is unchanged because \(\displaystyle \frac{\psi_D'(x_D)}{\psi_S'(x_S)}=\frac{\psi_D(x_D)}{\psi_S(x_S)}\);

  • \(\psi_{S}'\) and \(\psi_C\) are consistent since \(\displaystyle \sum_{x_{S-C}}\psi'_S(x_S)=\sum_{x_\varnothing}\psi'_S(x_S)=\psi'_S(x_S)=\sum_{x_{C-S}}\psi_C(x_C)\);

  • If \(\psi_D\) and \(\psi_S\) are consistent, so are \(\psi_D'\) and \(\psi_S'\), since \[\sum_{x_{D-S}} \psi_D'(x_D)=\frac{\psi_S'(x_S)}{\psi_S(x_S)} \sum_{x_{D-S}} \psi_D(x_D)=\psi_S'(x_S).\]

Hence, updating preserves the joint distribution and does not upset margins that are already consistent.

3. Junction Tree Algorithm

Given a tree, we can pick any vertex as a root, and direct all edges away from it.

Algorithm 1 Collect and Distribute Steps of the Junction Tree Algorithm.
    function Collect(rooted tree \(\mathcal{T}\), potentials \(\psi_t\))
        let \(1 < \cdots < k\) be a topological ordering of \(\mathcal{T}\);
        for \(t\) in \(k, \ldots, 2\) do
            send message from \(\psi_t\) to \(\psi_{\sigma(t)}\);
        end for
        return updated potentials \(\psi_t\)
    end function
    function Distribute(rooted tree \(\mathcal{T}\), potentials \(\psi_t\))
        let \(1 < \cdots < k\) be a topological ordering of \(\mathcal{T}\);
        for \(t\) in \(k, \ldots, 2\) do
            send message from \(\psi_{\sigma(t)}\) to \(\psi_t\);
        end for
        return updated potentials \(\psi_t\)
    end function

Theorem 3.1    Let \(\mathcal{T}\) be a junction tree with potentials \(\psi_{C_i}(x_{C_i})\). Then after running the junction tree algorithm, all potentials will be consistent.

Proof. Since we pass a message from \(\psi_{C_i}\) to \(\psi_{C_{\sigma(i)}}\) and by the property of message passing, \(\psi_{C_i}\) and \(\psi_{S_i}\) are consistent. By property, the consistency between \(\psi_{C_i}\) and \(\psi_{S_i}\) is preserved after future updates from \(\psi_{C_{\sigma(i)}}\). Similarly, distribution ensures \(\psi_{S_i}\) and \(\psi_{C_{\sigma(i)}}\) are consistent. Hence, all pairs of potentials are consistent.

\(\square\)

In practice, message passing is often done in parallel and we have the convergence after at most \(d\) iterations, where \(d\) is the length of the longest path of the tree.

Junction graphs in general (loopy belief propagation) also seem to lead to convergence, but it is still an open area of study.

4. Evidence

Denote evidence set \(E\), then \[p(x_{V-E} \mid x_E=x_E^*)=\frac{p(x_{V-E}, x_E^*)}{p(x_E^*)}.\] In junction tree situation, \[p(x_V)=\prod_i \frac{\psi_{C_i}(x_{C_i})}{\psi_{S_i}(x_{S_i})}.\] If there is a clique that contains the node \(E\) (\(E \subseteq C_i\)), then we replace \(\psi_{C_i}(x_{C_i})\) with \(\displaystyle \frac{\psi_{C_i}(x_{C_i})}{p(x_E^*)}\). After running the junction tree algorithm, we have \(\psi_{C_i}(x_{C_i})=p(x_{C_i} \mid x_E^*)\).