1. Definition
Definition 1.1 Let \(\{p_\theta: \theta \in \Theta\}\) be a collection of probability densities over \(\mathcal{X}\). We say that \(\{p_\theta\}\) forms an exponential family if each density can be written as \[p_\theta(x)=\exp\left[\sum_{i=1}^p \theta_i\phi_i(x)-A(\theta)-C(x)\right].\] If \(\Theta\) is a non-empty open set, we say that the family is regular. The functions \(\phi_i\) are the sufficient statistics, and the components \(\theta_i\) are the canonical parameters or natural parameters. We can replace the sum with an inner product of the vectors \(\theta\) and \(\phi\): \[p_\theta(x)=\exp[\langle \theta, \phi(x) \rangle-A(\theta)-C(x)].\]
Note that \[1=\int_\mathcal{X} p_\theta(x)\text{d}x=\exp[-A(\theta)]\int_\mathcal{X} \exp[\langle \theta, \phi(x) \rangle-C(x)]\text{d}x.\] Hence, \[A(\theta)=\ln\int_\mathcal{X} \exp[\langle \theta, \phi(x) \rangle-C(x)]\text{d}x.\] The function \(A(\theta)\) is the cumulant function, and \(\exp[A(\theta)]\) is the partition function.
Theorem 1.1 We have \[\nabla_\theta A(\theta)=\mathbb{E}_\theta\phi(X) \quad \text{and} \quad \nabla\nabla^\top A(\theta)=\text{Cov}_\theta\phi(X).\] Consequently \(A(\theta)\) is convex, because \(\text{Cov}_\theta\phi(X) \succeq 0\).
Example 1.1 Let \(X \sim \text{Poisson}(\lambda)\). We have \[\ln f(x; \lambda)=-\lambda+x\ln\lambda-\ln x!,\] so \(\phi(x)=x\), \(\theta=\ln\lambda\), and \(A(\theta)=\lambda=e^\theta\). Then \[A'(\theta)=e^\theta=\mathbb{E}X\] and \[A''(\theta)=e^\theta=\text{Var}X.\]
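As a quick numerical illustration (a minimal sketch assuming `numpy` and `scipy` are available; the parameter value is arbitrary), we can check Theorem 1.1 for the Poisson family by differentiating \(A(\theta)=e^\theta\) numerically and comparing with the Poisson mean and variance:

```python
import numpy as np
from scipy.stats import poisson

# Poisson as an exponential family: theta = ln(lambda), A(theta) = exp(theta).
lam = 3.5
theta = np.log(lam)

A = lambda t: np.exp(t)          # cumulant function
eps = 1e-5
A1 = (A(theta + eps) - A(theta - eps)) / (2 * eps)               # ~ A'(theta)
A2 = (A(theta + eps) - 2 * A(theta) + A(theta - eps)) / eps**2   # ~ A''(theta)

print(A1, poisson.mean(lam))   # both approximately 3.5 = E[X]
print(A2, poisson.var(lam))    # both approximately 3.5 = Var[X]
```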
2. Empirical Moment Matching
To find the MLE in an exponential family, we maximize the log-likelihood (ignoring \(C\), since it is constant in \(\theta\)). We have \[\begin{aligned} l(\theta)&=\sum_{i=1}^n \langle \theta, \phi(X_i) \rangle -nA(\theta)+C \\&=\left\langle \theta, \sum_{i=1}^n \phi(X_i) \right\rangle-nA(\theta)+C \\&=n\langle \theta, \overline{\phi(X)} \rangle-nA(\theta)+C. \end{aligned}\] Then \[\nabla_\theta l(\theta)=n\overline{\phi(X)}-n\nabla_\theta A(\theta).\] Setting \(\nabla_\theta l(\theta)=0\) gives \[\nabla_\theta A(\theta)=\overline{\phi(X)}.\] Hence, we should choose \(\theta\) so that \(\mathbb{E}_\theta\phi(X)=\overline{\phi(X)}\), i.e., the mean of the sufficient statistic matches the empirical mean from the data.
Example 2.1 From Example 1.1, we know that \(\overline{\phi(X)}=\overline{X}\), so the MLE is \(\widehat{\lambda}=\overline{X}\).
Example 2.2 Let \(X \sim \text{Binomial}(n, p)\) with \(n\) fixed. We have \[\ln f(x; p)=x \ln p+(n-x)\ln(1-p)+C=x\ln\frac{p}{1-p}+n\ln(1-p)+C,\] so \(\phi(x)=x\), \(\displaystyle \theta=\ln\frac{p}{1-p}\), and \(A(\theta)=-n\ln(1-p)=n\ln(1+e^\theta)\). Then \[A'(\theta)=\frac{ne^\theta}{1+e^\theta}=np\] and \[A''(\theta)=\frac{ne^\theta}{(1+e^\theta)^2}=np(1-p).\]
Since \(\overline{\phi(X)}=\overline{X}\), moment matching gives \(n\widehat{p}=\overline{X}\), and thus the MLE is \(\displaystyle \widehat{p}=\frac{\overline{X}}{n}\).
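This can be checked numerically. Below is a minimal sketch (assuming `numpy` and `scipy`; the sample size and parameter values are made up) comparing the closed-form binomial MLE \(\widehat{p}=\overline{X}/n\) with a direct maximization of the log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import binom

rng = np.random.default_rng(0)
n_trials, p_true = 10, 0.3
X = rng.binomial(n_trials, p_true, size=500)   # i.i.d. Binomial(n, p) sample

# Moment matching: E_p[phi(X)] = n p  =>  p_hat = X_bar / n.
p_hat = X.mean() / n_trials

# Numerical check: maximize the binomial log-likelihood over p.
negloglik = lambda p: -binom.logpmf(X, n_trials, p).sum()
res = minimize_scalar(negloglik, bounds=(1e-6, 1 - 1e-6), method="bounded")

print(p_hat, res.x)   # the two estimates agree up to optimizer tolerance
```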
3. Multivariate Gaussian Distribution
Definition 3.1 Let \(X_V=(X_1, \ldots, X_p)^\top \in \mathbb{R}^p\) be a random vector. Let \(\mu \in \mathbb{R}^p\) and \(\Sigma \in \mathbb{R}^{p \times p}\) be a positive definite symmetric matrix. We say that \(X_V\) has a multivariate Gaussian distribution or multivariate normal distribution with parameters \(\mu\) and \(\Sigma\) if for all \(x_V \in \mathbb{R}^p\), the joint density is \[\begin{aligned} f(x_V; \mu, \Sigma)&=\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\left[-\frac{1}{2}(x_V-\mu)^\top\Sigma^{-1}(x_V-\mu)\right] \\&=\frac{1}{(2\pi)^{p/2}} \exp\left[-\frac{1}{2}x_V^\top Kx_V+\mu^\top Kx_V-\frac{1}{2}\mu^\top K\mu+\frac{1}{2}\ln|K|\right], \end{aligned}\] where the concentration matrix is \(K=\Sigma^{-1}\).
Note that \[\begin{aligned}-\frac{1}{2}x_V^\top Kx_V+\mu^\top Kx_V-\frac{1}{2}\mu^\top K\mu+\frac{1}{2}\ln|K|&=-\frac{1}{2}\text{Tr}(x_V^\top Kx_V)+\mu^\top Kx_V-A(\theta)\\&=-\frac{1}{2}\text{Tr}(Kx_Vx_V^\top)+\mu^\top Kx_V-A(\theta)\\&=-\frac{1}{2}\text{vec}(K)^\top\text{vec}(x_Vx_V^\top)+\mu^\top Kx_V-A(\theta).\end{aligned}\] Hence, the canonical parameter is \(\begin{bmatrix} -\frac{1}{2}\text{vec}(K) \\ K\mu \end{bmatrix}\) and the sufficient statistic is \(\begin{bmatrix} \text{vec}(x_Vx_V^\top) \\ x_V \end{bmatrix}\).
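As a sanity check of this canonical parameterization, the following sketch (assuming `numpy` and `scipy`; the dimension and random values are arbitrary) evaluates the second form of the density directly from \(K=\Sigma^{-1}\) and compares it with `scipy.stats.multivariate_normal`:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
p = 3
mu = rng.normal(size=p)
B = rng.normal(size=(p, p))
Sigma = B @ B.T + p * np.eye(p)        # positive definite covariance matrix
K = np.linalg.inv(Sigma)               # concentration matrix

x = rng.normal(size=p)

# Log-density written in terms of the canonical parameters (-vec(K)/2, K mu).
logdens = (-0.5 * x @ K @ x + mu @ K @ x
           - 0.5 * mu @ K @ mu + 0.5 * np.linalg.slogdet(K)[1]
           - 0.5 * p * np.log(2 * np.pi))

print(logdens, multivariate_normal(mu, Sigma).logpdf(x))   # should match
```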
4. Contingency Table
Definition 4.1 Consider a multivariate system \(X_V=(X_v: v \in V)\) for some set \(V=\{1, \ldots, p\}\), and write \(X_A=(X_v: v \in A)\) for any \(A \subseteq V\). We assume that each \(X_v \in \mathcal{X}_v=\{1, \ldots, d_v\}\) (usually \(d_v=2\)). If we have \(n\) i.i.d. observations, then for \(i=1, \ldots, n\), \[X_V^{(i)}=(X_1^{(i)}, \ldots, X_p^{(i)})^\top.\] Define \[n(x_V)=\sum_{i=1}^n \mathbb{1}\{X_1^{(i)}=x_1, \ldots, X_p^{(i)}=x_p\},\] i.e., the number of individuals who have the response pattern \(x_V\).
A marginal table only counts some of the variables: \[n(x_A)=\sum_{i=1}^n \mathbb{1}\{X_a^{(i)}=x_a: a \in A\}=\sum_{x_{V-A}} n(x_A, x_{V-A}).\]
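Here is a minimal sketch of how the full table \(n(x_V)\) and a marginal table \(n(x_A)\) can be computed from raw observations (assuming `numpy`; the simulated data and variable names are only illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.integers(1, 3, size=(n, p))   # n observations of p binary variables in {1, 2}

# Full table n(x_V): count each response pattern.
full_table = np.zeros((2, 2, 2), dtype=int)
for row in X:
    full_table[tuple(row - 1)] += 1

# Marginal table n(x_A) for A = {1, 2}: sum out the remaining variable.
marginal_12 = full_table.sum(axis=2)

print(full_table.sum(), n)   # the counts add up to the sample size
print(marginal_12)
```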
Example 4.1 Consider a contingency table:
(Contingency table of counts for \(X_1\) and \(X_2\): in the \(X_1=\mathrm{YE}\) row, the cell with \(X_2=\mathrm{NA}\) contains 28 and the other cell contains 18, giving a row total of 46; the remaining cells are not reproduced here.)
Suppose \(x_V=(\mathrm{YE}, \mathrm{NA})\); then \[n(x_V)=\sum_{i=1}^n \mathbb{1}\{X_1^{(i)}=\mathrm{YE}, X_2^{(i)}=\mathrm{NA}\}=28.\] Suppose \(A=\{1\}\) and \(x_A=(\mathrm{YE})\). Then \[n(x_A)=\sum_{i=1}^n \mathbb{1}\{X_1^{(i)}=\mathrm{YE}\}=46,\] or equivalently \[n(x_A)=\sum_{x_2} n(x_A, x_2)=28+18=46.\]
4.1. Probabilistic Model
Suppose we have i.i.d. data with \(P(X_V^{(i)}=x_V)=p(x_V)\) for each \(x_V \in \mathcal{X}_V\). The distribution of the counts is \[P(n(x_V): x_V \in \mathcal{X}_V)=\frac{n!}{\displaystyle \prod_{y_V \in \mathcal{X}_V}n(y_V)!}\prod_{y_V \in \mathcal{X}_V} p(y_V)^{n(y_V)},\] where \(p\) satisfies \(\displaystyle \sum_{x_V} p(x_V)=1\).
Note that \[\begin{aligned} \ln P(n(x_V))&=C+\sum_{y_V} n(y_V) \ln p(y_V) \\&=C+\sum_{y_V \neq 0_V} n(y_V) \ln\frac{p(y_V)}{p(0_V)}+n \ln p(0_V), \end{aligned}\] using \(\displaystyle \sum_{y_V} n(y_V)=n\). Since \(\displaystyle \sum_{x_V} p(x_V)=1\), the reference probability \(p(0_V)\) is determined by the remaining cells, so \(-n\ln p(0_V)\) plays the role of \(nA(\theta)\). Hence the multinomial model is an exponential family with \(\phi(\mathbf{n})=\mathbf{n}\) and \(\displaystyle \theta(x_V)=\ln\frac{p(x_V)}{p(0_V)}\) for \(x_V \neq 0_V\). The MLE for the multinomial is \(\displaystyle \widehat{\theta}(x_V)=\ln\frac{n(x_V)}{n(0_V)}\).
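A minimal sketch of these formulas for a \(2 \times 2\) table (assuming `numpy`; the counts are made up), taking the cell \((1,1)\) as the reference cell \(0_V\):

```python
import numpy as np

# Counts n(x_V) for a 2x2 table; the reference cell 0_V is taken to be (1, 1).
counts = np.array([[30.0, 20.0],
                   [25.0, 25.0]])

p_hat = counts / counts.sum()              # multinomial MLE of p(x_V)
theta_hat = np.log(counts / counts[0, 0])  # theta_hat(x_V) = ln n(x_V) / n(0_V)

print(p_hat)
print(theta_hat)   # theta_hat(0_V) = 0 by construction
```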
5. Log-Linear Model
Definition 5.1 Let \(p(x_V)>0\). The log-linear parameters \(\lambda_A(x_A)\) for \(A \subseteq V\) are defined by \[\ln p(x_V)=\sum_{A \subseteq V} \lambda_A(x_A)=\lambda_\varnothing+\lambda_1(x_1)+\cdots+\lambda_V(x_V),\] subject to the identifiability constraint that \(\lambda_A(x_A)=0\) whenever \(x_a=1\) for some \(a \in A\).
In the case of binary variables (i.e., each variable takes only two states, \(d_v=2\), or \(\mathcal{X}_v=\{1, 2\}\)), there is only one possible non-zero level for each log-linear parameter \(\lambda_A(x_A)\), namely \(x_A=(2, \ldots, 2)\). In this case, we write \(\lambda_A=\lambda_A(2, \ldots, 2)\).
Example 5.1 Suppose \(\pi_{xy}=p(x, y)\), and \(X\) and \(Y\) are binary. Then \(\ln \pi_{11}=\lambda_\varnothing\), \(\ln \pi_{21}=\lambda_\varnothing+\lambda_X\), \(\ln \pi_{12}=\lambda_\varnothing+\lambda_Y\), and \(\ln \pi_{22}=\lambda_\varnothing+\lambda_X+\lambda_Y+\lambda_{XY}\). Then \[\lambda_{XY}=\ln\frac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}}\] where \(\displaystyle \frac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}}\) is the odds ratio between \(X\) and \(Y\).
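The identity \(\displaystyle \lambda_{XY}=\ln\frac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}}\) can be verified by solving the log-linear expansion cell by cell; below is a minimal sketch (assuming `numpy`, with an arbitrary probability table):

```python
import numpy as np

# A 2x2 joint probability table pi[x-1, y-1] for binary X, Y.
pi = np.array([[0.3, 0.2],
               [0.1, 0.4]])

lam_empty = np.log(pi[0, 0])
lam_X = np.log(pi[1, 0]) - lam_empty
lam_Y = np.log(pi[0, 1]) - lam_empty
lam_XY = np.log(pi[1, 1]) - lam_empty - lam_X - lam_Y

log_odds_ratio = np.log(pi[0, 0] * pi[1, 1] / (pi[0, 1] * pi[1, 0]))
print(lam_XY, log_odds_ratio)   # identical
```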
Theorem 5.1 Let \(p>0\) be a discrete distribution. Then \(X_a \perp\!\!\!\!\perp X_b \mid X_{V-\{a, b\}}\) iff \(\lambda_{abC}=0\) for all \(C \subseteq V-\{a, b\}\).
Example 5.2 Suppose \(X_A \perp\!\!\!\!\perp X_B \mid X_C\), where \(V=A \cup B \cup C\), then \[p(x_C)p(x_A, x_B, x_C)=p(x_A, x_C)p(x_B, x_C),\] and \[\ln p(x_A, x_B, x_C)=\ln p(x_A, x_C)+\ln p(x_B, x_C)-\ln p(x_C).\] The log-linear expansion is \[\sum_{W \subseteq V} \lambda_W(x_W)=\sum_{W \subseteq A \cup C} \lambda_W^{AC}(x_W)+\sum_{W \subseteq B \cup C} \lambda_W^{BC}(x_W)-\sum_{W \subseteq C} \lambda_W^C(x_W).\]
Hence, \(\lambda_W(x_W)=\lambda_W^{AC}(x_W)\) for any \(W \subseteq A \cup C\) with \(W \cap A \neq \varnothing\), \(\lambda_W(x_W)=\lambda_W^{BC}(x_W)\) for any \(W \subseteq B \cup C\) with \(W \cap B \neq \varnothing\), and \(\lambda_W(x_W)=\lambda_W^{AC}(x_W)+\lambda_W^{BC}(x_W)-\lambda_W^C(x_W)\) for any \(W \subseteq C\).
Under this conditional independence, the log-linear parameters for \(p(x_V)\) are easily obtainable from those for \(p(x_A, x_C)\) and \(p(x_B, x_C)\).
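The following sketch illustrates Theorem 5.1 and this example on a \(2\times2\times2\) table (assuming `numpy`; the conditional probability tables are arbitrary): it constructs a distribution with \(X_1 \perp\!\!\!\!\perp X_2 \mid X_3\) and computes the log-linear parameters by alternating sums of log probabilities under the corner constraint, finding \(\lambda_{12}=\lambda_{123}=0\) while the other interactions need not vanish.

```python
import numpy as np
from itertools import chain, combinations

def subsets(s):
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

# 2x2x2 distribution with X1 _||_ X2 | X3:  p(x1, x2, x3) = p(x3) p(x1|x3) p(x2|x3).
p3 = np.array([0.4, 0.6])
p1_given_3 = np.array([[0.7, 0.3], [0.2, 0.8]])   # rows indexed by x3
p2_given_3 = np.array([[0.5, 0.5], [0.9, 0.1]])
p = np.einsum('k,ki,kj->ijk', p3, p1_given_3, p2_given_3)

logp = np.log(p)

def lam(A):
    """Log-linear parameter lambda_A (all variables in A at level 2), corner constraint at level 1."""
    total = 0.0
    for B in subsets(A):
        idx = [0, 0, 0]
        for a in B:
            idx[a] = 1              # level 2 for variables in B, level 1 elsewhere
        total += (-1) ** (len(A) - len(B)) * logp[tuple(idx)]
    return total

print(lam((0, 1)), lam((0, 1, 2)))   # both ~ 0 under X1 _||_ X2 | X3
print(lam((0, 2)), lam((1, 2)))      # generally non-zero
```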
HERE is a full demo of the log-linear model.
5.1. Poisson-Multinomial Equivalence
Theorem 5.2 Let \(X_i \sim \text{Poisson}(\mu_i)\) independently, and let \(\displaystyle N=\sum_{i=1}^k X_i\). Then \[N \sim \text{Poisson}(\mu)\] and \[(X_1, \ldots, X_k)^\top \mid N=n \sim \text{Multinom}(n, (\pi_1, \ldots, \pi_k)^\top),\] where \(\displaystyle \mu=\sum_{i=1}^k \mu_i\) and \(\displaystyle \pi_i=\frac{\mu_i}{\mu}\).
Proof. The Poisson likelihood is \[\begin{aligned} L(\mu_1, \ldots, \mu_k; x_1, \ldots, x_k) &\propto \prod_{i=1}^k e^{-\mu_i}\mu_i^{x_i} \\&=\prod_{i=1}^k e^{-\mu\pi_i}\mu^{x_i}\pi_i^{x_i} \\&=\mu^{\sum_i x_i}e^{-\mu\sum_i \pi_i}\prod_{i=1}^k \pi_i^{x_i} \\&=\mu^ne^{-\mu}\prod_{i=1}^k \pi_i^{x_i} \\&=L(\mu; n) \cdot L(\pi_1, \ldots, \pi_k; x_1, \ldots, x_k \mid n). \end{aligned}\]\(\square\)
The multinomial model can therefore be fitted as a Poisson GLM.
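As one possible illustration of this equivalence (a sketch assuming the `statsmodels` package; the counts are made up), fitting the saturated log-linear model for a \(2\times2\) table as a Poisson GLM recovers the log odds ratio as the interaction coefficient:

```python
import numpy as np
import statsmodels.api as sm

# Cell counts of a 2x2 contingency table, one entry per cell.
counts = np.array([30, 20, 25, 25])   # cells (1,1), (1,2), (2,1), (2,2)
x1 = np.array([0, 0, 1, 1])           # indicator of level 2 of X1
x2 = np.array([0, 1, 0, 1])           # indicator of level 2 of X2

# Saturated log-linear model: log mu = lam_0 + lam_1 x1 + lam_2 x2 + lam_12 x1*x2.
X = sm.add_constant(np.column_stack([x1, x2, x1 * x2]))
fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit()

lam_12 = fit.params[-1]
log_or = np.log(counts[0] * counts[3] / (counts[1] * counts[2]))
print(lam_12, log_or)   # the interaction coefficient recovers the log odds ratio
```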