Importance Sampling

1. Importance Sampling

Suppose we want to compute \[I=\mathbb{E}_\pi[\varphi(X)]=\int_\mathbb{X} \varphi(x)\pi(x)\text{d}x\] where we do not know how to sample from the target \(\pi\) but have access to a proposal or importance distribution with density \(q\) satisfying \(\pi(x)>0 \Rightarrow q(x)>0\), i.e. the support of \(q\) includes the support of \(\pi\). Then we have \[I=\mathbb{E}_\pi[\varphi(X)]=\mathbb{E}_q[\varphi(X)w(X)]\] where \(w: \mathbb{X} \to \mathbb{R}^+\) is the importance weight function \[w(x)=\frac{\pi(x)}{q(x)}.\] Hence, for \(X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} q\), we define the importance sampling estimator \[\widehat{I}_n^\text{IS}=\frac{1}{n}\sum_{i=1}^n \varphi(X_i)w(X_i).\]

We have \[\begin{aligned}\mathbb{E}_\pi[\varphi(X)]&=\int_\mathbb{X} \varphi(x)\pi(x)\text{d}x\\&=\int_\mathbb{X} \varphi(x)\frac{\pi(x)}{q(x)}q(x)\text{d}x\\&=\int_\mathbb{X} \varphi(x)w(x)q(x)\text{d}x\\&=\mathbb{E}_q[\varphi(X)w(X)].\end{aligned}\]
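As a concrete illustration, here is a minimal sketch of plain importance sampling in Python. The target, proposal, and test function are arbitrary choices for the example, not part of the notes above: \(\pi=\mathcal{N}(0,1)\), \(q=\mathcal{N}(0,2^2)\), and \(\varphi(x)=x^2\), so the true value is \(I=\mathbb{E}_\pi[X^2]=1\).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Illustrative setup: target pi = N(0, 1), proposal q = N(0, 2^2),
# test function phi(x) = x^2, so the true value is I = E_pi[X^2] = 1.
phi = lambda x: x**2
pi_pdf = lambda x: norm.pdf(x, loc=0.0, scale=1.0)
q_pdf = lambda x: norm.pdf(x, loc=0.0, scale=2.0)

n = 100_000
x = rng.normal(loc=0.0, scale=2.0, size=n)   # X_1, ..., X_n ~ q
w = pi_pdf(x) / q_pdf(x)                     # importance weights w(X_i)
I_hat = np.mean(phi(x) * w)                  # IS estimator

print(I_hat)  # close to 1
```

Choosing a proposal with heavier tails than the target keeps the weights \(w(X_i)\) bounded here; the reverse choice would make them unbounded and the estimator's variance potentially infinite.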

1.1. Property

Proposition 1.1    The importance sampling estimator satisfies:

    (1) (Unbiasedness) \(\mathbb{E}_q[\widehat{I}_n^\text{IS}]=I\);

    (2) (Strong Consistency) if \(\mathbb{E}_q[|\varphi(X)|w(X)]<\infty\), then \(\displaystyle \lim_{n \to \infty} \widehat{I}_n^\text{IS}=I\ \text{a.s.}\);

    (3) (CLT) \(\displaystyle \mathbb{V}_q[\widehat{I}_n^\text{IS}]=\frac{\sigma_\text{IS}^2}{n}\) where \(\sigma_\text{IS}^2=\mathbb{V}_q[\varphi(X)w(X)]\); if \(\sigma_\text{IS}^2<\infty\), then \(\displaystyle \sqrt{n}(\widehat{I}_n^\text{IS}-I) \overset{\text{D}}{\to} \mathcal{N}(0, \sigma_\text{IS}^2)\) as \(n \to \infty\).

Proposition 1.2    The optimal proposal minimizing \(\mathbb{V}_q[\widehat{I}_n^\text{IS}]\) is given by \[q_\text{opt}(x)=\frac{|\varphi(x)|\pi(x)}{\int_\mathbb{X} |\varphi(x)|\pi(x)\text{d}x}.\]

Proof. We have \[\sigma_\text{IS}^2=\mathbb{V}_q[\varphi(X)w(X)]=\mathbb{E}_q[\varphi^2(X)w^2(X)]-I^2.\] By Jensen's inequality, we have \[\mathbb{E}_q[\varphi^2(X)w^2(X)]=\mathbb{E}_q[(|\varphi(X)|w(X))^2] \geq (\mathbb{E}_q[|\varphi(X)|w(X)])^2=\left(\int_\mathbb{X} |\varphi(x)|\pi(x)\text{d}x\right)^2,\] a lower bound that does not depend on \(q\). For \(q=q_\text{opt}\), since \(\varphi^2(x)/|\varphi(x)|=|\varphi(x)|\), we have \[\begin{aligned} \mathbb{E}_{q_\text{opt}}[\varphi^2(X)w^2(X)]&=\int_\mathbb{X} \varphi^2(x)w^2(x)q_\text{opt}(x)\text{d}x \\&=\int_\mathbb{X} \varphi^2(x)\frac{\pi^2(x)}{q_\text{opt}(x)}\text{d}x \\&=\int_\mathbb{X} \frac{\varphi^2(x)\pi^2(x)}{|\varphi(x)|\pi(x)}\text{d}x \int_\mathbb{X} |\varphi(x)|\pi(x)\text{d}x \\&=\left(\int_\mathbb{X} |\varphi(x)|\pi(x)\text{d}x\right)^2, \end{aligned}\] so the bound is attained. Therefore, \(q_\text{opt}\) minimizes \(\mathbb{V}_q[\widehat{I}_n^\text{IS}]\).

\(\square\)

However, \(q_\text{opt}\) can never be used in practice: its normalizing constant \(\int_\mathbb{X} |\varphi(x)|\pi(x)\text{d}x\) is an integral of exactly the kind we are trying to compute.

For \(\varphi(x)>0\), we have \(\displaystyle q_\text{opt}(x)=\frac{\varphi(x)\pi(x)}{I}\) and \(\mathbb{V}_{q_\text{opt}}[\widehat{I}_n^\text{IS}]=0\), since \[\varphi(x)w(x)=\varphi(x)\frac{\pi(x)}{q_\text{opt}(x)}=I,\] i.e. every weighted sample equals the constant \(I\); but constructing \(q_\text{opt}\) requires knowing \(I\) in the first place.
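A quick numerical check of this zero-variance phenomenon (again an illustrative setup, not from the notes): take \(\pi=\mathcal{N}(0,1)\) and \(\varphi(x)=x^2\), so \(I=1\) and \(q_\text{opt}(x)=x^2\pi(x)\). Sampling from \(q_\text{opt}\) happens to be easy here, because under \(q_\text{opt}\) the radius \(|X|\) follows a \(\chi\)-distribution with 3 degrees of freedom, i.e. the norm of 3 i.i.d. standard normals.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Illustrative setup: pi = N(0,1), phi(x) = x^2, so I = 1 and q_opt(x) = x^2 * pi(x) / I.
# Under q_opt, |X| ~ chi(3): the norm of 3 i.i.d. standard normals, with a random sign.
n = 5
y = np.linalg.norm(rng.standard_normal((n, 3)), axis=1) * rng.choice([-1.0, 1.0], size=n)

phi = lambda x: x**2
q_opt_pdf = lambda x: phi(x) * norm.pdf(x) / 1.0   # building q_opt already uses I = 1
w = norm.pdf(y) / q_opt_pdf(y)

print(phi(y) * w)   # every entry equals I = 1 exactly: zero variance
```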

2. Normalized Importance Sampling

Standard IS has limited applicability in statistics, as it requires evaluating \(\pi(x)\) and \(q(x)\) exactly, including their normalizing constants. Assume instead that \[\pi(x)=\frac{\widetilde{\pi}(x)}{Z_\pi} \quad \text{and} \quad q(x)=\frac{\widetilde{q}(x)}{Z_q}\] where \(\pi(x)>0 \Rightarrow q(x)>0\) and only the unnormalized densities \(\widetilde{\pi}\) and \(\widetilde{q}\) can be evaluated. Define \[\widetilde{w}(x)=\frac{\widetilde{\pi}(x)}{\widetilde{q}(x)};\] then we have \[I=\mathbb{E}_\pi[\varphi(X)]=\frac{\int_\mathbb{X} \varphi(x)\widetilde{w}(x)q(x)\text{d}x}{\int_\mathbb{X} \widetilde{w}(x)q(x)\text{d}x}.\]

We have \[\begin{aligned}\mathbb{E}_\pi[\varphi(X)]&=\int_\mathbb{X} \varphi(x)\pi(x)\text{d}x\\&=\frac{\int_\mathbb{X} \varphi(x)\frac{\pi(x)}{q(x)}q(x)\text{d}x}{\int_\mathbb{X} \frac{\pi(x)}{q(x)}q(x)\text{d}x}\\&=\frac{\int_\mathbb{X} \varphi(x)\frac{\widetilde{\pi}(x)/Z_\pi}{\widetilde{q}(x)/Z_q}q(x)\text{d}x}{\int_\mathbb{X} \frac{\widetilde{\pi}(x)/Z_\pi}{\widetilde{q}(x)/Z_q}q(x)\text{d}x}\\&=\frac{\int_\mathbb{X} \varphi(x)\widetilde{w}(x)q(x)\text{d}x}{\int_\mathbb{X} \widetilde{w}(x)q(x)\text{d}x}\\&=\frac{\mathbb{E}_q[\varphi(X)\widetilde{w}(X)]}{\mathbb{E}_q[\widetilde{w}(X)]}.\end{aligned}\]
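A minimal sketch of the self-normalized estimator, continuing the illustrative setup from above but now pretending the normalizing constants are unknown: \(\widetilde{\pi}(x)=e^{-x^2/2}\) (so \(Z_\pi=\sqrt{2\pi}\)) and \(\widetilde{q}(x)=e^{-x^2/8}\) (so \(Z_q=\sqrt{8\pi}\)), with \(\varphi(x)=x^2\) and true value \(I=1\).

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative setup: unnormalized target pi_tilde(x) = exp(-x^2 / 2)  (N(0,1) without
# its constant) and unnormalized proposal q_tilde(x) = exp(-x^2 / 8)  (N(0, 2^2)
# without its constant); phi(x) = x^2 has true value I = 1.
phi = lambda x: x**2
pi_tilde = lambda x: np.exp(-x**2 / 2.0)
q_tilde = lambda x: np.exp(-x**2 / 8.0)

n = 100_000
x = rng.normal(loc=0.0, scale=2.0, size=n)           # X_1, ..., X_n ~ q
w_tilde = pi_tilde(x) / q_tilde(x)                   # unnormalized weights
I_hat = np.sum(phi(x) * w_tilde) / np.sum(w_tilde)   # self-normalized IS estimator

print(I_hat)  # close to 1
```

The unknown constants \(Z_\pi\) and \(Z_q\) cancel in the ratio, which is exactly why only \(\widetilde{w}\) is needed.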

2.1. Property

Proposition 2.1 (Strong Consistency)    Let \(X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} q\) and assume that \(\mathbb{E}_q[|\varphi(X)|w(X)]<\infty\). Then \[\widehat{I}_n^\text{NIS}=\frac{\sum_{i=1}^n \varphi(X_i)\widetilde{w}(X_i)}{\sum_{i=1}^n \widetilde{w}(X_i)}\] is strongly consistent.

Proof. By the SLLN applied separately to the numerator and the denominator, we have \[\widehat{I}_n^\text{NIS}=\frac{\sum_{i=1}^n \varphi(X_i)\widetilde{w}(X_i)}{\sum_{i=1}^n \widetilde{w}(X_i)}=\frac{n^{-1}\sum_{i=1}^n \varphi(X_i)\widetilde{w}(X_i)}{n^{-1}\sum_{i=1}^n \widetilde{w}(X_i)} \xrightarrow{\text{a.s.}} \frac{\mathbb{E}_q[\varphi(X)\widetilde{w}(X)]}{\mathbb{E}_q[\widetilde{w}(X)]}=I.\]

\(\square\)

Proposition 2.2 (CLT)    If \(\mathbb{V}_q[\varphi(X)w(X)]<\infty\) and \(\mathbb{V}_q[w(X)]<\infty\), then \[\sqrt{n}(\widehat{I}_n^\text{NIS}-I) \overset{\text{D}}{\to} \mathcal{N}(0, \sigma_\text{NIS}^2)\] where \[\sigma_\text{NIS}^2=\mathbb{V}_q[w(X)(\varphi(X)-I)]=\int_\mathbb{X} \frac{\pi^2(x)(\varphi(x)-I)^2}{q(x)}\text{d}x.\]

Proof. With \(X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} q\), we have \[\sqrt{n}(\widehat{I}_n^\text{NIS}-I)=\frac{n^{-1/2}\sum_{i=1}^n \widetilde{w}(X_i)(\varphi(X_i)-I)}{n^{-1}\sum_{i=1}^n \widetilde{w}(X_i)}.\]

Since \[\mathbb{E}_q[\widetilde{w}(X)(\varphi(X)-I)]=\mathbb{E}_q[\widetilde{w}(X)\varphi(X)]-\frac{\mathbb{E}_q[\varphi(X)\widetilde{w}(X)]}{\mathbb{E}_q[\widetilde{w}(X)]}\mathbb{E}_q[\widetilde{w}(X)]=0\] and \(\mathbb{V}_q[\widetilde{w}(X)(\varphi(X)-I)]<\infty\) (which follows from \(\mathbb{V}_q[\varphi(X)w(X)]<\infty\) and \(\mathbb{V}_q[w(X)]<\infty\)), the CLT yields \[\frac{1}{\sqrt{n}}\sum_{i=1}^n \widetilde{w}(X_i)(\varphi(X_i)-I) \overset{\text{D}}{\to} \mathcal{N}\left(0, \mathbb{V}_q[\widetilde{w}(X)(\varphi(X)-I)]\right).\]

By SLLN, we have \[\frac{1}{n}\sum_{i=1}^n \widetilde{w}(X_i) \xrightarrow{\text{a.s.}} \mathbb{E}_q[\widetilde{w}(X)]=\frac{Z_\pi}{Z_q}.\]

Therefore, by Slutsky's theorem, we have \[\sqrt{n}(\widehat{I}_n^\text{NIS}-I)=\frac{n^{-1/2}\sum_{i=1}^n \widetilde{w}(X_i)(\varphi(X_i)-I)}{n^{-1}\sum_{i=1}^n \widetilde{w}(X_i)} \overset{\text{D}}{\to} \mathcal{N}\left(0, \frac{Z_q^2}{Z_\pi^2}\mathbb{V}_q[\widetilde{w}(X)(\varphi(X)-I)]\right)\] where \[\frac{Z_q^2}{Z_\pi^2}\mathbb{V}_q[\widetilde{w}(X)(\varphi(X)-I)]=\int_\mathbb{X} \frac{Z_q^2}{Z_\pi^2}\widetilde{w}^2(x)(\varphi(x)-I)^2q(x)\text{d}x=\int_\mathbb{X}\frac{\pi^2(x)(\varphi(x)-I)^2}{q(x)}\text{d}x=\sigma_\text{NIS}^2.\] Thus \[\sqrt{n}(\widehat{I}_n^\text{NIS}-I) \overset{\text{D}}{\to} \mathcal{N}(0, \sigma_\text{NIS}^2).\]

\(\square\)
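As a sanity check on Proposition 2.2, the following sketch (same illustrative setup as before; the replicate count \(M\) and sample size \(n\) are choices made here, not prescribed by the notes) repeats the self-normalized estimator \(M\) times; across replicates, \(\sqrt{n}(\widehat{I}_n^\text{NIS}-I)\) should look approximately \(\mathcal{N}(0,\sigma_\text{NIS}^2)\).

```python
import numpy as np

rng = np.random.default_rng(5)

# Empirical CLT check (illustrative): pi = N(0,1) unnormalized, q = N(0, 2^2),
# phi(x) = x^2, true value I = 1.
phi = lambda x: x**2
n, M, I = 1_000, 2_000, 1.0
z = np.empty(M)
for m in range(M):
    x = rng.normal(0.0, 2.0, size=n)
    w_tilde = np.exp(-x**2 / 2.0) / np.exp(-x**2 / 8.0)
    I_hat = np.sum(phi(x) * w_tilde) / np.sum(w_tilde)
    z[m] = np.sqrt(n) * (I_hat - I)

# z should have mean near 0, and its standard deviation estimates sigma_NIS.
print(z.mean(), z.std())
```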

3. Variance of Importance Sampling Estimator

Suppose \(X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} q\), then \[\widehat{I}_n^\text{IS}=\frac{1}{n}\sum_{i=1}^n \varphi(X_i)w(X_i) \quad \text{and} \quad \widehat{I}_n^\text{NIS}=\frac{\sum_{i=1}^n \varphi(X_i)\widetilde{w}(X_i)}{\sum_{i=1}^n \widetilde{w}(X_i)}.\]

  • The asymptotic variance for \(\widehat{I}_n^\text{IS}\) is \[\mathbb{V}_\text{as}[\widehat{I}_n^\text{IS}]=\mathbb{E}_q[(\varphi(X)w(X)-\mathbb{E}_q[\varphi(X)w(X)])^2] \approx \frac{1}{n}\sum_{i=1}^n (\varphi(X_i)w(X_i)-\widehat{I}_n^\text{IS})^2.\]

  • The asymptotic variance for \(\widehat{I}_n^\text{NIS}\) is \[\mathbb{V}_\text{as}[\widehat{I}_n^\text{NIS}]=\frac{\mathbb{E}_q[(w(X)(\varphi(X)-I))^2]}{(\mathbb{E}_q[w(X)])^2} \approx \frac{n^{-1} \sum_{i=1}^n \widetilde{w}^2(X_i)(\varphi(X_i)-\widehat{I}_n^\text{NIS})^2}{(n^{-1}\sum_{i=1}^n \widetilde{w}(X_i))^2}.\]
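Both plug-in estimates are one line each in code. A sketch on the same illustrative setup as before (\(\pi=\mathcal{N}(0,1)\), \(q=\mathcal{N}(0,2^2)\), \(\varphi(x)=x^2\)); dividing each asymptotic variance by \(n\) estimates the variance of the corresponding estimator itself.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# Illustrative setup: pi = N(0,1), q = N(0, 2^2), phi(x) = x^2, true value I = 1.
phi = lambda x: x**2
n = 100_000
x = rng.normal(0.0, 2.0, size=n)
w = norm.pdf(x, 0.0, 1.0) / norm.pdf(x, 0.0, 2.0)   # exact weights

# Plain IS: plug-in asymptotic variance; the estimator's variance is var_is / n.
I_is = np.mean(phi(x) * w)
var_is = np.mean((phi(x) * w - I_is)**2)

# Self-normalized IS: the exact weights also serve as unnormalized weights w_tilde.
I_nis = np.sum(phi(x) * w) / np.sum(w)
var_nis = np.mean(w**2 * (phi(x) - I_nis)**2) / np.mean(w)**2

print(I_is, var_is / n)
print(I_nis, var_nis / n)
```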

4. Diagnostic

We solve for \(n_e\) in \[\frac{1}{n}\mathbb{V}_\text{as}[\widehat{I}_n^\text{NIS}]=\frac{\sigma^2}{n_e}\] where \(\displaystyle \frac{\sigma^2}{n_e}\) corresponds to the variance of an ideal unweighted sample of size \(n_e\), i.e. \(n_e\) i.i.d. draws from \(\pi\) with \(\sigma^2=\mathbb{V}_\pi[\varphi(X)]\). Approximating \((\varphi(X_i)-\widehat{I}_n^\text{NIS})^2\) by the constant \(\sigma^2\) in the estimate of \(\mathbb{V}_\text{as}[\widehat{I}_n^\text{NIS}]\) above gives \[\frac{1}{n}\mathbb{V}_\text{as}[\widehat{I}_n^\text{NIS}] \approx \sigma^2\frac{\sum_{i=1}^n \widetilde{w}^2(X_i)}{\left(\sum_{i=1}^n \widetilde{w}(X_i)\right)^2},\] so the solution is \[n_e=\frac{(\sum_{i=1}^n \widetilde{w}(X_i))^2}{\sum_{i=1}^n \widetilde{w}^2(X_i)},\] which is called the effective sample size.
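The effective sample size is cheap to compute from the weights alone. A sketch, using illustrative weights from the same running example as above:

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative weights: pi = N(0,1) unnormalized, q = N(0, 2^2) unnormalized.
n = 100_000
x = rng.normal(0.0, 2.0, size=n)
w_tilde = np.exp(-x**2 / 2.0) / np.exp(-x**2 / 8.0)   # unnormalized weights

# Effective sample size: n_e = (sum w)^2 / sum(w^2). It equals n only for
# constant weights and shrinks toward 1 as the weights become more unequal.
n_e = np.sum(w_tilde)**2 / np.sum(w_tilde**2)
print(n_e)  # strictly less than n = 100_000
```

A small \(n_e/n\) is a warning that the proposal matches the target poorly and the weighted sample carries far less information than its nominal size suggests.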