
Monte Carlo Integration

1. Monte Carlo Integration

For \(f: \mathbb{X} \to \mathbb{R}\), we write \[I=\int_\mathbb{X} f(x)\text{d}x=\int_\mathbb{X} \varphi(x)\pi(x)\text{d}x=\mathbb{E}_\pi[\varphi(X)]\] where \(\pi\) is a probability density function on \(\mathbb{X}\), called the target distribution, and \(\displaystyle \varphi(x)=\frac{f(x)}{\pi(x)}\). The Monte Carlo method is to sample \(X_1, \ldots, X_n \overset{\text{i.i.d.}}\sim \pi\) and compute \[\widehat{I}_n=\frac{1}{n}\sum_{i=1}^n \varphi(X_i).\]
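
As a minimal sketch of this estimator (assuming NumPy, a toy integrand \(f(x)=e^{-x^2}\) on \(\mathbb{X}=[0,1]\), and the uniform density on \([0,1]\) as \(\pi\), so that \(\varphi=f\)):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy integrand f on X = [0, 1]; with pi = Uniform(0, 1), phi = f / pi = f.
def f(x):
    return np.exp(-x**2)

n = 100_000
x = rng.uniform(0.0, 1.0, size=n)   # X_1, ..., X_n i.i.d. ~ pi
I_hat = np.mean(f(x))               # (1/n) * sum_i phi(X_i)

print(I_hat)                        # close to the exact value, about 0.7468
```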

The Monte Carlo estimator is unbiased and strongly consistent (\(\widehat{I}_n \to I\) a.s.) by the law of large numbers (LLN), and by the central limit theorem (CLT) the random approximation error is \(\mathcal{O}(n^{-1/2})\), regardless of the dimension of the state space \(\mathbb{X}\).
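
A quick numerical check of the \(\mathcal{O}(n^{-1/2})\) rate on the same toy integral as above (the reference value in the comment is \(\tfrac{\sqrt{\pi}}{2}\operatorname{erf}(1)\)):

```python
import numpy as np

rng = np.random.default_rng(1)
true_I = 0.746824132812427  # int_0^1 exp(-x^2) dx = sqrt(pi)/2 * erf(1)

for n in [10**2, 10**4, 10**6]:
    x = rng.uniform(0.0, 1.0, size=n)
    err = abs(np.mean(np.exp(-x**2)) - true_I)
    # err shrinks roughly like n^{-1/2}: err * sqrt(n) stays of the same order
    print(n, err, err * np.sqrt(n))
```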

The Monte Carlo approach relies on samples \(X \sim \pi\), so using the Monte Carlo method to approximate \(\mathbb{E}_\pi[\varphi(X)]\) is equivalent to using a simulation method to sample from \(\pi\). For this reason, Monte Carlo sometimes refers to simulation methods in general.

2. Bayesian Statistics

In statistics, the data is usually a collection of \(n\) values \((y_1, \ldots, y_n) \in \mathcal{Y}^n\) in some space \(\mathcal{Y}\), typically \(\mathbb{R}^{d_y}\) for some \(d_y\). A statistical model considers the data to be a realization of r.v.s. \((Y_1, \ldots, Y_n)\) defined on the same space. Denote \((Y_1, \ldots, Y_n)\) by \(Y\), and \((y_1, \ldots, y_n)\) by \(y\). The distribution of these r.v.s., specified by the model, has a density written \(p_Y(y; \theta)\) w.r.t. a dominating measure, where \(\theta\) is the parameter of the model in some space \(\Theta\). The density of the observations, seen as a function of the parameter, is called the likelihood and denoted by \[\mathcal{L}_n: \theta \in \Theta \mapsto p_Y(y; \theta).\]
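
For concreteness, a sketch of a likelihood function under an illustrative i.i.d. Gaussian model with \(\theta=(\mu,\sigma)\) and made-up data:

```python
import numpy as np

# Made-up observations y = (y_1, ..., y_n), modeled as i.i.d. N(mu, sigma^2).
y = np.array([1.2, 0.4, -0.3, 2.1, 0.9])

def likelihood(mu, sigma):
    # L_n(theta) = p_Y(y; theta) for theta = (mu, sigma): a product of Gaussian densities
    dens = np.exp(-(y - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    return np.prod(dens)

print(likelihood(0.8, 1.0))
```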

In the frequentist approach, \(\theta\) is an unknown fixed value and inference is performed based on the likelihood function.

In the Bayesian approach, \(\theta\) is regarded as a realization of a r.v. \(\vartheta\), to which a prior probability distribution is assigned, with density \(p_\vartheta(\theta)\) w.r.t. a dominating measure denoted \(\text{d}\theta\) (the Lebesgue measure if \(\Theta=\mathbb{R}^{d_\theta}\) for some \(d_\theta\)). The distribution of \(Y\) given \(\vartheta=\theta\) can now be interpreted as a proper conditional distribution, with density denoted by \(p_{Y \mid \vartheta}(y \mid \theta)\).

Bayesian inference relies on the posterior density \[p_{\vartheta \mid Y}(\theta \mid y)=\frac{p_{Y \mid \vartheta}(y \mid \theta)p_\vartheta(\theta)}{p_Y(y)}, \] where \[p_Y(y)=\int_\Theta p_{Y \mid \vartheta}(y \mid \theta)p_\vartheta(\theta)\text{d}\theta\] is called the marginal likelihood or evidence.
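
As a sketch of this formula in a conjugate Beta-Bernoulli setting (hypothetical data; the evidence is computed by simple quadrature, and the result is checked against the known Beta posterior):

```python
import numpy as np
from scipy.stats import beta

# Hypothetical Bernoulli observations with unknown success probability theta,
# and a conjugate Beta(a, b) prior on theta.
y = np.array([1, 0, 1, 1, 0, 1, 1, 1])
a, b = 2.0, 2.0
n = len(y)

theta = np.linspace(0.0, 1.0, 1001)
prior = beta.pdf(theta, a, b)                            # p(theta)
lik = theta**y.sum() * (1 - theta)**(n - y.sum())        # p(y | theta)

# Evidence p(y) = int p(y | theta) p(theta) dtheta, here by a simple Riemann sum.
evidence = np.sum(lik * prior) * (theta[1] - theta[0])
posterior = lik * prior / evidence                       # p(theta | y)

# Conjugacy check: the posterior is Beta(a + sum(y), b + n - sum(y)).
exact = beta.pdf(theta, a + y.sum(), b + n - y.sum())
print(np.max(np.abs(posterior - exact)))                 # small numerical error
```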

The posterior distribution can also be used to predict new observations. Suppose we want to predict the next observation \(y_{n+1}\) given \(y=(y_1, \ldots, y_n)\); the predictive density of \(Y_{n+1}\) having observed \(Y=y\) is \[p_{Y_{n+1} \mid Y}(y_{n+1} \mid y)=\int_\Theta p_{Y_{n+1} \mid Y, \vartheta}(y_{n+1} \mid y, \theta)p_{\vartheta \mid Y}(\theta \mid y)\text{d}\theta.\]

The predictive density above takes into account the uncertainty about \(\theta\). If we had instead estimated \(\theta\) by some \(\widehat{\theta}\) and plugged that value into the conditional distribution of \(Y_{n+1}\), we would not have accounted for parameter uncertainty.
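
Continuing the Beta-Bernoulli sketch above, the posterior predictive probability of the next observation can be approximated by Monte Carlo over posterior draws and compared with a plug-in estimate:

```python
import numpy as np

rng = np.random.default_rng(2)

y = np.array([1, 0, 1, 1, 0, 1, 1, 1])
a, b = 2.0, 2.0
n = len(y)

# By conjugacy the posterior is Beta(a + sum(y), b + n - sum(y)).
theta_draws = rng.beta(a + y.sum(), b + n - y.sum(), size=100_000)

# P(Y_{n+1} = 1 | y) = E[ P(Y_{n+1} = 1 | theta) | y ] = E[theta | y].
pred_bayes = np.mean(theta_draws)              # Monte Carlo over posterior draws
pred_exact = (a + y.sum()) / (a + b + n)       # exact posterior mean
pred_plugin = y.mean()                         # plug-in with the MLE theta_hat = mean(y)

print(pred_bayes, pred_exact, pred_plugin)
```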

We can denote \(p(\theta)=p_\vartheta(\theta)\), \(p(y)=p_Y(y)\), \(p(\theta \mid y)=p_{\vartheta \mid Y}(\theta \mid y)\), etc., whenever there is no confusion.