Many tasks in probabilistic machine learning boil down to computing an expectation (integral).
Marginal likelihood for model selection:
\[p(D | M) = \int p(D | \theta_M, M) p(\theta_M | M)d\theta_M.\]
Predictive distribution:
\[p(D^* | D) = \int p(D^* | \theta) p(\theta | D) d\theta.\]
p-values:
\[P(\theta > \theta_0 | D) = \int_{\theta_0}^{\infty} p(\theta | D) d\theta.\]
They all involve computing an expectation.
Let \(X\) be a random variable with PDF \(p\). For a function \(g\), we aim to compute
\[\mathbb{E}_{X \sim p}[g(X)] = \int g(x) p(x) dx.\]
Given i.i.d. samples \(x_1, \dots, x_N \sim p\), the Monte Carlo estimator of \(I = \mathbb{E}_{X \sim p}[g(X)]\) is
\[\hat{I} = \frac{1}{N} \sum_{i=1}^{N} g(x_i).\]
Why is this valid? It is easy to show that
\[\mathbb{E}[\hat{I}] = I.\]
But does \(\hat{I}\) concentrate around \(I\) as we increase \(N\)?
(Weak law of large numbers) If \(X_1, X_2, \dots \sim p\) are i.i.d. with \(\mathbb{E}[X_i] = \mu\), then
\[\frac{1}{N} \sum_{i=1}^{N} X_i \to_p \mu\]
as \(N \to \infty\).
We can apply the LLN to the variables \(g(X_n)\) to show that \(\hat{I}\) converges in probability to \(\mathbb{E}[g(X)]\).
Note: the assumption is fairly weak; we only need \(\mathbb{E}[|g(X)|] < \infty\), i.e. that \(g(X)\) is integrable.
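As a concrete illustration, here is a minimal sketch of the Monte Carlo estimator; the choice \(g(x) = x^2\) with \(X \sim N(0,1)\) (true value \(\mathbb{E}[X^2] = 1\)) and \(N = 100{,}000\) samples are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
N = 100_000
x = rng.normal(size=N)        # i.i.d. draws from p = N(0, 1)
g = x ** 2                    # g(x) = x^2
print(np.mean(g))             # Monte Carlo estimate of E[g(X)]; should be close to 1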
Let \(X \sim N(0, 1)\). Estimate the tail probability:
\[P(X > x_0),\]
for some large \(x_0\).
Compute \(P(X > 10)\)?
import scipy.stats

x0 = 10
# Density of N(0, 1) at x0 = 10: tiny but still representable
y = scipy.stats.norm.pdf(x0, 0, 1)
print(y)
# Tail probability P(X > 10): 1 - cdf underflows to 0 in double precision
prob = 1 - scipy.stats.norm.cdf(x0, 0, 1)
print(prob)
7.69459862670642e-23
0.0
\[\begin{align} P(X > x_0) &= \mathbb{E}[1[X > x_0]] \\ &= \int 1[X > x_0] \phi(x) dx, \end{align}\]
where \(\phi(x)\) denotes the standard normal PDF. Let’s use Monte Carlo sampling:
\[\begin{align} \mathbb{E}[1[X > x_0]] \approx \frac{1}{N} \sum_i 1[x_i > x_0]. \end{align}\]
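A minimal sketch of this naive Monte Carlo attempt, assuming \(N = 10{,}000\) draws:

import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=10_000)   # draws from N(0, 1)
# Count how many samples land in the tail beyond x0 = 10
print("Number of samples > 10:", np.sum(samples > 10))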
Number of samples > 10: 0
\[\begin{align} \mathbb{E}[g(X)] &= \int g(x) f(x) dx\\ &= \int g(x) \frac{f(x)}{h(x)} h(x) dx \\ &= \int g(x) w(x) h(x) dx. \end{align}\]
Importance sampling: rather than sampling \(X_i \sim f\) (the target density), we sample from a more convenient proposal distribution with density \(h\) and reweight by \(w(x) = f(x)/h(x)\).
Condition: \(h(x) > 0\) wherever \(f(x) > 0\).
import numpy as np

# Proposal h = N(x0, 1); importance weights w(x) = f(x) / h(x)
x = np.random.normal(size=10000, loc=x0, scale=1)
weights = scipy.stats.norm.pdf(x, 0, 1) / scipy.stats.norm.pdf(x, x0, 1)
# Importance-sampling estimate of P(X > x0)
print(np.mean(weights * (x > x0)))
7.512162906570435e-24
How can we verify that this is correct?
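One way: SciPy's survival function computes the upper tail \(P(X > x_0)\) directly and stably, avoiding the cancellation in \(1 - \text{cdf}\).

import scipy.stats

# Survival function sf(x0) = P(X > x0)
print(scipy.stats.norm.sf(10))

This returns roughly \(7.6 \times 10^{-24}\), in close agreement with the importance-sampling estimate above.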
The variance of the importance-sampling estimator is finite only when \[ \mathbb{E}_{X \sim h}\left[g^2(X) \frac{f^2(X)}{h^2(X)}\right] < \infty.\]
Effective sample size:
\[ \text{ESS} = \frac{\left(\sum_n w_n\right)^2}{\sum_n w_n^2}.\]
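As a quick diagnostic, a sketch that reuses the `weights` array from the importance-sampling snippet above:

# Effective sample size of the importance weights
ess = np.sum(weights) ** 2 / np.sum(weights ** 2)
print(ess)

An ESS much smaller than the number of draws signals that a few large weights dominate the estimate.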
We now turn from integration to optimization: finding the parameter that minimizes a loss function \(L\),
\[\theta^* = \text{argmin}_{\theta \in \Theta} L(\theta).\]
\(\theta^*\) is a local minimum if \[ \exists \delta > 0, \forall \theta \in \Theta \text{ s.t. } \|\theta - \theta^*\| < \delta \Rightarrow L(\theta^*) \leq L(\theta)\]
Let \(g(\theta) = \nabla L(\theta)\) be the gradient vector of \(L\).
Let \(H(\theta) = \nabla^2 L(\theta)\) denote the Hessian matrix.
Let \(g^* = g(\theta)\big|_{\theta = \theta^*}\) and \(H^* = H(\theta)\big|_{\theta = \theta^*}\).
If \(\theta^*\) is a local optimum, then \(g^* = 0\) and \(H^*\) is positive semi-definite (necessary conditions).
If \(g^* = 0\) and \(H^*\) is positive definite, then \(\theta^*\) is a local optimum.
Note: zero gradient alone is not sufficient, since we could have a saddle point.
Positive definiteness of \(H^*\) (together with \(g^* = 0\)) ensures \(L(\theta^*) \leq L(\theta)\) in a neighborhood of \(\theta^*\).
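A small worked example (the function is chosen purely for illustration): for \(L(\theta) = \theta_1^2 + \theta_2^2\),
\[ g(\theta) = \begin{pmatrix} 2\theta_1 \\ 2\theta_2 \end{pmatrix}, \qquad H(\theta) = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}. \]
At \(\theta^* = (0, 0)\) we have \(g^* = 0\) and \(H^* = 2I\) is positive definite, so \(\theta^*\) is a local (in fact global) minimum.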
Class of algorithms that utilize the gradient vector.
The descent direction is \(d_t = -g_t\), giving the update \(\theta_{t+1} = \theta_t - \eta_t g_t\). Recall: the gradient points in the direction of maximal increase.
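A minimal sketch of gradient descent on a simple quadratic; the objective, step size, and iteration count are assumptions for illustration.

import numpy as np

def L(theta):
    return np.sum(theta ** 2)          # toy objective L(theta) = ||theta||^2
def grad(theta):
    return 2 * theta                   # its gradient

theta = np.array([3.0, -2.0])
eta = 0.1                              # fixed learning rate
for t in range(100):
    theta = theta - eta * grad(theta)  # step along the descent direction d_t = -g_t
print(theta, L(theta))                 # converges toward the minimizer at 0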
Utilize both the gradient and the Hessian to find the optimum.
(Newton’s method) At each iteration \(t\):
\[\theta_{t+1} = \theta_t - \eta H_t^{-1} g_t.\]
Because the Hessian encodes the local curvature of the function, multiplying the gradient by \(H^{-1}\) “preconditions” the search direction and counteracts the function’s curvature, yielding more direct steps toward the optimum.
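A minimal sketch of Newton's method on a two-dimensional quadratic; the objective and \(\eta = 1\) are assumptions (for a quadratic, a full Newton step reaches the minimizer in one iteration).

import numpy as np

A = np.array([[3.0, 0.5], [0.5, 1.0]])   # positive-definite quadratic: L = 0.5 * theta^T A theta
def grad(theta):
    return A @ theta                      # gradient
def hess(theta):
    return A                              # Hessian (constant for a quadratic)

theta = np.array([5.0, -3.0])
eta = 1.0
for t in range(5):
    # Newton step: solve H d = g instead of forming the inverse explicitly
    theta = theta - eta * np.linalg.solve(hess(theta), grad(theta))
print(theta)                              # converges to the minimizer at 0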
The goal is to minimize an expected loss with respect to some random variable \(z\): \[L(\theta) = \mathbb{E}_{z \sim q}[L(\theta, z)].\]
Example: Monte Carlo Expectation Maximization where the expectation is intractable.
We need the gradient of the expected loss, which in practice is replaced by an unbiased Monte Carlo estimate obtained by sampling \(z\):
\[g_t = \nabla_\theta \, \mathbb{E}_{z \sim q}[L(\theta, z)] \Big|_{\theta = \theta_t} \approx \nabla_\theta L(\theta_t, z_t), \quad z_t \sim q.\]
Update:
\[ \theta_{t+1} = \theta_t - \eta_t g_t.\]
In supervised learning, the loss is typically an average over \(N\) training examples,
\[L(\theta) = \frac{1}{N} \sum_{n=1}^{N} l(y_n, f_{\theta}(x_n)),\]
and the stochastic gradient is computed on a random minibatch \(B_t\) of size \(B\):
\[g_t = \frac{1}{B} \sum_{n \in B_t} \nabla l(y_n, f_{\theta_t}(x_n)).\]
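A minimal sketch of minibatch SGD for linear regression with squared loss; the synthetic data, batch size, and decaying step size are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 3
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)  # synthetic targets

theta = np.zeros(d)
B = 32                                        # minibatch size
for t in range(1, 501):
    idx = rng.choice(N, size=B, replace=False)    # random minibatch B_t
    resid = X[idx] @ theta - y[idx]
    g = X[idx].T @ resid / B                      # minibatch gradient of the squared loss
    eta = 1.0 / (t + 10)                          # decaying step size eta_t = 1 / (t + 10)
    theta = theta - eta * g
print(theta)                                  # approximately the true coefficients [1, -2, 0.5]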
To ensure convergence of stochastic gradient descent, the learning-rate schedule needs to satisfy the Robbins-Monro conditions:
\[\sum_{t=1}^{\infty} \eta_t = \infty \quad \text{and} \quad \sum_{t=1}^{\infty} \eta_t^2 < \infty,\]
that is, the steps must shrink (\(\eta_t \to 0\)), but not so fast that the total distance the iterates can travel is bounded.
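For example, a schedule \(\eta_t = \eta_0 / t\) satisfies both conditions:
\[ \sum_{t=1}^{\infty} \frac{\eta_0}{t} = \infty, \qquad \sum_{t=1}^{\infty} \frac{\eta_0^2}{t^2} = \frac{\pi^2 \eta_0^2}{6} < \infty. \]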
But these may not be practical choices for many problems.
(Momentum) Take larger steps in directions of continued movement; slow down when the gradients change abruptly.
\[\begin{align} m_t &= \beta m_{t-1} + g_{t-1} \\ \theta_t &= \theta_{t-1} - \eta_t m_t, \end{align}\]
\(0 < \beta < 1\), typically \(\beta \approx 0.9\).
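A minimal sketch of the momentum update on the same toy quadratic used above; \(\beta = 0.9\) and the step size are assumptions for illustration.

import numpy as np

def grad(theta):
    return 2 * theta                    # gradient of L(theta) = ||theta||^2

theta = np.array([3.0, -2.0])
m = np.zeros_like(theta)                # momentum buffer m_0 = 0
beta, eta = 0.9, 0.05
for t in range(200):
    m = beta * m + grad(theta)          # accumulate an exponentially weighted sum of gradients
    theta = theta - eta * m             # step along the momentum direction
print(theta)                            # converges toward the minimizer at 0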
None of these methods satisfies the Robbins-Monro conditions, so they are not guaranteed to converge; however, they work well in practice.