How Log-Likelihood Helps Approximate the True Data Distribution?

We have a true data distribution $p_{\mathrm{data}}(x)$, and our goal is to make our model distribution $p_{\theta}(x)$ approximate $p_{\mathrm{data}}(x)$. We often choose maximum likelihood estimation (MLE) as the training objective. But why does maximizing $\log p_{\theta}(x)$ bring $p_{\theta}(x)$ closer to $p_{\mathrm{data}}(x)$?

Usually, when we want to describe or quantify the difference between two distributions, we use the KL divergence as the metric. In this case, we care about minimizing $\mathrm{KL}(p_{\mathrm{data}} \,\|\, p_{\theta})$, and

$$
\begin{align}
\mathrm{KL}(p_{\mathrm{data}} \,\|\, p_{\theta}) &= \mathbb{E}_{p_{\mathrm{data}}(x)} \left[ \log \frac{p_{\mathrm{data}}(x)}{p_{\theta}(x)} \right] \\
&= \mathbb{E}_{p_{\mathrm{data}}(x)} [\log p_{\mathrm{data}}(x)] - \mathbb{E}_{p_{\mathrm{data}}(x)} [\log p_{\theta}(x)] \\
&= \mathrm{Constant} - \mathbb{E}_{p_{\mathrm{data}}(x)} [\log p_{\theta}(x)]
\end{align}
$$

Since the first term depends only on $p_{\mathrm{data}}$ and not on $\theta$, minimizing $\mathrm{KL}(p_{\mathrm{data}} \,\|\, p_{\theta})$ is equivalent to maximizing $\mathbb{E}_{p_{\mathrm{data}}(x)} [\log p_{\theta}(x)]$. That is,

$$
\min_{\theta}\ \mathrm{KL}(p_{\mathrm{data}} \,\|\, p_{\theta}) \iff \max_{\theta}\ \mathbb{E}_{p_{\mathrm{data}}} [\log p_{\theta}]
$$
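To make this equivalence concrete, here is a small numerical check (not from the original post) on a three-outcome discrete sample space, with made-up distributions `p_data` and `p_theta`. It verifies that the KL divergence equals the constant $\mathbb{E}_{p_{\mathrm{data}}}[\log p_{\mathrm{data}}]$ minus the expected log-likelihood:

```python
import numpy as np

# Toy check on a 3-outcome sample space; both distributions are made up.
p_data = np.array([0.5, 0.3, 0.2])   # "true" data distribution
p_theta = np.array([0.4, 0.4, 0.2])  # model distribution

kl = np.sum(p_data * np.log(p_data / p_theta))   # KL(p_data || p_theta)
constant = np.sum(p_data * np.log(p_data))       # E_{p_data}[log p_data], independent of theta
expected_ll = np.sum(p_data * np.log(p_theta))   # E_{p_data}[log p_theta]

print(kl, constant - expected_ll)  # the two printed values are identical
```

Whatever `p_theta` we pick, only the `expected_ll` term changes, so driving the KL divergence down and driving the expected log-likelihood up are the same thing.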

Notably, this objective is an expectation over the whole data distribution, which we cannot evaluate exactly. In practice we have a dataset $\mathcal{D}=\{ x_{1},x_{2},\dots,x_{n} \}$ of samples drawn from $p_{\mathrm{data}}$, and we use them to approximate the expected log-likelihood:

$$
\mathcal{L}(\theta)=\frac{1}{n} \sum_{i=1}^{n} \log p_{\theta}(x_{i})
$$
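As a sketch of this sample average (my own illustration, assuming a 1-D Gaussian model family $\mathcal{N}(\mu, \sigma^2)$ and a synthetic dataset with made-up generating parameters), parameters closer to those of the data-generating distribution yield a higher value of $\mathcal{L}(\theta)$:

```python
import numpy as np

# Synthetic dataset standing in for D = {x_1, ..., x_n}; the generating
# parameters (loc=2.0, scale=1.5) and the candidates below are made up.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=1000)

def avg_log_likelihood(x, mu, sigma):
    """L(theta) = (1/n) * sum_i log N(x_i; mu, sigma^2)."""
    log_p = -0.5 * np.log(2 * np.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)
    return log_p.mean()

print(avg_log_likelihood(data, mu=2.0, sigma=1.5))  # higher: matches the data
print(avg_log_likelihood(data, mu=0.0, sigma=1.0))  # noticeably lower
```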

By the law of large numbers, $\mathcal{L}(\theta)$ converges to the expected log-likelihood $\mathbb{E}_{p_{\mathrm{data}}(x)}[\log p_{\theta}(x)]$ as $n\to \infty$. So, given enough samples, the maximum-likelihood estimate $\hat{\theta}$ converges to the true parameter $\theta_{0}$ (assuming the model family is expressive enough to contain $p_{\mathrm{data}}$), which means $p_{\theta}(x)$ converges to $p_{\mathrm{data}}(x)$.
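The convergence can be seen in a minimal sketch (again with made-up Gaussian parameters $\mu_{0}=2.0$, $\sigma_{0}=1.5$): for a Gaussian the MLE has a closed form, the sample mean and (biased) sample standard deviation, and as $n$ grows it approaches $(\mu_{0}, \sigma_{0})$.

```python
import numpy as np

# Made-up "true" Gaussian parameters for illustration only.
rng = np.random.default_rng(0)
mu_0, sigma_0 = 2.0, 1.5

for n in (10, 100, 10_000, 1_000_000):
    samples = rng.normal(mu_0, sigma_0, size=n)
    # Closed-form MLE for a Gaussian: sample mean and (biased) sample std.
    mu_hat, sigma_hat = samples.mean(), samples.std()
    print(f"n={n:>9,}  mu_hat={mu_hat:.3f}  sigma_hat={sigma_hat:.3f}")
# As n grows, (mu_hat, sigma_hat) approaches (mu_0, sigma_0) = (2.0, 1.5),
# i.e. p_theta converges to p_data within this model family.
```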
