We have a true data distribution $p_{\text{data}}(x)$, and our goal is to make our model distribution $p_\theta(x)$ approximate $p_{\text{data}}(x)$. We often choose maximum likelihood estimation (MLE) as the training objective. But why does maximizing $\log p_\theta(x)$ make $p_\theta(x)$ approximate $p_{\text{data}}(x)$?
Usually, if we want to quantify the difference between two distributions, we choose KL divergence as the metric. In this case, we care about minimizing $\mathrm{KL}(p_{\text{data}} \,\|\, p_\theta)$, and
$$
\mathrm{KL}(p_{\text{data}} \,\|\, p_\theta)
= \mathbb{E}_{p_{\text{data}}(x)}\!\left[\log \frac{p_{\text{data}}(x)}{p_\theta(x)}\right]
= \mathbb{E}_{p_{\text{data}}(x)}[\log p_{\text{data}}(x)] - \mathbb{E}_{p_{\text{data}}(x)}[\log p_\theta(x)]
= \text{Constant} - \mathbb{E}_{p_{\text{data}}(x)}[\log p_\theta(x)]
$$

Therefore, minimizing $\mathrm{KL}(p_{\text{data}} \,\|\, p_\theta)$ is equivalent to maximizing $\mathbb{E}_{p_{\text{data}}(x)}[\log p_\theta(x)]$. That is,
$$
\min_\theta \ \mathrm{KL}(p_{\text{data}} \,\|\, p_\theta) \;\Longleftrightarrow\; \max_\theta \ \mathbb{E}_{p_{\text{data}}(x)}[\log p_\theta(x)]
$$

Notably, the expectation is taken over the whole data distribution, which we cannot evaluate directly.
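To see this equivalence concretely, here is a minimal numerical sketch. It assumes a hypothetical discrete $p_{\text{data}}$ over three outcomes and a toy one-parameter softmax model family (both purely illustrative, not from the text above), and checks that the $\theta$ minimizing the KL divergence is exactly the $\theta$ maximizing the expected log-likelihood.

```python
import numpy as np

# Hypothetical discrete data distribution over three outcomes.
p_data = np.array([0.2, 0.5, 0.3])

# Toy one-parameter model family: p_theta = softmax([0, theta, -theta]).
def p_theta(theta):
    logits = np.array([0.0, theta, -theta])
    z = np.exp(logits - logits.max())
    return z / z.sum()

thetas = np.linspace(-3.0, 3.0, 601)
kl = np.array([np.sum(p_data * np.log(p_data / p_theta(t))) for t in thetas])
exp_ll = np.array([np.sum(p_data * np.log(p_theta(t))) for t in thetas])

# The theta minimizing KL(p_data || p_theta) coincides with the theta
# maximizing E_{p_data}[log p_theta(x)], since they differ by a constant.
print(thetas[np.argmin(kl)], thetas[np.argmax(exp_ll)])
```

The two printed values match because, as derived above, the KL divergence and the expected log-likelihood differ only by a term that does not depend on $\theta$.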
In practice, we have a dataset $D = \{x_1, x_2, \ldots, x_n\}$ of samples drawn from $p_{\text{data}}$, and we use these samples to approximate the expected log-likelihood:

$$
\mathcal{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log p_\theta(x_i)
$$

According to the law of large numbers, $\mathcal{L}(\theta)$ converges to the expected log-likelihood as $n \to \infty$. So, given enough samples (and assuming the model family is expressive enough to contain $p_{\text{data}}$), the estimate $\hat{\theta}$ will converge to the true parameter $\theta_0$, which means $p_\theta(x)$ will converge to $p_{\text{data}}(x)$.
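As a quick sanity check of this consistency argument, here is a small sketch. It assumes, purely for illustration, that $p_{\text{data}}$ is a Gaussian with mean $\theta_0 = 2$ and unit variance, and that the model is a Gaussian with unknown mean; in that case maximizing $\mathcal{L}(\theta)$ gives the sample mean as the MLE, and the estimate approaches $\theta_0$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_0 = 2.0  # hypothetical true mean of p_data

# Model: Gaussian with unknown mean and unit variance. Maximizing
# L(theta) = (1/n) * sum_i log p_theta(x_i) yields theta_hat = mean(x).
for n in [10, 100, 1_000, 100_000]:
    x = rng.normal(loc=theta_0, scale=1.0, size=n)
    theta_hat = x.mean()  # the maximum-likelihood estimate
    print(f"n={n:>6}: theta_hat={theta_hat:.4f}  |error|={abs(theta_hat - theta_0):.4f}")
```

Running this shows the error shrinking as $n$ increases, which is the law-of-large-numbers behavior described above.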