Denoising Diffusion Probabilistic Model (DDPM)

A diffusion model is built from a sequence of incremental updates; assembled as a whole, these updates form an encoder-decoder structure. The transition from one state to the next is realized by a denoiser.

This structure is called the variational diffusion model. It has a sequence of states $\mathbf{x}_0, \mathbf{x}_1, \dots, \mathbf{x}_T$:

  • $\mathbf{x}_0$: the original image.
  • $\mathbf{x}_T$: the latent variable. We want $\mathbf{x}_T \sim \mathcal{N}(0, \mathbf{I})$.
  • $\mathbf{x}_1, \dots, \mathbf{x}_{T-1}$: the intermediate states.

1 Building Blocks

Transition Block

The $t$-th transition block consists of three states $\mathbf{x}_{t-1}$, $\mathbf{x}_t$, and $\mathbf{x}_{t+1}$.

  • The forward transition ($\mathbf{x}_{t-1} \to \mathbf{x}_t$) has transition distribution $p(\mathbf{x}_t|\mathbf{x}_{t-1})$. We approximate it by a Gaussian $q_{\phi}(\mathbf{x}_t|\mathbf{x}_{t-1})$.
  • The reverse transition ($\mathbf{x}_{t+1} \to \mathbf{x}_t$) has transition distribution $p(\mathbf{x}_t|\mathbf{x}_{t+1})$. We approximate it by another Gaussian $p_{\theta}(\mathbf{x}_t|\mathbf{x}_{t+1})$, realized by a neural network.


Initial Block: $\mathbf{x}_0$

We only need to worry about $p(\mathbf{x}_0|\mathbf{x}_1)$ and approximate it by a Gaussian $p_{\theta}(\mathbf{x}_0|\mathbf{x}_1)$, whose mean is computed by a neural network.


Final Block: $\mathbf{x}_T$

It should be a white Gaussian noise vector. The forward transition into it is approximated by $q_{\phi}(\mathbf{x}_T|\mathbf{x}_{T-1})$, which is a Gaussian.


Understanding the Transition Distribution $q_{\phi}(\mathbf{x}_t|\mathbf{x}_{t-1})$

In a DDPM, the transition distribution $q_{\phi}(\mathbf{x}_t|\mathbf{x}_{t-1})$ is defined as

$$q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_{t-1})\stackrel{\mathrm{def}}{=}\mathcal{N}(\mathbf{x}_t \mid \sqrt{\alpha_t}\,\mathbf{x}_{t-1},(1-\alpha_t)\mathbf{I})$$

  • The scaling factor $\sqrt{\alpha_t}$ makes sure that the variance magnitude is preserved, so that it neither explodes nor vanishes after many iterations.
  • Equivalently, $\mathbf{x}_t = \sqrt{\alpha_t}\,\mathbf{x}_{t-1} + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_{t-1}$, where $\boldsymbol{\epsilon}_{t-1} \sim \mathcal{N}(0,\mathbf{I})$.
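To make the forward transition concrete, here is a minimal NumPy sketch of one step $\mathbf{x}_{t-1}\to\mathbf{x}_t$; the function name `forward_step` and the value of `alpha_t` are illustrative, not from the original text.

```python
import numpy as np

def forward_step(x_prev, alpha_t, rng):
    """One forward DDPM transition:
    x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * eps,
    with eps ~ N(0, I)."""
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(alpha_t) * x_prev + np.sqrt(1.0 - alpha_t) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)          # a toy 4-pixel "image"
x1 = forward_step(x0, alpha_t=0.95, rng=rng)
```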

2 The magical scalars $\sqrt{\alpha_t}$ and $1-\alpha_t$

Why $\sqrt{\alpha_t}$ and $1-\alpha_t$? Let's define the transition distribution as

$$q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_{t-1})=\mathcal{N}(\mathbf{x}_t \mid a\mathbf{x}_{t-1},b^2\mathbf{I}),$$

where $a \in \mathbb{R}$ and $b \in \mathbb{R}$.

Proof. First, we have

$$\mathbf{x}_t=a\mathbf{x}_{t-1}+b\boldsymbol{\epsilon}_{t-1},\quad\text{where}\quad\boldsymbol{\epsilon}_{t-1}\sim\mathcal{N}(0,\mathbf{I}).$$

Then, we carry on the recursion:

$$\begin{aligned} \mathbf{x}_{t}&=a\mathbf{x}_{t-1}+b\boldsymbol{\epsilon}_{t-1} \\ &=a(a\mathbf{x}_{t-2}+b\boldsymbol{\epsilon}_{t-2})+b\boldsymbol{\epsilon}_{t-1} \\ &=a^2\mathbf{x}_{t-2}+ab\boldsymbol{\epsilon}_{t-2}+b\boldsymbol{\epsilon}_{t-1} \\ &\;\;\vdots \\ &=a^t\mathbf{x}_0+b\underbrace{\left[\boldsymbol{\epsilon}_{t-1}+a\boldsymbol{\epsilon}_{t-2}+a^2\boldsymbol{\epsilon}_{t-3}+\dots+a^{t-1}\boldsymbol{\epsilon}_0\right]}_{\stackrel{\mathrm{def}}{=}\mathbf{w}_t}. \end{aligned}$$

It is clear that $\mathbb{E}[\mathbf{w}_t]=0$. The covariance matrix is

$$\begin{aligned} \mathrm{Cov}[\mathbf{w}_{t}]&\stackrel{\mathrm{def}}{=} b^{2}(\mathrm{Cov}(\boldsymbol{\epsilon}_{t-1})+a^{2}\mathrm{Cov}(\boldsymbol{\epsilon}_{t-2})+\dots+(a^{t-1})^{2}\mathrm{Cov}(\boldsymbol{\epsilon}_{0})) \\ &=b^2(1+a^2+a^4+\dots+a^{2(t-1)})\mathbf{I} \\ &=b^2\cdot\frac{1-a^{2t}}{1-a^2}\mathbf{I}. \end{aligned}$$

As $t\to\infty$, $a^t\to 0$ for any $0<a<1$. Therefore, in the limit, $\lim_{t\to\infty}\mathrm{Cov}[\mathbf{w}_t]=\frac{b^2}{1-a^2}\mathbf{I}$. So, if we want $\lim_{t\to\infty}\mathrm{Cov}[\mathbf{w}_t]=\mathbf{I}$ (so that the distribution of $\mathbf{x}_t$ approaches $\mathcal{N}(0,\mathbf{I})$), then $b=\sqrt{1-a^{2}}$.

Now, if we let $a=\sqrt{\alpha}$, then $b=\sqrt{1-\alpha}$. This will give us

$$\mathbf{x}_t=\sqrt{\alpha}\,\mathbf{x}_{t-1}+\sqrt{1-\alpha}\,\boldsymbol{\epsilon}_{t-1},$$

or equivalently, $q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_{t-1})=\mathcal{N}(\mathbf{x}_{t}\mid\sqrt{\alpha}\,\mathbf{x}_{t-1},(1-\alpha)\mathbf{I})$. You can replace $\alpha$ by $\alpha_t$ if you prefer a noise scheduler.
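The variance-preservation argument can be checked numerically: iterating $\mathrm{Var}[\mathbf{x}_t]=a^2\,\mathrm{Var}[\mathbf{x}_{t-1}]+b^2$ with $a=\sqrt{\alpha}$ and $b=\sqrt{1-\alpha}$ drives the variance to exactly 1. This is a sketch; the choice $\alpha=0.9$ is arbitrary.

```python
import numpy as np

alpha = 0.9
a, b = np.sqrt(alpha), np.sqrt(1.0 - alpha)

var = 0.0                       # x_0 treated as deterministic
for _ in range(1000):
    var = a**2 * var + b**2     # Var[x_t] = a^2 Var[x_{t-1}] + b^2

# Fixed point: b^2 / (1 - a^2) = 1, so the variance converges to 1.
assert abs(var - 1.0) < 1e-9
```

With any other $b$ the fixed point $b^2/(1-a^2)$ differs from 1, which is exactly why the pair $(\sqrt{\alpha}, \sqrt{1-\alpha})$ is the natural choice.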

3 Distribution $q_{\phi}(\mathbf{x}_t|\mathbf{x}_0)$

The conditional distribution $q_{\phi}(\mathbf{x}_t|\mathbf{x}_0)$ is

$$q_{\boldsymbol{\phi}}(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t}\mid\sqrt{\overline{\alpha}_{t}}\,\mathbf{x}_{0},(1-\overline{\alpha}_{t})\mathbf{I}),$$

where $\overline{\alpha}_{t}=\prod_{i=1}^{t}\alpha_{i}$.

Proof.

$$\begin{aligned}\mathbf{x}_{t}&=\sqrt{\alpha_t}\mathbf{x}_{t-1}+\sqrt{1-\alpha_t}\boldsymbol{\epsilon}_{t-1}\\&=\sqrt{\alpha_t}(\sqrt{\alpha_{t-1}}\mathbf{x}_{t-2}+\sqrt{1-\alpha_{t-1}}\boldsymbol{\epsilon}_{t-2})+\sqrt{1-\alpha_t}\boldsymbol{\epsilon}_{t-1}\\&=\sqrt{\alpha_t\alpha_{t-1}}\mathbf{x}_{t-2}+\underbrace{\sqrt{\alpha_t}\sqrt{1-\alpha_{t-1}}\boldsymbol{\epsilon}_{t-2}+\sqrt{1-\alpha_t}\boldsymbol{\epsilon}_{t-1}}_{\mathbf{w}_1}.\end{aligned}$$

The covariance of the new noise term is

$$\begin{aligned}\mathbb{E}[\mathbf{w}_{1}\mathbf{w}_{1}^{T}]&=[(\sqrt{\alpha_{t}}\sqrt{1-\alpha_{t-1}})^{2}+(\sqrt{1-\alpha_{t}})^{2}]\mathbf{I}\\&=[\alpha_t(1-\alpha_{t-1})+1-\alpha_t]\mathbf{I}=[1-\alpha_t\alpha_{t-1}]\mathbf{I}.\end{aligned}$$

So the recursion is updated to a linear combination of $\mathbf{x}_{t-2}$ and a fresh noise vector $\boldsymbol{\epsilon}_{t-2}$:

$$\begin{aligned}\mathbf{x}_{t}&=\sqrt{\alpha_{t}\alpha_{t-1}}\mathbf{x}_{t-2}+\sqrt{1-\alpha_{t}\alpha_{t-1}}\boldsymbol{\epsilon}_{t-2}\\&=\sqrt{\alpha_{t}\alpha_{t-1}\alpha_{t-2}}\mathbf{x}_{t-3}+\sqrt{1-\alpha_{t}\alpha_{t-1}\alpha_{t-2}}\boldsymbol{\epsilon}_{t-3}\\&\;\;\vdots\\&=\sqrt{\prod_{i=1}^t\alpha_i}\,\mathbf{x}_0+\sqrt{1-\prod_{i=1}^t\alpha_i}\,\boldsymbol{\epsilon}_0.\end{aligned}$$

So, if we define $\overline{\alpha}_t=\prod_{i=1}^t\alpha_i$, we can show that

$$\mathbf{x}_{t}=\sqrt{\overline{\alpha}_{t}}\,\mathbf{x}_{0}+\sqrt{1-\overline{\alpha}_{t}}\,\boldsymbol{\epsilon}_{0}.$$

In other words, the distribution $q_{\boldsymbol{\phi}}(\mathbf{x}_{t}|\mathbf{x}_{0})$ is

$$\mathbf{x}_t\sim q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_0)=\mathcal{N}(\mathbf{x}_t\mid\sqrt{\overline{\alpha}_t}\,\mathbf{x}_0,(1-\overline{\alpha}_t)\mathbf{I}).$$
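The closed-form marginal can be sanity-checked by propagating the per-step variance recursion $\mathrm{Var}_t=\alpha_t\,\mathrm{Var}_{t-1}+(1-\alpha_t)$ and comparing it against $1-\overline{\alpha}_T$. A sketch with a made-up schedule:

```python
import numpy as np

rng = np.random.default_rng(1)
alphas = rng.uniform(0.9, 0.999, size=50)   # an arbitrary schedule alpha_1..alpha_T
alpha_bar = np.cumprod(alphas)

var = 0.0                                   # x_0 deterministic => Var_0 = 0
for a in alphas:
    var = a * var + (1.0 - a)               # one forward step adds (1 - a) of noise

# Closed form: Var_T = 1 - alpha_bar_T.
assert abs(var - (1.0 - alpha_bar[-1])) < 1e-12
```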


4 Evidence Lower Bound

The ELBO for the variational diffusion model is

$$\begin{aligned} \mathrm{ELBO}_{\boldsymbol{\phi},\boldsymbol{\theta}}(\mathbf{x})= &\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1}|\mathbf{x}_{0})}\bigg[\log\underbrace{p_{\boldsymbol{\theta}}(\mathbf{x}_{0}|\mathbf{x}_{1})}_{\text{how good the initial block is}}\bigg] \\ &-\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{T-1}|\mathbf{x}_{0})}\left[\underbrace{\mathbb{D}_{\mathrm{KL}}\left(q_{\boldsymbol{\phi}}(\mathbf{x}_{T}|\mathbf{x}_{T-1})\|p(\mathbf{x}_{T})\right)}_{\text{how good the final block is}}\right] \\ &-\sum_{t=1}^{T-1}\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1},\mathbf{x}_{t+1}|\mathbf{x}_{0})}\Big[\underbrace{\mathbb{D}_{\mathrm{KL}}\Big(q_{\boldsymbol{\phi}}(\mathbf{x}_{t}|\mathbf{x}_{t-1})\|p_{\boldsymbol{\theta}}(\mathbf{x}_{t}|\mathbf{x}_{t+1})\Big)}_{\text{how good the transition blocks are}}\Big] \end{aligned}$$

Interpretation of the ELBO:

  • Reconstruction. Measured by the log-likelihood $p_{\theta}(\mathbf{x}_0|\mathbf{x}_1)$.
  • Prior Matching. Measured by the KL divergence between $q_{\phi}(\mathbf{x}_T|\mathbf{x}_{T-1})$ and $p(\mathbf{x}_T)$.
  • Consistency. The forward transition is determined by the distribution $q_{\phi}(\mathbf{x}_t|\mathbf{x}_{t-1})$, whereas the reverse transition is determined by the neural network $p_{\boldsymbol{\theta}}(\mathbf{x}_t|\mathbf{x}_{t+1})$. The consistency term uses the KL divergence to measure the deviation between them.

Proof.

$$\begin{aligned} \log p(\mathbf{x})&=\log p(\mathbf{x}_{0}) \\ &=\log\int p(\mathbf{x}_{0:T})\,d\mathbf{x}_{1:T} \\ &=\log\int p(\mathbf{x}_{0:T})\frac{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_{0})}{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_{0})}d\mathbf{x}_{1:T} \\ &=\log\int q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_{0})\left[\frac{p(\mathbf{x}_{0:T})}{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_{0})}\right]d\mathbf{x}_{1:T} \\ &=\log\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_{0})}\left[\frac{p(\mathbf{x}_{0:T})}{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_{0})}\right]\\ &\geq \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_{0})}\left[\log\frac{p(\mathbf{x}_{0:T})}{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_{0})}\right] \end{aligned}$$

The last inequality follows from Jensen's inequality. Note that

$$p(\mathbf{x}_{0:T})=p(\mathbf{x}_T)\prod_{t=1}^Tp(\mathbf{x}_{t-1}|\mathbf{x}_t)=p(\mathbf{x}_T)p(\mathbf{x}_0|\mathbf{x}_1)\prod_{t=2}^Tp(\mathbf{x}_{t-1}|\mathbf{x}_t),$$

$$q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)=\prod_{t=1}^Tq_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_{t-1})=q_{\boldsymbol{\phi}}(\mathbf{x}_T|\mathbf{x}_{T-1})\prod_{t=1}^{T-1}q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_{t-1}).$$

Then

$$\begin{aligned} \log p(\mathbf{x})&\geq\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_{0})}\left[\log\frac{p(\mathbf{x}_{0:T})}{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_{0})}\right] \\ &=\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log\frac{p(\mathbf{x}_T)p(\mathbf{x}_0|\mathbf{x}_1)\prod_{t=2}^Tp(\mathbf{x}_{t-1}|\mathbf{x}_t)}{q_{\boldsymbol{\phi}}(\mathbf{x}_T|\mathbf{x}_{T-1})\prod_{t=1}^{T-1}q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_{t-1})}\right] \\ &=\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log\frac{p(\mathbf{x}_T)p(\mathbf{x}_0|\mathbf{x}_1)\prod_{t=1}^{T-1}p(\mathbf{x}_t|\mathbf{x}_{t+1})}{q_{\boldsymbol{\phi}}(\mathbf{x}_T|\mathbf{x}_{T-1})\prod_{t=1}^{T-1}q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_{t-1})}\right] \\ &=\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_{0})}\left[\log\frac{p(\mathbf{x}_{T})p(\mathbf{x}_{0}|\mathbf{x}_{1})}{q_{\boldsymbol{\phi}}(\mathbf{x}_{T}|\mathbf{x}_{T-1})}\right]+\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_{0})}\left[\log\prod_{t=1}^{T-1}\frac{p(\mathbf{x}_{t}|\mathbf{x}_{t+1})}{q_{\boldsymbol{\phi}}(\mathbf{x}_{t}|\mathbf{x}_{t-1})}\right] \end{aligned}$$

The Reconstruction term can be simplified as

$$\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\Big[\log p(\mathbf{x}_0|\mathbf{x}_1)\Big]=\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_1|\mathbf{x}_0)}\Big[\log p(\mathbf{x}_0|\mathbf{x}_1)\Big],$$

where we used the fact that, since the integrand depends only on $\mathbf{x}_1$, the expectation over $q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)$ reduces to one over $q_{\boldsymbol{\phi}}(\mathbf{x}_1|\mathbf{x}_0)$. The Prior Matching term is

$$\begin{aligned}\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_{0})}\left[\log\frac{p(\mathbf{x}_{T})}{q_{\boldsymbol{\phi}}(\mathbf{x}_{T}|\mathbf{x}_{T-1})}\right]&=\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{T},\mathbf{x}_{T-1}|\mathbf{x}_{0})}\left[\log\frac{p(\mathbf{x}_{T})}{q_{\boldsymbol{\phi}}(\mathbf{x}_{T}|\mathbf{x}_{T-1})}\right]\\&=-\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{T-1},\mathbf{x}_{T}|\mathbf{x}_{0})}\bigg[\mathbb{D}_{\mathrm{KL}}\left(q_{\boldsymbol{\phi}}(\mathbf{x}_{T}|\mathbf{x}_{T-1})\|p(\mathbf{x}_{T})\right)\bigg].\end{aligned}$$

Finally, we can show that

$$\begin{aligned} \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log\prod_{t=1}^{T-1}\frac{p(\mathbf{x}_t|\mathbf{x}_{t+1})}{q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_{t-1})}\right] &=\sum_{t=1}^{T-1}\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_{0})}\left[\log\frac{p(\mathbf{x}_{t}|\mathbf{x}_{t+1})}{q_{\boldsymbol{\phi}}(\mathbf{x}_{t}|\mathbf{x}_{t-1})}\right] \\ &=\sum_{t=1}^{T-1}\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1},\mathbf{x}_{t},\mathbf{x}_{t+1}|\mathbf{x}_{0})}\left[\log\frac{p(\mathbf{x}_{t}|\mathbf{x}_{t+1})}{q_{\boldsymbol{\phi}}(\mathbf{x}_{t}|\mathbf{x}_{t-1})}\right] \\ &=\underbrace{-\sum_{t=1}^{T-1}\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1},\mathbf{x}_{t+1}|\mathbf{x}_{0})}\left[\mathbb{D}_{\mathrm{KL}}\left(q_{\boldsymbol{\phi}}(\mathbf{x}_{t}|\mathbf{x}_{t-1})\|p(\mathbf{x}_{t}|\mathbf{x}_{t+1})\right)\right]}_{\text{consistency}}. \end{aligned}$$

By replacing $p(\mathbf{x}_0|\mathbf{x}_1)$ with $p_{\boldsymbol{\theta}}(\mathbf{x}_0|\mathbf{x}_1)$ and $p(\mathbf{x}_t|\mathbf{x}_{t+1})$ with $p_{\boldsymbol{\theta}}(\mathbf{x}_t|\mathbf{x}_{t+1})$, we are done.

5 Rewrite the Consistency Term

The difficulty above is that we need to draw samples $(\mathbf{x}_{t-1},\mathbf{x}_{t+1})$ from the joint distribution $q_{\phi}(\mathbf{x}_{t-1},\mathbf{x}_{t+1}|\mathbf{x}_0)$.

By Bayes' theorem:

$$q(\mathbf{x}_t|\mathbf{x}_{t-1})=\frac{q(\mathbf{x}_{t-1}|\mathbf{x}_t)q(\mathbf{x}_t)}{q(\mathbf{x}_{t-1})}\quad\stackrel{\text{condition on }\mathbf{x}_0}{\Longrightarrow}\quad q(\mathbf{x}_t|\mathbf{x}_{t-1},\mathbf{x}_0)=\frac{q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)q(\mathbf{x}_t|\mathbf{x}_0)}{q(\mathbf{x}_{t-1}|\mathbf{x}_0)}.$$

A natural option is therefore to calculate the KL divergence between $q_{\phi}(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)$ and $p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t)$.

One might ask: why don't we change $p_{\theta}$ instead? The reason is that we do not know the distribution of $\mathbf{x}_0$, but we do know that $\mathbf{x}_T$ is (approximately) $\mathcal{N}(0,\mathbf{I})$. This forces $p_{\theta}$ to be the reverse process, running from $\mathbf{x}_T$ back to $\mathbf{x}_0$.

Then, the ELBO for a variational diffusion model is

$$\begin{aligned}\mathrm{ELBO}_{\boldsymbol{\phi},\boldsymbol{\theta}}(\mathbf{x})&=\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1}|\mathbf{x}_{0})}[\log\underbrace{p_{\boldsymbol{\theta}}(\mathbf{x}_{0}|\mathbf{x}_{1})}_{\text{same as before}}]-\underbrace{\mathbb{D}_{\mathrm{KL}}\Big(q_{\boldsymbol{\phi}}(\mathbf{x}_{T}|\mathbf{x}_{0})\|p(\mathbf{x}_{T})\Big)}_{\text{new prior matching}}\\&\quad-\sum_{t=2}^T\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_0)}\Big[\underbrace{\mathbb{D}_{\mathrm{KL}}\Big(q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)\|p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)\Big)}_{\text{new consistency}}\Big].\end{aligned}$$

6 Derivation of $q_{\phi}(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)$

The distribution $q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)$ takes the form $q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)=\mathcal{N}(\mathbf{x}_{t-1}\mid\boldsymbol{\mu}_q(\mathbf{x}_t,\mathbf{x}_0),\boldsymbol{\Sigma}_q(t))$, where

$$\begin{aligned} \boldsymbol{\mu}_{q}(\mathbf{x}_{t},\mathbf{x}_{0})&=\frac{(1-\overline{\alpha}_{t-1})\sqrt{\alpha_{t}}}{1-\overline{\alpha}_{t}}\mathbf{x}_{t}+\frac{(1-\alpha_{t})\sqrt{\overline{\alpha}_{t-1}}}{1-\overline{\alpha}_{t}}\mathbf{x}_{0},\\ \boldsymbol{\Sigma}_{q}(t)&=\frac{(1-\alpha_t)(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_t}\mathbf{I}\stackrel{\mathrm{def}}{=}\sigma_q^2(t)\mathbf{I}. \end{aligned}$$

The derivation uses the property of quadratic exponents in Gaussians: assume the posterior is Gaussian, expand the product of Gaussian densities given by Bayes' theorem, and complete the square to read off the mean and variance.

The interesting part is that $q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)$ is completely characterized by $\mathbf{x}_t$ and $\mathbf{x}_0$. No neural network is required to estimate its mean and variance, so there is really nothing to "learn" here.
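Since the posterior parameters are pure functions of the schedule, they can be computed directly. A small helper sketch (the name `posterior_params` and the constant schedule are illustrative):

```python
import numpy as np

def posterior_params(x_t, x_0, t, alphas):
    """Mean and scalar variance of q(x_{t-1} | x_t, x_0).
    alphas[i] is alpha_{i+1}; t is 1-indexed."""
    a_bar = np.cumprod(alphas)
    a_t, ab_t = alphas[t - 1], a_bar[t - 1]
    ab_prev = a_bar[t - 2] if t > 1 else 1.0        # alpha_bar_0 = 1
    mu = (np.sqrt(a_t) * (1 - ab_prev) * x_t
          + np.sqrt(ab_prev) * (1 - a_t) * x_0) / (1 - ab_t)
    var = (1 - a_t) * (1 - ab_prev) / (1 - ab_t)    # sigma_q^2(t)
    return mu, var

alphas = np.full(10, 0.95)
mu, var = posterior_params(np.ones(3), np.zeros(3), t=5, alphas=alphas)
```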

To quickly calculate the KL divergence

$$\mathbb{D}_{\mathrm{KL}}(\underbrace{q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})}_{\text{nothing to learn}}\parallel\underbrace{p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}_{\text{need to do something}}),$$

we assume $p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)$ is also a Gaussian. To this end, we pick

$$p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_{t})=\mathcal{N}\Big(\mathbf{x}_{t-1}\mid\underbrace{\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_{t})}_{\text{neural network}},\sigma_{q}^{2}(t)\mathbf{I}\Big),$$

where the mean vector is determined by a neural network and the variance is chosen to be $\sigma_q^2(t)$. Now, we compare the two distributions:

$$\begin{aligned} q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})&=\mathcal{N}\Big(\mathbf{x}_{t-1}\mid\underbrace{\boldsymbol{\mu}_{q}(\mathbf{x}_{t},\mathbf{x}_{0})}_{\text{known}},\underbrace{\sigma_{q}^{2}(t)\mathbf{I}}_{\text{known}}\Big),\\ p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_{t})&=\mathcal{N}\Big(\mathbf{x}_{t-1}\mid\underbrace{\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_{t})}_{\text{neural network}},\underbrace{\sigma_{q}^{2}(t)\mathbf{I}}_{\text{known}}\Big). \end{aligned}$$

Therefore, the KL divergence simplifies to

$$\begin{aligned}&\mathbb{D}_{\mathrm{KL}}\Big(q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)\|p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)\Big)\\&=\mathbb{D}_{\mathrm{KL}}\Big(\mathcal{N}(\mathbf{x}_{t-1}\mid\boldsymbol{\mu}_q(\mathbf{x}_t,\mathbf{x}_0),\sigma_q^2(t)\mathbf{I})\,\|\,\mathcal{N}(\mathbf{x}_{t-1}\mid\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_t),\sigma_q^2(t)\mathbf{I})\Big)\\&=\frac1{2\sigma_q^2(t)}\|\boldsymbol{\mu}_q(\mathbf{x}_t,\mathbf{x}_0)-\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_t)\|^2,\end{aligned}$$

where we used the fact that the KL divergence between two Gaussians with identical variance is just the squared Euclidean distance between the two mean vectors, scaled by that variance. Going back to the ELBO, we can rewrite it as

$$\begin{aligned}\mathrm{ELBO}_{\boldsymbol{\theta}}(\mathbf{x})&=\mathbb{E}_{q(\mathbf{x}_1|\mathbf{x}_0)}[\log p_{\boldsymbol{\theta}}(\mathbf{x}_0|\mathbf{x}_1)]-\underbrace{\mathbb{D}_{\mathrm{KL}}\left(q(\mathbf{x}_T|\mathbf{x}_0)\|p(\mathbf{x}_T)\right)}_{\text{nothing to train}}\\&\quad-\sum_{t=2}^T\mathbb{E}_{q(\mathbf{x}_t|\mathbf{x}_0)}\bigg[\frac1{2\sigma_q^2(t)}\|\boldsymbol{\mu}_q(\mathbf{x}_t,\mathbf{x}_0)-\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_t)\|^2\bigg].\end{aligned}$$

  • Given $\mathbf{x}_1\sim q(\mathbf{x}_1|\mathbf{x}_0)$, we can calculate $\log p_{\boldsymbol{\theta}}(\mathbf{x}_0|\mathbf{x}_1)$, which is just $\log\mathcal{N}(\mathbf{x}_0\mid\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_1),\sigma_q^2(1)\mathbf{I})$. So, as soon as we know $\mathbf{x}_1$, we can send it to a network $\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_1)$ to get a mean estimate, which is then used to compute the likelihood.

7 Training and Inference

The ELBO suggests that we need to find a network $\boldsymbol{\mu}_{\boldsymbol{\theta}}$ that minimizes this loss:

$$\frac1{2\sigma_q^2(t)}\|\underbrace{\boldsymbol{\mu}_q(\mathbf{x}_t,\mathbf{x}_0)}_{\text{known}}-\underbrace{\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_t)}_{\text{network}}\|^2.$$

We recall that

$$\boldsymbol{\mu}_q(\mathbf{x}_t,\mathbf{x}_0)=\frac{(1-\overline{\alpha}_{t-1})\sqrt{\alpha_t}}{1-\overline{\alpha}_t}\mathbf{x}_t+\frac{(1-\alpha_t)\sqrt{\overline{\alpha}_{t-1}}}{1-\overline{\alpha}_t}\mathbf{x}_0.$$

Since $\boldsymbol{\mu}_{\boldsymbol{\theta}}$ is our design, there is no reason why we cannot define it as something more convenient. So here is an option:

$$\underbrace{\boldsymbol{\mu}_{\boldsymbol{\theta}}}_{\text{a network}}(\mathbf{x}_{t})\stackrel{\mathrm{def}}{=}\frac{(1-\overline{\alpha}_{t-1})\sqrt{\alpha_{t}}}{1-\overline{\alpha}_{t}}\mathbf{x}_{t}+\frac{(1-\alpha_{t})\sqrt{\overline{\alpha}_{t-1}}}{1-\overline{\alpha}_{t}}\underbrace{\widehat{\mathbf{x}}_{\boldsymbol{\theta}}(\mathbf{x}_{t})}_{\text{another network}}.$$

Then we have

$$\begin{aligned}\frac{1}{2\sigma_{q}^{2}(t)}\|\boldsymbol{\mu}_{q}(\mathbf{x}_{t},\mathbf{x}_{0})-\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_{t})\|^{2}&=\frac{1}{2\sigma_{q}^{2}(t)}\left\|\frac{(1-\alpha_{t})\sqrt{\overline{\alpha}_{t-1}}}{1-\overline{\alpha}_{t}}(\widehat{\mathbf{x}}_{\boldsymbol{\theta}}(\mathbf{x}_{t})-\mathbf{x}_{0})\right\|^{2}\\&=\frac1{2\sigma_q^2(t)}\frac{(1-\alpha_t)^2\overline{\alpha}_{t-1}}{(1-\overline{\alpha}_t)^2}\left\|\widehat{\mathbf{x}}_{\boldsymbol{\theta}}(\mathbf{x}_t)-\mathbf{x}_0\right\|^2.\end{aligned}$$

Therefore, the ELBO can be simplified into

$$\begin{aligned} \mathrm{ELBO}_{\boldsymbol{\theta}}&=\mathbb{E}_{q(\mathbf{x}_1|\mathbf{x}_0)}[\log p_{\boldsymbol{\theta}}(\mathbf{x}_0|\mathbf{x}_1)]-\sum_{t=2}^T\mathbb{E}_{q(\mathbf{x}_t|\mathbf{x}_0)}\Big[\frac1{2\sigma_q^2(t)}\|\boldsymbol{\mu}_q(\mathbf{x}_t,\mathbf{x}_0)-\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_t)\|^2\Big] \\ &=\mathbb{E}_{q(\mathbf{x}_1|\mathbf{x}_0)}[\log p_{\boldsymbol{\theta}}(\mathbf{x}_0|\mathbf{x}_1)]-\sum_{t=2}^T\mathbb{E}_{q(\mathbf{x}_t|\mathbf{x}_0)}\Big[\frac1{2\sigma_q^2(t)}\frac{(1-\alpha_t)^2\overline{\alpha}_{t-1}}{(1-\overline{\alpha}_t)^2}\left\|\widehat{\mathbf{x}}_{\boldsymbol{\theta}}(\mathbf{x}_t)-\mathbf{x}_0\right\|^2\Big]. \end{aligned}$$

The first term is

$$\begin{aligned} \log p_{\boldsymbol{\theta}}(\mathbf{x}_0|\mathbf{x}_1)&=\log\mathcal{N}(\mathbf{x}_0\mid\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_1),\sigma_q^2(1)\mathbf{I})\propto-\frac{1}{2\sigma_q^2(1)}\|\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_1)-\mathbf{x}_0\|^2 && \text{definition} \\ &=-\frac{1}{2\sigma_q^2(1)}\left\|\frac{(1-\overline{\alpha}_0)\sqrt{\alpha_1}}{1-\overline{\alpha}_1}\mathbf{x}_1+\frac{(1-\alpha_1)\sqrt{\overline{\alpha}_0}}{1-\overline{\alpha}_1}\widehat{\mathbf{x}}_{\boldsymbol{\theta}}(\mathbf{x}_1)-\mathbf{x}_0\right\|^2 && \text{recall }\overline{\alpha}_{0}=1 \\ &=-\frac{1}{2\sigma_{q}^{2}(1)}\left\|\frac{(1-\alpha_{1})}{1-\overline{\alpha}_{1}}\widehat{\mathbf{x}}_{\boldsymbol{\theta}}(\mathbf{x}_{1})-\mathbf{x}_{0}\right\|^{2}=-\frac{1}{2\sigma_{q}^{2}(1)}\left\|\widehat{\mathbf{x}}_{\boldsymbol{\theta}}(\mathbf{x}_{1})-\mathbf{x}_{0}\right\|^{2} && \text{recall }\overline{\alpha}_{1}=\alpha_{1} \end{aligned}$$

Then the ELBO becomes

$$\mathrm{ELBO}_{\boldsymbol{\theta}}=-\sum_{t=1}^T\mathbb{E}_{q(\mathbf{x}_t|\mathbf{x}_0)}\Big[\frac{1}{2\sigma_q^2(t)}\frac{(1-\alpha_t)^2\overline{\alpha}_{t-1}}{(1-\overline{\alpha}_t)^2}\left\|\widehat{\mathbf{x}}_{\boldsymbol{\theta}}(\mathbf{x}_t)-\mathbf{x}_0\right\|^2\Big].$$

Therefore, the training of the neural network boils down to a simple loss function:

The loss function for a denoising diffusion probabilistic model:

$$\boldsymbol{\theta}^*=\underset{\boldsymbol{\theta}}{\operatorname{argmin}}\sum_{t=1}^T\frac1{2\sigma_q^2(t)}\frac{(1-\alpha_t)^2\overline{\alpha}_{t-1}}{(1-\overline{\alpha}_t)^2}\mathbb{E}_{q(\mathbf{x}_t|\mathbf{x}_0)}\bigg[\left\|\widehat{\mathbf{x}}_{\boldsymbol{\theta}}(\mathbf{x}_t)-\mathbf{x}_0\right\|^2\bigg].$$
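A single term of this loss can be estimated by Monte Carlo: sample $\mathbf{x}_t\sim q(\mathbf{x}_t|\mathbf{x}_0)$ in closed form, then evaluate the weighted squared error. A sketch, where `x0_predictor` is a stand-in lambda rather than a real network, and the schedule values are made up:

```python
import numpy as np

def ddpm_x0_loss(x0, t, alphas, x0_predictor, rng):
    """One Monte-Carlo sample of the t-th loss term (t >= 2):
    weight * || x0_predictor(x_t, t) - x_0 ||^2 with x_t ~ q(x_t | x_0)."""
    a_bar = np.cumprod(alphas)
    a_t, ab_t, ab_prev = alphas[t - 1], a_bar[t - 1], a_bar[t - 2]
    sigma2 = (1 - a_t) * (1 - ab_prev) / (1 - ab_t)       # sigma_q^2(t)
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(ab_t) * x0 + np.sqrt(1 - ab_t) * eps    # closed-form forward sample
    weight = (1 - a_t) ** 2 * ab_prev / (2 * sigma2 * (1 - ab_t) ** 2)
    return weight * np.sum((x0_predictor(x_t, t) - x0) ** 2)

rng = np.random.default_rng(0)
alphas = np.linspace(0.999, 0.95, 20)
loss = ddpm_x0_loss(rng.standard_normal(8), t=10, alphas=alphas,
                    x0_predictor=lambda x, t: x, rng=rng)  # dummy "network"
```

In practice one draws a random $t$ and a random $\mathbf{x}_0$ per minibatch and backpropagates through the predictor.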


Inference: recursively sample $\mathbf{x}_{t-1}\sim p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)$, starting from $\mathbf{x}_T\sim\mathcal{N}(0,\mathbf{I})$ and running down to $t=1$.

8 Derivation based on Noise Vector

We can show that

$$\boldsymbol{\mu}_q(\mathbf{x}_t,\mathbf{x}_0)=\frac1{\sqrt{\alpha_t}}\mathbf{x}_t-\frac{1-\alpha_t}{\sqrt{1-\overline{\alpha}_t}\sqrt{\alpha_t}}\boldsymbol{\epsilon}_0.$$

So we can design our mean estimator $\boldsymbol{\mu}_{\boldsymbol{\theta}}$ with the form

$$\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_t)=\frac{1}{\sqrt{\alpha_t}}\mathbf{x}_t-\frac{1-\alpha_t}{\sqrt{1-\overline{\alpha}_t}\sqrt{\alpha_t}}\widehat{\boldsymbol{\epsilon}}_{\boldsymbol{\theta}}(\mathbf{x}_t).$$
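The two forms of $\boldsymbol{\mu}_q$ can be checked against each other numerically: substitute $\mathbf{x}_0=(\mathbf{x}_t-\sqrt{1-\overline{\alpha}_t}\,\boldsymbol{\epsilon}_0)/\sqrt{\overline{\alpha}_t}$ into the $(\mathbf{x}_t,\mathbf{x}_0)$ form and compare with the noise form. A sketch with an arbitrary schedule:

```python
import numpy as np

rng = np.random.default_rng(0)
alphas = rng.uniform(0.9, 0.999, size=12)
a_bar = np.cumprod(alphas)
t = 7                                                  # any step with t >= 2
a_t, ab_t, ab_prev = alphas[t - 1], a_bar[t - 1], a_bar[t - 2]

x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
x_t = np.sqrt(ab_t) * x0 + np.sqrt(1 - ab_t) * eps     # x_t built from (x_0, eps)

# mu_q written in terms of (x_t, x_0) ...
mu_x0_form = (np.sqrt(a_t) * (1 - ab_prev) * x_t
              + np.sqrt(ab_prev) * (1 - a_t) * x0) / (1 - ab_t)
# ... and the same mean written in terms of (x_t, eps).
mu_eps_form = (x_t / np.sqrt(a_t)
               - (1 - a_t) / (np.sqrt(1 - ab_t) * np.sqrt(a_t)) * eps)

assert np.allclose(mu_x0_form, mu_eps_form)
```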

Then we can get a new ELBO. Note that the weighting changes accordingly, since $\|\boldsymbol{\mu}_q-\boldsymbol{\mu}_{\boldsymbol{\theta}}\|^2=\frac{(1-\alpha_t)^2}{(1-\overline{\alpha}_t)\alpha_t}\|\widehat{\boldsymbol{\epsilon}}_{\boldsymbol{\theta}}(\mathbf{x}_t)-\boldsymbol{\epsilon}_0\|^2$:

$$\mathrm{ELBO}_{\boldsymbol{\theta}}=-\sum_{t=1}^T\mathbb{E}_{q(\mathbf{x}_t|\mathbf{x}_0)}\Big[\frac{1}{2\sigma_q^2(t)}\frac{(1-\alpha_t)^2}{(1-\overline{\alpha}_t)\alpha_t}\left\|\widehat{\boldsymbol{\epsilon}}_{\boldsymbol{\theta}}(\mathbf{x}_t)-\boldsymbol{\epsilon}_0\right\|^2\Big].$$

Consequently, the inference step can be derived through

$$\begin{aligned} \mathbf{x}_{t-1}\sim p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}\mid\mathbf{x}_{t})&=\mathcal{N}(\mathbf{x}_{t-1}\mid\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_t),\sigma_q^2(t)\mathbf{I}),\\ \mathbf{x}_{t-1}&=\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_t)+\sigma_q(t)\mathbf{z},\quad\mathbf{z}\sim\mathcal{N}(0,\mathbf{I}) \\ &=\frac{1}{\sqrt{\alpha_{t}}}\mathbf{x}_{t}-\frac{1-\alpha_{t}}{\sqrt{1-\overline{\alpha}_{t}}\sqrt{\alpha_{t}}}\widehat{\boldsymbol{\epsilon}}_{\boldsymbol{\theta}}(\mathbf{x}_{t})+\sigma_{q}(t)\mathbf{z} \\ &=\frac1{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{1-\alpha_t}{\sqrt{1-\overline{\alpha}_t}}\widehat{\boldsymbol{\epsilon}}_{\boldsymbol{\theta}}(\mathbf{x}_t)\right)+\sigma_q(t)\mathbf{z}. \end{aligned}$$
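The full ancestral sampling loop then looks like this sketch; the noise predictor is a placeholder lambda where a trained network would go, and the schedule is made up:

```python
import numpy as np

def ddpm_sample(eps_predictor, alphas, shape, rng):
    """Start from x_T ~ N(0, I) and apply the reverse update down to t = 1."""
    a_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)
    for t in range(len(alphas), 0, -1):
        a_t, ab_t = alphas[t - 1], a_bar[t - 1]
        ab_prev = a_bar[t - 2] if t > 1 else 1.0
        sigma = np.sqrt((1 - a_t) * (1 - ab_prev) / (1 - ab_t))  # sigma_q(t)
        z = rng.standard_normal(shape) if t > 1 else 0.0         # no noise at t = 1
        x = (x - (1 - a_t) / np.sqrt(1 - ab_t) * eps_predictor(x, t)) / np.sqrt(a_t) \
            + sigma * z
    return x

rng = np.random.default_rng(0)
sample = ddpm_sample(lambda x, t: np.zeros_like(x),   # placeholder predictor
                     alphas=np.linspace(0.999, 0.95, 20), shape=(4,), rng=rng)
```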

Repeating this update from $t=T$ down to $t=1$ yields a generated sample $\mathbf{x}_0$.

Reference

[1] Chan, Stanley H. “Tutorial on Diffusion Models for Imaging and Vision.” arXiv preprint arXiv:2403.18103 (2024).

Denoising Diffusion Probabilistic Model (DDPM)
https://fuwari.vercel.app/posts/denoising_diffusion_probabilistic_model-ddpm/
Author: pride7
Published: 2024-06-18
License: CC BY-NC-SA 4.0