Training a latent diffusion model from scratch

Follow the full discussion on Reddit.
I am training a latent diffusion model from scratch using my own custom architecture. I have trained a VAE that downsamples images of shape 32x32 to latents of shape 16x16 (I know this seems dumb, but it is an oversimplification of the process I am actually using).

I am now training the UNet. While it fits the latents (which is what it is trained on) reasonably well, the decoded output from the VAE is often low quality: small inaccuracies in the predicted latents get magnified by the decoder. Bear in mind the VAE is frozen, so it shouldn't change while the UNet is training.

Would I be better off calculating the loss on the decoded predictions rather than on the latent predictions?
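To make the two options concrete, here is a minimal sketch of what the choice looks like in a standard epsilon-prediction setup (PyTorch). The names `unet`, `vae.decode`, and `alphas_cumprod` are placeholders for illustration, not the poster's actual code; option B assumes the frozen decoder still lets gradients flow through to the UNet.

```python
import torch
import torch.nn.functional as F

def training_step(unet, vae, x0_latents, pixel_targets, alphas_cumprod,
                  use_pixel_loss=False):
    """One denoising training step for an epsilon-prediction UNet.

    x0_latents:    clean latents from the frozen VAE encoder, shape (B, C, 16, 16)
    pixel_targets: the original 32x32 images (only needed for option B)
    alphas_cumprod: cumulative product of (1 - beta_t) from the noise schedule
    """
    b = x0_latents.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=x0_latents.device)
    noise = torch.randn_like(x0_latents)

    # Forward diffusion: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0_latents + (1.0 - a_bar).sqrt() * noise

    eps_pred = unet(x_t, t)

    if not use_pixel_loss:
        # Option A (standard latent diffusion): MSE on the predicted noise,
        # computed entirely in latent space.
        return F.mse_loss(eps_pred, noise)

    # Option B (what the question asks about): recover the predicted clean latent,
    # decode it with the frozen VAE, and take the loss in pixel space. The decoder's
    # weights stay frozen, but gradients flow through it back into the UNet, so the
    # UNet is penalised for latent errors the decoder magnifies.
    x0_pred = (x_t - (1.0 - a_bar).sqrt() * eps_pred) / a_bar.sqrt()
    decoded = vae.decode(x0_pred)
    return F.mse_loss(decoded, pixel_targets)
```

Option B makes each step more expensive (a full decoder forward and backward pass), so a common compromise is to keep the latent loss as the main objective and add the decoded pixel loss as a small auxiliary term.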

