Training a latent diffusion model from scratch

Follow the full discussion on Reddit.
I am training a latent diffusion model from scratch using my own custom architecture. I have trained a VAE that downsamples images of shape 32x32 to latents of shape 16x16 (I know this seems dumb, but it is an oversimplification of the process I am actually using).

I am now training the UNet. While it fits the latents (which is what it is trained on) reasonably well, the decoded output from the VAE is often low quality: small inaccuracies in the predicted latents get magnified by the decoder. Bear in mind the VAE is frozen, so it shouldn't change while the UNet is training.

Would I be better off calculating the loss on the decoded predictions rather than on the latent predictions?
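To make the two options concrete, here is a minimal sketch of what the choice looks like in a standard epsilon-prediction setup (PyTorch). The names `unet`, `vae.decode`, and `alphas_cumprod` are placeholders for illustration, not the poster's actual code; option B assumes the frozen decoder still lets gradients flow through to the UNet.

```python
import torch
import torch.nn.functional as F

def training_step(unet, vae, x0_latents, pixel_targets, alphas_cumprod,
                  use_pixel_loss=False):
    """One denoising training step for an epsilon-prediction UNet.

    x0_latents:    clean latents from the frozen VAE encoder, shape (B, C, 16, 16)
    pixel_targets: the original 32x32 images (only needed for option B)
    alphas_cumprod: cumulative product of (1 - beta_t) from the noise schedule
    """
    b = x0_latents.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=x0_latents.device)
    noise = torch.randn_like(x0_latents)

    # Forward diffusion: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0_latents + (1.0 - a_bar).sqrt() * noise

    eps_pred = unet(x_t, t)

    if not use_pixel_loss:
        # Option A (standard latent diffusion): MSE on the predicted noise,
        # computed entirely in latent space.
        return F.mse_loss(eps_pred, noise)

    # Option B (what the question asks about): recover the predicted clean latent,
    # decode it with the frozen VAE, and take the loss in pixel space. The decoder's
    # weights stay frozen, but gradients flow through it back into the UNet, so the
    # UNet is penalised for latent errors the decoder magnifies.
    x0_pred = (x_t - (1.0 - a_bar).sqrt() * eps_pred) / a_bar.sqrt()
    decoded = vae.decode(x0_pred)
    return F.mse_loss(decoded, pixel_targets)
```

Option B makes each step more expensive (a full decoder forward and backward pass), so a common compromise is to keep the latent loss as the main objective and add the decoded pixel loss as a small auxiliary term.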

