Follow the full discussion on Reddit.
I am training a latent diffusion model from scratch with my own custom architecture. I have trained a VAE that downsamples 32x32 images to 16x16 latents (I know this seems dumb, but it is an oversimplification of the process I am actually using). I am now training the UNET on those latents, and while it fits them relatively well, the decoded output from the VAE is often low quality: small inaccuracies in the latents get magnified by the decoder (bear in mind the VAE is frozen, so it shouldn't change while the UNET is training). Would I be better off computing the loss on the decoded predictions rather than on the latent predictions?
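A toy numerical sketch of the effect being described: if the decoder has gain greater than 1, a small error in latent space becomes a much larger error in pixel space, which is the motivation for considering a decoded-space loss. All names here are illustrative assumptions, and the "decoder" is just a fixed linear map rather than a real VAE decoder:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins: latents are 16x16 flattened, the "decoder" is a
# fixed (frozen) linear map back to 32x32 pixels. A real VAE decoder is
# nonlinear, but a linear map with gain > 1 is enough to show the
# magnification effect.
latent_dim, pixel_dim = 16 * 16, 32 * 32
decoder = rng.normal(scale=2.0, size=(pixel_dim, latent_dim))

true_latents = rng.normal(size=latent_dim)
# Small latent-space inaccuracy, like a well-fit UNET prediction.
pred_latents = true_latents + rng.normal(scale=0.01, size=latent_dim)

# Loss measured in latent space vs. after decoding.
latent_mse = np.mean((pred_latents - true_latents) ** 2)
decoded_mse = np.mean((decoder @ pred_latents - decoder @ true_latents) ** 2)

print(f"latent-space MSE : {latent_mse:.6f}")
print(f"decoded-space MSE: {decoded_mse:.6f}")  # much larger: the decoder magnifies latent errors
```

A decoded-space loss would weight latent errors by how visible they are after decoding; the trade-off is that every training step then requires a forward (and backward) pass through the frozen decoder, with gradients flowing through it even though its parameters are not updated.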