Suggestions on speaker embedding networks

Follow the full discussion on Reddit.
I am trying to implement speaker adaptation, so I have been experimenting with different approaches over the last month. The most promising proposal I have seen comes from Cooper et al. (2019), who suggest that training an independent network to produce speaker embeddings may work better, since it can yield richer and cleaner embeddings that can then be fed to a synthesiser. I attempted this by training a multi-speaker model on VCTK for approximately 950,000 steps and then using the embeddings from that checkpoint to train a new model on an unseen speaker, but I did not get good results (although I suspect the culprit was the extremely small amount of training data I had; the model was actually able to adapt to the speaker's voice quite quickly). I was wondering if anyone has suggestions for pipelines I could look into for this specific objective (which I would also like to base my Master's thesis on). I have tried to use this one, which looks extremely promising, but conda refuses to cooperate and will not install libraries such as scipy.
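
To make the idea concrete, here is a minimal sketch of what I mean by an "independent speaker embedding network" in the style of Cooper et al. (2019): a separately trained encoder (here a d-vector style LSTM) produces a fixed-size utterance embedding, and the synthesiser is conditioned on the per-speaker average of those embeddings. The class name, layer sizes, and mel dimensions below are all illustrative assumptions, not from any particular codebase or from the paper itself.

```python
# Illustrative sketch (PyTorch) of a separately trained speaker encoder.
# All names and hyperparameters here are assumptions for demonstration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """d-vector style encoder: LSTM over mel frames -> fixed-size embedding."""
    def __init__(self, n_mels=80, hidden=256, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, mels):                      # mels: (batch, frames, n_mels)
        _, (h, _) = self.lstm(mels)               # final hidden state of last layer
        emb = self.proj(h[-1])                    # (batch, emb_dim)
        return F.normalize(emb, dim=-1)           # L2-normalised speaker embedding

# After training this encoder on a speaker task (e.g. GE2E or speaker
# classification on VCTK), extract one embedding per utterance and
# average them per speaker:
encoder = SpeakerEncoder().eval()
with torch.no_grad():
    utterance_mels = torch.randn(10, 400, 80)     # dummy mels: 10 utterances of one speaker
    speaker_embedding = encoder(utterance_mels).mean(dim=0)

# The synthesiser would then consume speaker_embedding, e.g. by
# concatenating it to its text-encoder outputs at every timestep.
```

The point of keeping the encoder separate is that it can be trained on far more speakers (or on a speaker-verification objective) than the synthesiser ever sees, and the synthesiser only has to learn to consume the resulting vector.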
