Suggestions on speaker embedding networks

Follow the full discussion on Reddit.
I am trying to implement speaker adaptation, so I have been experimenting with different approaches over the last month. The most promising proposal I have seen comes from Cooper et al. (2019), who suggest that training an independent network to produce speaker embeddings may work better, since it can yield richer and cleaner embeddings that can then be fed to a synthesiser. I attempted this by training a multi-speaker model on VCTK for approximately 950,000 steps and then using the embeddings from that checkpoint to train a new model on an unseen speaker, but I did not get good results (although I suspect the culprit was the extremely small amount of training data I had; the model was actually able to adapt to the speaker's voice quite quickly). I was wondering if anyone has suggestions for pipelines I could look into for this specific objective (which I would also like to base my Master's thesis on). I have tried to use this one, which looks extremely promising, but conda refuses to cooperate and will not install libraries such as scipy.
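
To make the idea concrete, here is a minimal sketch of what I mean by an "independent speaker embedding network" in the style of Cooper et al. (2019): a separately trained encoder (here a d-vector style LSTM) produces a fixed-size utterance embedding, and the synthesiser is conditioned on the per-speaker average of those embeddings. The class name, layer sizes, and mel dimensions below are all illustrative assumptions, not from any particular codebase or from the paper itself.

```python
# Illustrative sketch (PyTorch) of a separately trained speaker encoder.
# All names and hyperparameters here are assumptions for demonstration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """d-vector style encoder: LSTM over mel frames -> fixed-size embedding."""
    def __init__(self, n_mels=80, hidden=256, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, mels):                      # mels: (batch, frames, n_mels)
        _, (h, _) = self.lstm(mels)               # final hidden state of last layer
        emb = self.proj(h[-1])                    # (batch, emb_dim)
        return F.normalize(emb, dim=-1)           # L2-normalised speaker embedding

# After training this encoder on a speaker task (e.g. GE2E or speaker
# classification on VCTK), extract one embedding per utterance and
# average them per speaker:
encoder = SpeakerEncoder().eval()
with torch.no_grad():
    utterance_mels = torch.randn(10, 400, 80)     # dummy mels: 10 utterances of one speaker
    speaker_embedding = encoder(utterance_mels).mean(dim=0)

# The synthesiser would then consume speaker_embedding, e.g. by
# concatenating it to its text-encoder outputs at every timestep.
```

The point of keeping the encoder separate is that it can be trained on far more speakers (or on a speaker-verification objective) than the synthesiser ever sees, and the synthesiser only has to learn to consume the resulting vector.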
