In this project, we developed a one-shot multi-speaker text-to-speech system built on a novel transformer architecture that incorporates scaled speaker embeddings at different stages of the transformer. This lets us synthesize speech in the voice of any target speaker from only a 5-second clip of their voice. You can watch the presentation video below, and if you are interested in the code, it is available in the Colab notebook [here].
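To make the idea of injecting a scaled speaker embedding more concrete, here is a minimal NumPy sketch. It is not the project's code: the function name `inject_speaker`, the linear projection, the additive combination, the stage names, and the scale values are all assumptions chosen for illustration. The description above only states that scaled speaker embeddings enter the transformer at different stages.

```python
import numpy as np


def inject_speaker(hidden, speaker_emb, proj, scale):
    """Add a scaled, projected speaker embedding to every time step.

    hidden:      (T, d_model) hidden states at some stage of the model
    speaker_emb: (d_spk,) fixed-size embedding from a short reference clip
    proj:        (d_spk, d_model) projection matrix (assumed, hypothetical)
    scale:       scalar weight for this stage (assumed, hypothetical)
    """
    spk = speaker_emb @ proj      # project to model width: (d_model,)
    return hidden + scale * spk   # broadcasts over the T time steps


rng = np.random.default_rng(0)
T, d_model, d_spk = 6, 8, 4

# Stand-ins for real activations and a real speaker embedding.
hidden = rng.standard_normal((T, d_model))
speaker_emb = rng.standard_normal(d_spk)
speaker_emb /= np.linalg.norm(speaker_emb)  # unit-normalize the embedding

# Hypothetical injection points, each with its own projection and scale.
stages = ["encoder_output", "decoder_input"]
projs = {name: 0.1 * rng.standard_normal((d_spk, d_model)) for name in stages}
scales = {"encoder_output": 1.0, "decoder_input": 0.5}

conditioned = {
    name: inject_speaker(hidden, speaker_emb, projs[name], scales[name])
    for name in stages
}
```

In this sketch every time step receives the same speaker offset, and the per-stage scale controls how strongly the speaker identity influences that stage. In a trained model the projections and scales would typically be learned rather than fixed.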
vrishbhanu28