GAZEV: GAN-Based Zero-Shot Voice Conversion over Non-parallel Speech Corpus

Abstract

Non-parallel many-to-many voice conversion has recently attracted significant research effort in the speech processing community. A voice conversion system transforms an utterance of a source speaker into an utterance of a target speaker, maintaining the content of the original utterance while replacing the vocal characteristics with those of the target speaker. Existing solutions, e.g., StarGAN-VC and AutoVC, present promising results when speech corpora of the speakers are available during model training. However, such approaches may not generate conversion results of desirable quality when an unseen speaker is involved at inference time. In this paper, we present GAZEV, our new GAN-based zero-shot voice conversion solution, which aims to support unseen speakers for both source and target utterances. Our key technical contribution is the adoption of a new instance normalization strategy and a speaker embedding loss on top of the GAN framework, in order to address the limitations of speaking style transfer in existing solutions. Our empirical evaluations demonstrate significant improvements in both output speech quality and speaker similarity.
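To make the two contributions mentioned above concrete, below is a minimal, illustrative sketch (not the authors' implementation) of (a) adaptive instance normalization, where per-channel statistics of the content features are replaced by a scale and shift derived from the target speaker embedding, and (b) a speaker embedding loss measuring cosine distance between the converted utterance's embedding and the target speaker's embedding. All function names and the use of plain Python lists are assumptions made for illustration.

```python
import math

def adain(features, gamma, beta, eps=1e-5):
    """Adaptive instance normalization over one feature channel.

    features: list of floats (one channel of the content representation)
    gamma, beta: scale/shift predicted from the target speaker embedding
                 (hypothetical conditioning, for illustration only)
    """
    mean = sum(features) / len(features)
    var = sum((x - mean) ** 2 for x in features) / len(features)
    std = math.sqrt(var + eps)
    # Strip the source speaker's channel statistics, then re-stylize
    # using statistics derived from the target speaker embedding.
    return [gamma * (x - mean) / std + beta for x in features]

def speaker_embedding_loss(e_conv, e_tgt):
    """Cosine-distance-style loss between the embedding of the
    converted speech and the target speaker embedding (a sketch)."""
    dot = sum(a * b for a, b in zip(e_conv, e_tgt))
    n_conv = math.sqrt(sum(a * a for a in e_conv))
    n_tgt = math.sqrt(sum(b * b for b in e_tgt))
    # 0 when embeddings point the same way; larger when they diverge.
    return 1.0 - dot / (n_conv * n_tgt)
```

In practice both operations run on batched tensors inside the generator and its training loop; the scalar version above only illustrates the arithmetic.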

Demos

On AUTOVC's demo page [link], there are only short sentences (about 1 second).

To give a better idea of the performance difference, some longer sentences are provided here.

Also note that the AUTOVC model here is trained on 80 speakers, instead of the 40 used in the original paper.

Conversion Direction | Source | Target | AUTOVC [link] | GAZEV (proposed)
Seen-Seen            |        |        |               |
Seen-Unseen          |        |        |               |
Unseen-Seen          |        |        |               |
Unseen-Unseen        |        |        |               |