Expediting TTS Synthesis with Adversarial Vocoding

*Paarth Neekhara, *Chris Donahue, Miller Puckette, Shlomo Dubnov, Julian McAuley

We present sound examples for the experiments in our paper Expediting TTS Synthesis with Adversarial Vocoding (paper, code). Sound files on this page were synthesized by vocoding mel spectrograms using various methods including our own. These mel spectrograms were extracted either from real waveforms (from the LJ Speech Dataset) or synthesized by a state-of-the-art TTS system (Tacotron-2).

(Table 1) Examining the effects of magnitude and phase estimation heuristics

These examples correspond to Table 1 from our paper. Here we are examining the effects of common techniques used in heuristic-based vocoding of mel spectrograms. Specifically, we mix and match different heuristics for performing magnitude estimation (converting log-frequency mel spectrograms into linear-frequency magnitude spectrograms) and phase estimation (estimating phase for the magnitude spectrogram). Each row in this table is labeled with the two methods used for both estimation problems (e.g. "Mel pseudoinverse + Griffin-Lim" uses the pseudoinverse of the mel basis to perform magnitude estimation, and Griffin-Lim to reconstruct phase for this estimated magnitude).
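As a rough illustration of the "Mel pseudoinverse" magnitude estimation heuristic described above, the following NumPy sketch builds a simplified triangular mel basis and applies its pseudoinverse to a mel spectrogram. The function names, the filterbank construction, and the parameter values (80 mel bins, 1024-point FFT, 22050 Hz sample rate) are assumptions made for illustration and are not taken from our paper or code. The sketch also assumes the mel spectrogram holds linear amplitudes rather than log amplitudes.

```python
import numpy as np

def hz_to_mel(f):
    # A common HTK-style mel scale formula (an assumption of this sketch).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Simplified triangular mel basis, shape (n_mels, n_fft // 2 + 1)."""
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    # n_mels + 2 band edges, equally spaced on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    basis = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        lower, center, upper = hz_points[i:i + 3]
        rising = (fft_freqs - lower) / (center - lower)
        falling = (upper - fft_freqs) / (upper - center)
        basis[i] = np.maximum(0.0, np.minimum(rising, falling))
    return basis

def pseudoinverse_magnitude(mel_spec, basis):
    """Estimate a linear-frequency magnitude spectrogram from a mel spectrogram.

    mel_spec: array of shape (n_mels, n_frames), linear amplitude.
    basis:    array of shape (n_mels, n_bins).
    """
    estimate = np.linalg.pinv(basis) @ mel_spec
    # The pseudoinverse can produce negative values; magnitudes cannot be negative.
    return np.maximum(estimate, 0.0)

# Hypothetical usage: compress a random magnitude spectrogram to mel, then invert.
basis = mel_filterbank(n_mels=80, n_fft=1024, sample_rate=22050)
magnitude = np.random.default_rng(0).random((513, 20))
mel_spec = basis @ magnitude
estimated_magnitude = pseudoinverse_magnitude(mel_spec, basis)
```

The estimated magnitude would then be passed to a phase estimator such as Griffin-Lim or LWS to obtain a waveform. Because the mel basis maps many frequency bins onto far fewer mel bins, the pseudoinverse recovers only a smoothed approximation of the original magnitude.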

We observe that coupling an ideal solution to one subproblem (using real data as a proxy) with a reasonable heuristic for the other ("Real mag + LWS" and "Mel pseudoinverse + Real phase") results in reasonable speech. Of the two subproblems, we target magnitude estimation in this work and use Local Weighted Sums (LWS) as the phase estimator. Hence, "Real mag + LWS" represents the upper bound on quality that our method hopes to achieve.

(Table 2 MOS-Real) Vocoding real mel spectrograms

These examples correspond to Table 2 (column MOS-Real) from our paper. Here we take waveforms from the real LJ Speech dataset and extract mel spectrograms. We then vocode these mel spectrograms back to audio using various methods. We do this to compare the quality of vocodings without the additional confounding factor of using synthetic spectrograms. We note that our methods (AdVoc and AdVoc-small) are significantly higher quality than the pseudoinverse heuristic and hundreds of times faster than the autoregressive WaveNet vocoder.

(Table 2 MOS-TTS) Vocoding synthetic spectrograms

These examples correspond to Table 2 (column MOS-TTS) from our paper. Here we use Tacotron-2 to generate synthetic spectrograms corresponding to a real transcript. We then vocode these synthetic spectrograms to waveforms using various methods. This represents a realistic TTS setting used in state-of-the-art systems. Again, our methods outperform the pseudoinverse heuristic.

Vocoding-based TTS on out-of-domain transcriptions

Here we show vocodings produced by various methods from synthetic spectrograms generated by Tacotron-2. This is intended to demonstrate that these systems can produce reasonable audio for transcripts that are outside of LJ Speech (we train both Tacotron-2 and AdVoc on this dataset).

(Table 3) Unsupervised generation of small-vocabulary speech

Here we are concerned with unsupervised generation of speech (as opposed to supervised as in TTS). The goal here is to train a system which can learn to speak coherent English words without labels. Specifically, we wish to generate spoken digits (i.e. 10 words "zero" through "nine"). Our previous work attempts to do this directly in the waveform domain (WaveGAN) and also in a naive spectrogram domain combined with Griffin-Lim for heuristic inversion (SpecGAN). Here we show that we can get substantially higher-quality results by first training a GAN to generate mel spectrograms (MelSpecGAN) and then training an adversarial vocoder to vocode those spectrograms into audio.