We present sound examples from our WaveGAN and SpecGAN models (paper, code). Each sound file contains fifty one-second examples concatenated together, with half a second of silence after each example. All models are trained in the unsupervised setting, and the results here are a random sampling of fifty latent vectors.
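For concreteness, the following is a minimal sketch of how such a concatenated sound file could be assembled. The generate function is a hypothetical stand-in for a trained generator that maps a latent vector to a one-second, 16 kHz waveform; the latent dimensionality and output path are illustrative assumptions.

# Minimal sketch: sample fifty latent vectors, generate one-second examples,
# and concatenate them with half a second of silence after each example.
import numpy as np
from scipy.io import wavfile

SAMPLE_RATE = 16000      # one-second examples at 16 kHz
NUM_EXAMPLES = 50
SILENCE = np.zeros(SAMPLE_RATE // 2, dtype=np.float32)  # half second of silence

def generate(z):
    """Hypothetical stand-in for a trained WaveGAN/SpecGAN generator."""
    raise NotImplementedError

segments = []
for _ in range(NUM_EXAMPLES):
    z = np.random.uniform(-1.0, 1.0, size=100).astype(np.float32)  # random latent vector
    segments.append(generate(z))
    segments.append(SILENCE)

wavfile.write('examples.wav', SAMPLE_RATE, np.concatenate(segments))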
The SC09 dataset, a subset of the Speech Commands dataset (license), has many speakers and a ten-word vocabulary. When trained on this dataset without label conditioning, our WaveGAN and SpecGAN models learn to generate coherent words. Results are arranged in numerical order by post-hoc labeling of random examples with the classifier discussed in the paper.
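As a rough illustration of this post-hoc ordering, the sketch below sorts randomly sampled examples by their predicted digit. The name classify_digit is a hypothetical placeholder for the SC09 classifier discussed in the paper.

# Sketch of arranging randomly sampled generated examples in numerical order.
import numpy as np

def classify_digit(waveform: np.ndarray) -> int:
    """Hypothetical placeholder: predict the spoken digit (0-9) of a waveform."""
    raise NotImplementedError

def sort_examples(examples):
    # Label each example with the classifier, then sort by the predicted digit.
    return sorted(examples, key=classify_digit)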
The TIMIT dataset consists of many speakers reading English sentences. Recording conditions were carefully controlled and the utterances have much less noise than those from SC09.
Real data
WaveGAN
SpecGAN
Real data resynthesized with Griffin-Lim
The SpecGAN model operates on lossy audio spectrograms. As a point of comparison, we provide examples of the real data projected into this domain and resynthesized back into audio. This is useful for roughly gauging how much distortion is caused by the audio feature representation versus the SpecGAN generative process itself.
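The following is a minimal sketch of this projection and resynthesis using librosa's Griffin-Lim implementation. The STFT parameters and file names are illustrative assumptions, not necessarily the exact SpecGAN feature settings; see the paper for those.

# Project real audio onto a magnitude spectrogram (discarding phase), then
# resynthesize it with Griffin-Lim, which iteratively estimates a phase
# consistent with the magnitudes.
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load('real_example.wav', sr=16000)

n_fft, hop_length = 256, 128   # illustrative: 16 ms windows, 8 ms hop at 16 kHz
S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))  # phase is discarded

y_hat = librosa.griffinlim(S, n_iter=32, hop_length=hop_length, win_length=n_fft)

sf.write('real_example_griffinlim.wav', y_hat, sr)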
SC09
Birds
Drums
Piano
TIMIT
Quantitative evaluation experiments
We also provide sound examples for all models in Table 1 of our paper. Several of these variants use the phase shuffle operation; a brief sketch of it follows the list.
Real data
Parametric (Buchanan 2017)
WaveGAN
WaveGAN + phase shuffle (n=2)
WaveGAN + phase shuffle (n=4)
WaveGAN + nearest neighbor upsampling
WaveGAN + linear upsampling
WaveGAN + cubic upsampling
WaveGAN + post-processing filter
WaveGAN + dropout
SpecGAN
SpecGAN + phase shuffle (n=1)
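As referenced above, several variants apply phase shuffle in the discriminator. The following is a framework-agnostic NumPy sketch of the operation as described in the paper: activations are shifted along the time axis by a random offset in [-n, n] samples, with missing samples filled in by reflection padding. The function name and array layout are illustrative.

# Sketch of phase shuffle for activations shaped (batch, time, channels).
import numpy as np

def phase_shuffle(x, n):
    if n == 0:
        return x
    shift = np.random.randint(-n, n + 1)          # random offset in [-n, n]
    pad_left, pad_right = max(shift, 0), max(-shift, 0)
    length = x.shape[1]
    # Reflection-pad along the time axis, then crop back to the original length.
    x_padded = np.pad(x, ((0, 0), (pad_left, pad_right), (0, 0)), mode='reflect')
    return x_padded[:, pad_right:pad_right + length, :]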
Comparison to existing methods
WARNING: Loud volume
On the SC09 dataset, we also compare to two other methods that learn to generate audio in the unsupervised setting: WaveNet (van den Oord et al. 2016) and SampleRNN (Mehri et al. 2017). While these implementations are known to produce excellent audio when trained on longer segments of speech captured under clean recording conditions, none appear to produce semantically meaningful results on our single-word SC09 dataset.
WaveNet (van den Oord et al. 2016) public implementation 1 (link)
WaveNet (van den Oord et al. 2016) public implementation 2 (link)
SampleRNN (Mehri et al. 2017) official implementation (link)
Attribution
If you reference our work in your research, please cite it via the following BibTeX entry:
@inproceedings{donahue2019wavegan,
  title={Adversarial Audio Synthesis},
  author={Donahue, Chris and McAuley, Julian and Puckette, Miller},
  booktitle={ICLR},
  year={2019}
}