We present sound examples from our WaveGAN and SpecGAN models (paper, code). Each sound file contains fifty one-second examples concatenated together, with half a second of silence after each example. All models are trained in the unsupervised setting, and the results here are a random sampling of fifty latent vectors.
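For concreteness, the following is a minimal sketch of how such a concatenated sound file could be assembled. The generate function is a hypothetical stand-in for a trained generator that maps a latent vector to a one-second, 16 kHz waveform; the latent dimensionality and output path are illustrative assumptions.

# Minimal sketch: sample fifty latent vectors, generate one-second examples,
# and concatenate them with half a second of silence after each example.
import numpy as np
from scipy.io import wavfile

SAMPLE_RATE = 16000      # one-second examples at 16 kHz
NUM_EXAMPLES = 50
SILENCE = np.zeros(SAMPLE_RATE // 2, dtype=np.float32)  # half second of silence

def generate(z):
    """Hypothetical stand-in for a trained WaveGAN/SpecGAN generator."""
    raise NotImplementedError

segments = []
for _ in range(NUM_EXAMPLES):
    z = np.random.uniform(-1.0, 1.0, size=100).astype(np.float32)  # random latent vector
    segments.append(generate(z))
    segments.append(SILENCE)

wavfile.write('examples.wav', SAMPLE_RATE, np.concatenate(segments))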
The SC09 dataset, a subset of the Speech Commands dataset (license), has many speakers and a ten-word vocabulary. When trained on this dataset without label conditioning, our WaveGAN and SpecGAN models learn to generate coherent words. Results are arranged in numerical order by post-hoc labeling of random examples with the classifier discussed in the paper.
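As a rough illustration of this post-hoc ordering, the sketch below sorts randomly sampled examples by their predicted digit. The name classify_digit is a hypothetical placeholder for the SC09 classifier discussed in the paper.

# Sketch of arranging randomly sampled generated examples in numerical order.
import numpy as np

def classify_digit(waveform: np.ndarray) -> int:
    """Hypothetical placeholder: predict the spoken digit (0-9) of a waveform."""
    raise NotImplementedError

def sort_examples(examples):
    # Label each example with the classifier, then sort by the predicted digit.
    return sorted(examples, key=classify_digit)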
The TIMIT dataset consists of many speakers reading English sentences. Recording conditions were carefully controlled and the utterances have much less noise than those from SC09.
Real data
WaveGAN
SpecGAN
Real data resynthesized with Griffin-Lim
The SpecGAN model operates on lossy audio spectrograms. As a point of comparison, we provide examples of the real data projected into this domain and resynthesized back into audio. This is useful for roughly gauging how much distortion is caused by the audio feature representation versus the SpecGAN generative process itself.
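The following is a minimal sketch of this projection and resynthesis using librosa's Griffin-Lim implementation. The STFT parameters and file names are illustrative assumptions, not necessarily the exact SpecGAN feature settings; see the paper for those.

# Project real audio onto a magnitude spectrogram (discarding phase), then
# resynthesize it with Griffin-Lim, which iteratively estimates a phase
# consistent with the magnitudes.
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load('real_example.wav', sr=16000)

n_fft, hop_length = 256, 128   # illustrative: 16 ms windows, 8 ms hop at 16 kHz
S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))  # phase is discarded

y_hat = librosa.griffinlim(S, n_iter=32, hop_length=hop_length, win_length=n_fft)

sf.write('real_example_griffinlim.wav', y_hat, sr)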
SC09
Birds
Drums
Piano
TIMIT
Quantitative evaluation experiments
We also provide sound examples for all models in Table 1 of our paper. Several of these variants use the phase shuffle operation; a brief sketch of it follows the list.
Real data
Parametric (Buchanan 2017)
WaveGAN
WaveGAN + phase shuffle (n=2)
WaveGAN + phase shuffle (n=4)
WaveGAN + nearest neighbor upsampling
WaveGAN + linear upsampling
WaveGAN + cubic upsampling
WaveGAN + post-processing filter
WaveGAN + dropout
SpecGAN
SpecGAN + phase shuffle (n=1)
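As referenced above, several variants apply phase shuffle in the discriminator. The following is a framework-agnostic NumPy sketch of the operation as described in the paper: activations are shifted along the time axis by a random offset in [-n, n] samples, with missing samples filled in by reflection padding. The function name and array layout are illustrative.

# Sketch of phase shuffle for activations shaped (batch, time, channels).
import numpy as np

def phase_shuffle(x, n):
    if n == 0:
        return x
    shift = np.random.randint(-n, n + 1)          # random offset in [-n, n]
    pad_left, pad_right = max(shift, 0), max(-shift, 0)
    length = x.shape[1]
    # Reflection-pad along the time axis, then crop back to the original length.
    x_padded = np.pad(x, ((0, 0), (pad_left, pad_right), (0, 0)), mode='reflect')
    return x_padded[:, pad_right:pad_right + length, :]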
Comparison to existing methods
WARNING: Loud volume
On the SC09 dataset, we also compare to two other methods that learn to generate audio in the unsupervised setting: WaveNet (van den Oord et al. 2016) and SampleRNN (Mehri et al. 2017). While these implementations are known to produce excellent audio when trained on longer segments of speech captured under clean recording conditions, none appear to produce semantically meaningful results on our single-word SC09 dataset.
WaveNet (van den Oord et al. 2016) public implementation 1 (link)
WaveNet (van den Oord et al. 2016) public implementation 2 (link)
SampleRNN (Mehri et al. 2017) official implementation (link)
Attribution
If you reference our work in your research, please cite it via the following BibTeX entry:
@inproceedings{donahue2019wavegan,
  title={Adversarial Audio Synthesis},
  author={Donahue, Chris and McAuley, Julian and Puckette, Miller},
  booktitle={ICLR},
  year={2019}
}