Adversarial Audio Synthesis (ICLR 2019) sound examples

Chris Donahue, Julian McAuley, Miller Puckette

We present sound examples from our WaveGAN and SpecGAN models (paper, code). Each sound file concatenates fifty one-second examples, with a half second of silence after each example. All models are trained in the unsupervised setting, and the results here are a random sampling of fifty latent vectors.
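For readers assembling similar demo files, the concatenation described above can be sketched in a few lines of NumPy. The function name and the 16 kHz sample rate are our assumptions for illustration, not the authors' exact script:

```python
import numpy as np

def concatenate_examples(examples, sr=16000, gap_s=0.5):
    """Concatenate one-second examples into a single waveform,
    inserting gap_s seconds of silence after each example.

    examples: list of 1-D float arrays (one generated clip each).
    sr is an assumed sample rate, not confirmed by the page.
    """
    silence = np.zeros(int(sr * gap_s), dtype=np.float32)
    pieces = []
    for x in examples:
        pieces.append(np.asarray(x, dtype=np.float32))
        pieces.append(silence)
    return np.concatenate(pieces)
```

The resulting array can then be written to a WAV file with any audio I/O library at the same sample rate.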

To generate more examples with these same models, see our interactive notebook on Google Colab. We also built a browser-based procedural drum machine out of our WaveGAN trained on drums.

Speech Commands Zero through Nine (SC09)

The SC09 dataset, a subset of the Speech Commands dataset (license), has many speakers and a ten word vocabulary. When trained on this dataset without label conditioning, our WaveGAN and SpecGAN models learn to generate coherent words. Results are arranged into numerical ordering by post-hoc labeling of random examples by the classifier discussed in the paper.

Bird vocalizations

Bird vocalizations collected from Peter Boesman (license)


Drum sound effects

Single drum hits from drum machines, collected from here


Piano

Professional pianist playing a variety of Bach compositions (original dataset collected for this work by the authors).

Bonus: with our updated codebase supporting longer audio slices, we trained a WaveGAN on four-second slices from Art Tatum's Solo Masterpieces, Vols. 1-8.

Large vocabulary speech (TIMIT)

The TIMIT dataset consists of many speakers reading English sentences. Recording conditions were carefully controlled and the utterances have much less noise than those from SC09.

Real data resynthesized with Griffin-Lim

The SpecGAN model operates on lossy audio spectrograms. As a point of comparison, we provide examples of the real data projected into this domain and resynthesized back into audio. This is useful for roughly gauging how much distortion is caused by the audio feature representation versus the SpecGAN generative process itself.
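As a rough sketch of this projection-and-resynthesis step: discarding phase leaves only a magnitude spectrogram, and the classic Griffin-Lim iteration (Griffin & Lim, 1984) estimates a compatible phase by alternating ISTFT and STFT. The frame sizes and iteration count below are illustrative assumptions, not SpecGAN's exact parameters:

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=60, fs=16000, nperseg=256, noverlap=128):
    """Recover a waveform from a magnitude spectrogram by repeatedly
    resynthesizing, re-analyzing, and reimposing the known magnitudes.

    mag: magnitude spectrogram from scipy.signal.stft with the same
    nperseg/noverlap. All parameters here are illustrative assumptions.
    """
    rng = np.random.default_rng(0)
    # Start from the known magnitudes with random phase.
    spec = mag * np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        _, y = istft(spec, fs=fs, nperseg=nperseg, noverlap=noverlap)
        _, _, S = stft(y, fs=fs, nperseg=nperseg, noverlap=noverlap)
        # Keep the estimated phase, restore the target magnitudes.
        spec = mag * np.exp(1j * np.angle(S))
    _, y = istft(spec, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return y
```

A round-trip comparison then amounts to taking `np.abs(stft(y)[2])` of a real clip and listening to `griffin_lim(...)` of that magnitude alongside the original.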

Quantitative evaluation experiments

We also provide examples for all models in Table 1 of our paper.

Comparison to existing methods

WARNING: Loud volume

On the SC09 dataset, we also compare to two other methods that learn to generate audio in the unsupervised setting: WaveNet (van den Oord et al., 2016) and SampleRNN (Mehri et al., 2017). While these implementations are known to produce excellent audio when trained on longer speech slices captured under clean recording conditions, neither appears to produce semantically meaningful results on our single-word SC09 dataset.


If you reference our work in your research, cite via the following BibTeX:

@inproceedings{donahue2019wavegan,
  title={Adversarial Audio Synthesis},
  author={Donahue, Chris and McAuley, Julian and Puckette, Miller},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2019}
}