Section I. Performance of our melody transcription methods and baselines on RWC-MDB

This section serves as the primary holistic comparison of our approach to melody transcription against existing baselines. Our method involves training Transformers (Vaswani et al. 17) on a dataset of crowdsourced melody annotations derived from HookTheory. We compare the efficacy of using different music audio feature representations as inputs to these models. Specifically, we examine:

  1. Mel spectrograms, representative of conventional approaches to transcription (a rough sketch of this feature extraction follows the list)
  2. Representations extracted from MT3 (Gardner et al. 21), a model pre-trained on many transcription tasks
  3. Representations extracted from Jukebox (Dhariwal et al. 20), a large model pre-trained to generate music audio
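
For concreteness, the snippet below is a minimal sketch of the kind of log-Mel spectrogram feature extraction the first input representation refers to, using librosa. The sample rate, hop length, and number of Mel bins are illustrative assumptions, not necessarily the exact formulation used in our models.

```python
import librosa
import numpy as np

def log_mel_features(audio_path, sr=22050, n_mels=128, hop_length=256):
    """Compute a log-scaled Mel spectrogram as a (frames, n_mels) feature matrix.

    Parameter values here are illustrative; the exact configuration in our
    models may differ.
    """
    y, sr = librosa.load(audio_path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=n_mels, hop_length=hop_length)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    return log_mel.T  # one feature vector per frame, ready for a Transformer encoder
```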

Below we compare the performance of our three models (bottom) to that of four other melody transcription baselines (middle) on ten segments from RWC-MDB (Goto et al. 02). These particular segments were chosen to enable comparison to Ryynänen and Klapuri 08, who released their transcriptions of these segments. The four baselines are:

  1. MT3 (Gardner et al. 21), which was not trained on melody transcription, i.e., it is evaluated zero-shot
  2. Combining Melodia (Salamon et al. 12), a melody extraction algorithm, with heuristic note segmentation
  3. Tony (Mauch et al. 15), a monophonic transcription algorithm, on vocals isolated with Spleeter (Hennequin 19)
  4. The melody transcription algorithm from Ryynänen and Klapuri 08, which combines DSP with an HMM

[Table: note-level F1 scores on the ten RWC-MDB segments for the four baselines and our three models]
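
For reference, here is a hedged sketch of how such note-level F1 scores can be computed with mir_eval, assuming onset-only matching (offsets ignored) and a 50 ms onset tolerance; these settings are common defaults rather than a statement of our exact evaluation configuration.

```python
import numpy as np
import mir_eval

def note_f1(ref_notes, est_notes, onset_tolerance=0.05):
    """Onset-only note-level F1.

    ref_notes / est_notes: lists of (onset_sec, offset_sec, midi_pitch) tuples.
    Offsets are ignored (offset_ratio=None), a common choice for melody transcription.
    """
    ref_intervals = np.array([[on, off] for on, off, _ in ref_notes])
    ref_pitches = np.array([mir_eval.util.midi_to_hz(p) for _, _, p in ref_notes])
    est_intervals = np.array([[on, off] for on, off, _ in est_notes])
    est_pitches = np.array([mir_eval.util.midi_to_hz(p) for _, _, p in est_notes])
    precision, recall, f1, _ = mir_eval.transcription.precision_recall_f1_overlap(
        ref_intervals, ref_pitches, est_intervals, est_pitches,
        onset_tolerance=onset_tolerance, offset_ratio=None)
    return f1
```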

Note that our method benefits from our new melody transcription dataset, which is substantially larger than that of past efforts. Hence, the stronger performance of "Mel + Transformer" compared to the baselines may be interpreted as the benefit of collecting a large dataset for this task, and the yet stronger performance of "MT3 + Transformer" and "Jukebox + Transformer" as the additional benefit of leveraging pre-trained models.

Section II. Comparing different pre-trained representations on HookTheory

In addition to evaluating our methods on a small set of ten segments, we also compute performance on the entire HookTheory test set, which contains over a thousand human-labeled segments. We also compare the benefit of combining different input features (bottom). Our results show that features are complementary to a degree—the strongest performance is obtained by combining all three features—but the benefits are marginal compared to using Jukebox alone.

[Table: note-level F1 scores on the HookTheory test set for individual and combined input feature representations]
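
One simple way to combine per-frame representations, assuming they have already been resampled to a common frame rate, is frame-wise concatenation; the sketch below is illustrative and not necessarily the exact combination mechanism used in our models.

```python
import numpy as np

def combine_features(mel, mt3, jukebox):
    """Frame-wise concatenation of three per-frame feature matrices.

    Each input has shape (frames, dims) and is assumed to share a common
    frame rate. Illustrative only; not necessarily the combination strategy
    used in our models.
    """
    n_frames = min(len(mel), len(mt3), len(jukebox))
    return np.concatenate(
        [mel[:n_frames], mt3[:n_frames], jukebox[:n_frames]], axis=-1)
```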

Section III. Refining alignments from HookTheory

One challenge of using HookTheory for melody transcription is that the user-specified alignments between the score and the audio are crude. Users only provide a timestamp of the start and end of their annotated segment within the audio track—these timestamps are often approximate, and any tempo deviations within the segment will further distort the alignment.

To refine these alignments, we propose a strategy that relies on beat and downbeat detections from madmom (Böck et al. 16). Our approach first aligns the first downbeat of the segment to the detected downbeat closest to the user-specified starting timestamp. Then, the remaining segment beats are aligned to the subsequent detected beats.
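
As a rough illustration of this procedure, here is a minimal sketch built on madmom's downbeat tracker. The function name, its arguments, and the beats_per_bar and fps settings are assumptions for illustration rather than our exact implementation.

```python
import numpy as np
from madmom.features.downbeats import (
    RNNDownBeatProcessor, DBNDownBeatTrackingProcessor)

def refine_alignment(audio_path, user_start_sec, num_segment_beats):
    """Snap a user-specified segment start to detected beats.

    Returns one timestamp per segment beat: the detected downbeat closest to
    the user's starting timestamp, followed by the subsequent detected beats.
    Illustrative sketch; see the paper for the exact procedure.
    """
    act = RNNDownBeatProcessor()(audio_path)
    beats = DBNDownBeatTrackingProcessor(beats_per_bar=[3, 4], fps=100)(act)
    times, positions = beats[:, 0], beats[:, 1]
    downbeat_idxs = np.flatnonzero(positions == 1)
    # Detected downbeat closest to the user-specified start of the segment.
    start_idx = downbeat_idxs[
        np.argmin(np.abs(times[downbeat_idxs] - user_start_sec))]
    # Remaining segment beats map onto the subsequent detected beats.
    return times[start_idx:start_idx + num_segment_beats]
```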

In an informal listening test, we found that our refinement strategy improved the alignment for 95 out of 100 segments. The primary failure mode occurs when madmom detects the wrong beat as the downbeat. Additionally, the user-specified starting timestamp is occasionally so imprecise that the transcription gets aligned to the wrong measure. In practice, it appears to be possible to train effective transcription models using our refined alignments despite these occasional hiccups.

(Bonus!) Section IV. Lead sheet transcription results on HookTheory

As a bonus, we present results from Sheet Sage, a system we built to automatically convert music audio into lead sheets, powered by our Jukebox-based melody transcription model. To build Sheet Sage, we also trained a Jukebox-based chord recognition model on the chord annotations from HookTheory. To render lead sheets, we combine our melody transcription and chord recognition models with beat detections from madmom (Böck et al. 16) and symbolic key estimation via the Krumhansl-Schmuckler algorithm (Krumhansl 90). See our paper for full details.
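
For readers unfamiliar with Krumhansl-Schmuckler key estimation, the sketch below shows the basic idea: correlate a duration-weighted pitch-class histogram of the transcribed notes against rotated Krumhansl-Kessler key profiles and pick the best-matching key. This is a generic illustration of the algorithm, not Sheet Sage's exact implementation.

```python
import numpy as np

# Krumhansl-Kessler major and minor key profiles (Krumhansl 90).
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def estimate_key(pitches, durations):
    """Krumhansl-Schmuckler key estimation from symbolic notes.

    pitches: MIDI pitch numbers; durations: note durations (used as weights).
    Returns e.g. 'G major' or 'E minor'. Generic sketch, not Sheet Sage's code.
    """
    # Duration-weighted pitch-class histogram.
    hist = np.zeros(12)
    for pitch, dur in zip(pitches, durations):
        hist[pitch % 12] += dur
    best, best_r = None, -np.inf
    for tonic in range(12):
        rotated = np.roll(hist, -tonic)  # put the candidate tonic at index 0
        for profile, mode in ((MAJOR, 'major'), (MINOR, 'minor')):
            r = np.corrcoef(rotated, profile)[0, 1]
            if r > best_r:
                best, best_r = f'{NAMES[tonic]} {mode}', r
    return best
```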

Below we make a qualitative comparison between lead sheets from Sheet Sage and human-transcribed lead sheets from HookTheory. We highlight songs for which our method does well (🍒) as well as songs representative of different failure modes of our approach (🍋).

🍒 Good
🍋 Bad