Section I. Performance of our melody transcription methods and baselines on RWC-MDB
This section serves as the primary holistic comparison of our approach to melody transcription against existing baselines. Our method involves training Transformers (Vaswani et al. 17) on a dataset of crowdsourced melody annotations derived from HookTheory. We compare the efficacy of using different music audio feature representations as inputs to these models. Specifically, we examine:

- Mel spectrograms computed directly from the audio ("Mel + Transformer"),
- features from MT3, a pre-trained music transcription model ("MT3 + Transformer"), and
- representations from Jukebox, a pre-trained generative model of music ("Jukebox + Transformer").
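All three variants share the same basic recipe: per-frame feature vectors go into a Transformer, which predicts melody notes framewise. The PyTorch sketch below shows one minimal way to set this up; the dimensions, output parameterization, and omission of positional encodings are illustrative simplifications, not our exact architecture.

```python
# Minimal sketch of the "features -> Transformer -> framewise notes" recipe.
# Dimensions and the output parameterization are illustrative placeholders;
# positional encodings are omitted for brevity.
import torch
import torch.nn as nn

class MelodyTranscriber(nn.Module):
    def __init__(self, feat_dim, num_pitches=88, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)   # project input features to model width
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # One logit per frame for "no melody note" plus one per candidate pitch.
        self.head = nn.Linear(d_model, num_pitches + 1)

    def forward(self, feats):                      # feats: (batch, frames, feat_dim)
        x = self.proj(feats)
        x = self.encoder(x)
        return self.head(x)                        # (batch, frames, num_pitches + 1)

# Usage: feats could be Mel spectrogram frames or per-frame activations from a
# pre-trained model (MT3, Jukebox); feat_dim=229 here is an arbitrary example.
model = MelodyTranscriber(feat_dim=229)
logits = model(torch.randn(2, 100, 229))
print(logits.shape)                                # -> torch.Size([2, 100, 89])
```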
Below we compare the performance of our three models (bottom) to four melody transcription baselines (middle) on ten segments from RWC-MDB (Goto et al. 02). These particular segments were chosen to enable comparison to Ryynänen and Klapuri 08, who released transcriptions for them. The four baselines are:
[Table: F1 on the ten RWC-MDB segments for the four baselines (middle) and our three models (bottom).]
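As a point of reference, note-level transcription F1 of this kind is commonly computed with mir_eval by matching estimated notes to reference notes on onset time and pitch. The snippet below is purely illustrative of that evaluation style; the exact variant we report is defined in our paper.

```python
# Illustrative note-level F1 via mir_eval (a standard MIR evaluation library).
# The values below are made-up toy data, not results from the table.
import numpy as np
import mir_eval

# Each note is an (onset, offset) interval in seconds plus a pitch in Hz.
ref_intervals = np.array([[0.0, 0.5], [0.5, 1.0]])
ref_pitches = np.array([440.0, 493.88])
est_intervals = np.array([[0.02, 0.48], [0.5, 1.1]])
est_pitches = np.array([440.0, 523.25])

precision, recall, f1, _ = mir_eval.transcription.precision_recall_f1_overlap(
    ref_intervals, ref_pitches, est_intervals, est_pitches,
    onset_tolerance=0.05,   # onsets must match within 50 ms
    offset_ratio=None)      # ignore offsets (onset-only matching)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```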
Note that our method benefits from our new melody transcription dataset, which is substantially larger than that of past efforts. Hence, the stronger performance of "Mel + Transformer" compared to the baselines may be interpreted as the benefit of collecting a large dataset for this task, and the yet stronger performance of "MT3 + Transformer" and "Jukebox + Transformer" as the additional benefit of leveraging pre-trained models.
Section II. Comparing different pre-trained representations on HookTheory
In addition to evaluating our methods on a small set of ten segments, we also compute performance on the entire HookTheory test set, which contains over a thousand human-labeled segments. We also compare the benefit of combining different input features (bottom). Our results show that features are complementary to a degree—the strongest performance is obtained by combining all three features—but the benefits are marginal compared to using Jukebox alone.
[Table: F1 on the HookTheory test set for models trained on individual input features (top) and on combinations of features (bottom).]
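When combining features, each representation has its own frame rate and dimensionality, so the streams must be brought to a common time base before being fed to the model. The sketch below shows one simple way to do this, via nearest-neighbor resampling followed by concatenation along the feature axis; the frame rates and dimensions are placeholder values, and the paper describes the combination we actually use.

```python
# Illustrative sketch of combining multiple per-frame feature streams by
# resampling each to a common frame rate and concatenating along the feature
# axis. Frame rates and dimensions below are hypothetical placeholders.
import numpy as np

def resample_frames(feats, src_rate, tgt_rate, num_tgt_frames):
    """Nearest-neighbor resampling of (frames, dim) features to a target rate."""
    tgt_times = np.arange(num_tgt_frames) / tgt_rate
    src_idx = np.clip(np.round(tgt_times * src_rate).astype(int), 0, len(feats) - 1)
    return feats[src_idx]

tgt_rate = 62.5                        # e.g. 16 ms frames (hypothetical)
num_frames = int(8.0 * tgt_rate)       # an 8-second segment

mel = np.random.randn(800, 229)        # e.g. Mel spectrogram at 100 frames/s
mt3 = np.random.randn(500, 512)        # e.g. MT3 encoder features at 62.5 frames/s
jukebox = np.random.randn(2756, 4800)  # e.g. Jukebox activations at ~344.5 frames/s

combined = np.concatenate(
    [resample_frames(f, r, tgt_rate, num_frames)
     for f, r in [(mel, 100.0), (mt3, 62.5), (jukebox, 344.5)]],
    axis=-1)                           # -> (num_frames, 229 + 512 + 4800)
print(combined.shape)
```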
Section III. Refining alignments from HookTheory
One challenge of using HookTheory for melody transcription is that the user-specified alignments between the score and the audio are crude. Users only provide a timestamp of the start and end of their annotated segment within the audio track—these timestamps are often approximate, and any tempo deviations within the segment will further distort the alignment.
To refine these alignments, we propose a strategy which relies on beat and downbeat detections from madmom (Böck et al. 16). Our approach first aligns the first downbeat of the segment to the detected downbeat which is closest to the user-specified starting timestamp. Then, the remaining segment beats are aligned to the subsequent detected beats; a sketch of this procedure and some examples follow:
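The snippet below sketches this refinement with madmom's downbeat tracking API. The function name and the assumption that each segment spans a known number of beats are ours for illustration; see the paper for the full procedure.

```python
# Sketch of alignment refinement with madmom's beat/downbeat tracking.
# Assumptions (illustrative): the segment has a user-specified start time and a
# known number of beats; the helper name refine_alignment is not from the paper.
import numpy as np
from madmom.features.downbeats import (
    RNNDownBeatProcessor, DBNDownBeatTrackingProcessor)

def refine_alignment(audio_path, user_start_time, num_segment_beats):
    # Detect (time, beat_position) pairs; beat_position == 1 marks a downbeat.
    activations = RNNDownBeatProcessor()(audio_path)
    tracker = DBNDownBeatTrackingProcessor(beats_per_bar=[3, 4], fps=100)
    beats = tracker(activations)                  # shape: (num_beats, 2)

    beat_times = beats[:, 0]
    downbeat_idxs = np.flatnonzero(beats[:, 1] == 1)

    # Align the segment's first downbeat to the detected downbeat closest to
    # the user-specified starting timestamp...
    start_idx = downbeat_idxs[
        np.argmin(np.abs(beat_times[downbeat_idxs] - user_start_time))]

    # ...then map the remaining segment beats to the subsequent detected beats.
    return beat_times[start_idx:start_idx + num_segment_beats]

# Example (hypothetical file and segment):
# beat_times = refine_alignment("song.mp3", user_start_time=42.7, num_segment_beats=32)
```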
In an informal listening test, we find that our refinement strategy improved the alignment for 95 out of 100 segments. The primary failure mode occurs when madmom detects the wrong beat as the downbeat. Additionally, on occasion the user-specified starting timestamp is so imprecise that the transcription gets aligned to the wrong measure. In practice, it appears to be possible to train effective transcription models using our refined alignments despite these occasional hiccups.
(Bonus!) Section IV. Lead sheet transcription results on HookTheory
As a bonus, we present results from Sheet Sage, a system we built to automatically convert music audio into lead sheets, powered by our Jukebox-based melody transcription model. To build Sheet Sage, we additionally trained a Jukebox-based chord recognition model on the chord annotations from HookTheory. To render lead sheets, we combined our melody transcription and chord recognition models with beat detections from madmom (Böck et al. 16) and symbolic key estimation via the Krumhansl-Schmuckler algorithm (Krumhansl 90); a sketch of the key estimation step appears below. See our paper for full details.
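For readers unfamiliar with Krumhansl-Schmuckler key estimation, the sketch below shows the standard procedure of correlating a pitch-class distribution against rotated major and minor key profiles. The profile values are the published Krumhansl-Kessler weights; the input histogram is a made-up example, and this is an illustration of the algorithm rather than our exact implementation.

```python
# Sketch of Krumhansl-Schmuckler key estimation: correlate the pitch-class
# distribution of the transcribed notes with the 24 rotated key profiles.
# Profile values are the standard Krumhansl-Kessler weights (Krumhansl 90).
import numpy as np

MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F",
                 "F#", "G", "G#", "A", "A#", "B"]

def estimate_key(pc_histogram):
    """Return the (tonic, mode) whose rotated profile best correlates with the
    pitch-class histogram (e.g. duration-weighted note counts)."""
    best, best_r = None, -np.inf
    for tonic in range(12):
        for mode, profile in [("major", MAJOR), ("minor", MINOR)]:
            rotated = np.roll(profile, tonic)           # profile for this tonic
            r = np.corrcoef(pc_histogram, rotated)[0, 1]
            if r > best_r:
                best, best_r = (PITCH_CLASSES[tonic], mode), r
    return best

# Made-up example: a histogram concentrated on the C major scale.
hist = np.array([5, 0, 3, 0, 4, 3, 0, 5, 0, 3, 0, 2], dtype=float)
print(estimate_key(hist))  # likely ("C", "major")
```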
Below we make a qualitative comparison between lead sheets from Sheet Sage and human-transcribed lead sheets from HookTheory. We highlight songs for which our method does well (🍒) as well as songs representative of different failure modes of our approach (🍋).
[Table: paired lead sheets from Sheet Sage and HookTheory. 🍒 Good: songs where our method does well. 🍋 Bad: songs representative of different failure modes.]