V2Meow: Meowing to the Visual Beat via Video-to-Music Generation

Abstract

We propose V2Meow, a video-to-music generation system capable of producing high-quality music audio for a diverse range of video input types. Trained on 5,000 hours of music audio clips paired with video frames mined from in-the-wild music videos, V2Meow synthesizes high-fidelity music audio waveforms solely by conditioning on pre-trained general-purpose visual features extracted from video frames, with optional style control via text prompts. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms various existing music generation systems in terms of visual-audio correspondence and audio quality.

Publication
In the Proceedings of the AAAI Conference on Artificial Intelligence
Chris Donahue