Demonstration Page

SteerMusic: Enhanced Musical Consistency for Zero-shot Text-Guided and Personalized Music Editing

Abstract

Music editing is an important step in music production, with broad applications including game development and film production. Most existing zero-shot text-guided methods rely on pretrained diffusion models and edit through forward-backward diffusion processes. However, these methods often struggle to preserve the content of the original music. Additionally, text instructions alone usually fail to describe the desired music accurately. In this paper, we propose two music editing methods that enhance the consistency between the original and edited music by leveraging score distillation. The first method, SteerMusic, is a coarse-grained zero-shot editing approach based on the delta denoising score. The second method, SteerMusic+, enables fine-grained personalized music editing by manipulating a concept token that represents a user-defined musical style. SteerMusic+ allows music to be edited into user-defined musical styles that cannot be achieved through text instructions alone. Experimental results show that our methods outperform existing approaches in both music content consistency and editing fidelity. User studies further validate that our methods achieve superior music editing quality.


[Supplementary Material]


Part 1. SteerMusic: Zero-shot Text-guided Music Editing

This part demonstrates our SteerMusic method on the zero-shot text-guided music editing task.

Source Music

Source Prompt

A recording of piano music with no percussion propeller sound and male narrative voice at the end at a moderate tempo.

Target Prompt

A recording of trumpet music with no percussion propeller sound and male narrative voice at the end at a moderate tempo.



SteerMusic

MusicMagus

ZETA

SDEdit

DDIM

Source Music

Source Prompt

A recording featuring acoustic guitar mellow piano with passionate male vocals and crowd noises.

Target Prompt

A recording featuring acoustic guitar mellow harp with passionate male vocals and crowd noises.



SteerMusic

MusicMagus

ZETA

SDEdit

DDIM

Source Music

Source Prompt

A lively indian wedding band playing bhangra music with brass band and percussive dhol.

Target Prompt

A lively indian wedding band playing bhangra music with brass band and percussive drums.



SteerMusic

MusicMagus

ZETA

SDEdit

DDIM

Source Music

Source Prompt

A recording featuring a wide resonating woodwind melody with audio crackles.

Target Prompt

A recording featuring a wide resonating trumpet melody with audio crackles.



SteerMusic

MusicMagus

ZETA

SDEdit

DDIM

Source Music

Source Prompt

A delicate optimistic acoustic guitar instrumental.

Target Prompt

A delicate optimistic acoustic harp instrumental.



SteerMusic

MusicMagus

ZETA

SDEdit

DDIM

Source Music

Source Prompt

A recording of an uptempo country song with acoustic rhythm guitar passionate male vocal compressed electric guitar melody.

Target Prompt

A recording of an uptempo rock song with acoustic rhythm guitar passionate male vocal compressed electric guitar melody.



SteerMusic

MusicMagus

ZETA

SDEdit

DDIM

Source Music

Source Prompt

Melancholic 2000s rock with electric guitars electric bass and acoustic drums at a slow tempo.

Target Prompt

Melancholic 2000s jazz with electric guitars electric bass and acoustic drums at a slow tempo.



SteerMusic

MusicMagus

ZETA

SDEdit

DDIM

Source Music

Source Prompt

A recording of retro pop with string and piano accompaniment.

Target Prompt

A recording of retro jazz with string and piano accompaniment.



SteerMusic

MusicMagus

ZETA

SDEdit

DDIM

Source Music

Source Prompt

A modern rock recording featuring smooth male vocals steady ride cymbal pattern snare with brushes and fingerstyle acoustic guitar.

Target Prompt

A modern jazz recording featuring smooth male vocals steady ride cymbal pattern snare with brushes and fingerstyle acoustic guitar.



SteerMusic

MusicMagus

ZETA

SDEdit

DDIM

Source Music

Source Prompt

A jewish gospel song with male vocal acoustic instruments and emotional devotional theme.

Target Prompt

A jewish gospel song with male vocal acoustic instruments and joyful devotional theme.



SteerMusic

MusicMagus

ZETA

SDEdit

DDIM

Source Music

Source Prompt

A lively cheerful instrumental piece featuring classical strings and horn.

Target Prompt

A melancholy cheerful instrumental piece featuring classical strings and horn.



SteerMusic

MusicMagus

ZETA

SDEdit

DDIM

Source Music

Source Prompt

An emotional foreign language pop song with steady drumming keyboard acoustic guitar and violin harmonies.

Target Prompt

An exuberant foreign language pop song with steady drumming keyboard acoustic guitar and violin harmonies.



SteerMusic

MusicMagus

ZETA

SDEdit

DDIM


Part 2. SteerMusic+: Personalized Music Editing

This part demonstrates our SteerMusic+ method on the personalized music editing task.

| Source Music | Target Concept | SteerMusic+ | DreamSound | Textual Inversion |
|---|---|---|---|---|
| "A relaxing reggae track with acoustic drums bass keys guitar and female vocals." | [Reggae] | "A relaxing [reggae] track with acoustic drums bass keys guitar and female vocals." | "A relaxing [reggae] track with acoustic drums bass keys guitar and female vocals." | "A relaxing [reggae] track with acoustic drums bass keys guitar and female vocals." |
| "A recording of fingerstyle acoustic guitar with a classical atmosphere." | [Sitar] | "A recording of fingerstyle acoustic [Sitar] with a classical atmosphere." | "A recording of fingerstyle acoustic [Sitar] with a classical atmosphere." | "A recording of fingerstyle acoustic [Sitar] with a classical atmosphere." |
| "A guitar tutorial with energetic technique and ambient noises." | [Bouzouki] | "A [bouzouki] tutorial with energetic technique and ambient noises." | "A [bouzouki] tutorial with energetic technique and ambient noises." | "A [bouzouki] tutorial with energetic technique and ambient noises." |
| "A recording of an intense rhythmic drum battle" | [Bouzouki] | "A recording of an intense rhythmic [bouzouki] battle" | "A recording of an intense rhythmic [bouzouki] battle" | "A recording of an intense rhythmic [bouzouki] battle" |
| "A lively country tune featuring banjo acoustic piano violin and upright bass." | [Sitar] | "A lively country tune featuring banjo acoustic [sitar] violin and upright bass." | "A lively country tune featuring banjo acoustic [sitar] violin and upright bass." | "A lively country tune featuring banjo acoustic [sitar] violin and upright bass." |
| "A recording of renaissance music with soft wooden percussions and a mellow harmonized flute melody." | [Ocarina] | "A recording of renaissance music with soft wooden percussions and a mellow harmonized [ocarina] melody." | "A recording of renaissance music with soft wooden percussions and a mellow harmonized [ocarina] melody." | "A recording of renaissance music with soft wooden percussions and a mellow harmonized [ocarina] melody." |
| "A melancholic pop song with acoustic piano strings electronic bass and female vocals." | [Ocarina] | "A melancholic pop song with acoustic [ocarina] strings electronic bass and female vocals." | "A melancholic pop song with acoustic [ocarina] strings electronic bass and female vocals." | "A melancholic pop song with acoustic [ocarina] strings electronic bass and female vocals." |
| "A tranquil complex jazz live performance featuring instrumental improvisation on organ saxophone bass guitar and acoustic drums." | [Morricone] | "A tranquil complex [morricone] live performance featuring instrumental improvisation on organ saxophone bass guitar and acoustic drums." | "A tranquil complex [morricone] live performance featuring instrumental improvisation on organ saxophone bass guitar and acoustic drums." | "A tranquil complex [morricone] live performance featuring instrumental improvisation on organ saxophone bass guitar and acoustic drums." |
| "A recording featuring an amateur DJ performance with turntable scratching and electronic drums." | [Reggae] | "A recording featuring an amateur DJ performance with turntable scratching and [reggae] drums." | "A recording featuring an amateur DJ performance with turntable scratching and [reggae] drums." | "A recording featuring an amateur DJ performance with turntable scratching and [reggae] drums." |
| "A recording of eclectic rebellious rock music featuring an electric guitar solo keyboard bass guitar and acoustic drums." | [Hiphop] | "A recording of eclectic rebellious [hiphop] music featuring an electric guitar solo keyboard bass guitar and acoustic drums." | "A recording of eclectic rebellious [hiphop] music featuring an electric guitar solo keyboard bass guitar and acoustic drums." | "A recording of eclectic rebellious [hiphop] music featuring an electric guitar solo keyboard bass guitar and acoustic drums." |
| "A recording of rock and roll with electric guitar bass guitar acoustic drums and male vocals." | [Hiphop] | "A recording of [hiphop] and roll with electric guitar bass guitar acoustic drums and male vocals." | "A recording of [hiphop] and roll with electric guitar bass guitar acoustic drums and male vocals." | "A recording of [hiphop] and roll with electric guitar bass guitar acoustic drums and male vocals." |
| "A mellow rock piece featuring two guitars and a sultry female singer." | [Sarabande] | "A mellow [sarabande] piece featuring two guitars and a sultry female singer." | "A mellow [sarabande] piece featuring two guitars and a sultry female singer." | "A mellow [sarabande] piece featuring two guitars and a sultry female singer." |

References

  1. [MusicMagus] Zhang, Y., Ikemiya, Y., Xia, G., Murata, N., Martínez-Ramírez, M. A., Liao, W. H., ... & Dixon, S. (2024, August). MusicMagus: zero-shot text-to-music editing via diffusion models. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (pp. 7805-7813).
  2. [ZETA] Manor, H., & Michaeli, T. (2024, July). Zero-shot unsupervised and text-based audio editing using DDPM inversion. In Proceedings of the 41st International Conference on Machine Learning (pp. 34603-34629).
  3. [DDIM] Song, J., Meng, C., & Ermon, S. (2021). Denoising diffusion implicit models. In International Conference on Learning Representations.
  4. [SDEdit] Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J. Y., & Ermon, S. (2022). SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations.
  5. [DreamSound & Textual Inversion] Plitsis, M., Kouzelis, T., Paraskevopoulos, G., Katsouros, V., & Panagakis, Y. (2024, April). Investigating personalization methods in text to music generation. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1081-1085). IEEE.