Recent advances in deep learning have introduced powerful new tools for audio synthesis, providing musicians with unprecedented means to explore and manipulate soundscapes. In particular, the rise of generative models has opened new creative perspectives and enabled novel forms of musical expression. Despite their success, a critical limitation of these models is their lack of expressive control, which makes it difficult for musicians to intentionally guide the generation.
To address this issue, researchers have explored various strategies to control generative models through user-specified inputs. Among these approaches, diffusion models have emerged as a promising solution for synthesizing high-quality data while offering robust conditioning mechanisms to guide the generative process. Such models have led to numerous new applications, notably in the audio domain, with the recent advent of text-to-music models such as Stable Audio 2 [1], where users can generate an entire song from a short textual description. While these applications illustrate the complex control possibilities offered by diffusion models, they still fail to capture the full depth of musical intent and have a limited artistic reach. Indeed, the low expressiveness of short text prompts, as well as their static nature, contradicts the idea of precisely manipulating complex attributes over time, which is required to integrate neural networks within novel musical instruments. Consequently, several approaches have sought to condition generative models on more creatively relevant controls, such as melody or rhythm [2]. However, these approaches still fail to account for the inherent time and frequency hierarchies within musical compositions. Indeed, music is structured across multiple temporal scales: short-term information, such as pitch or onsets, intertwines into intricate patterns, such as melody or groove, which eventually give rise to complex long-term dependencies like structure or mood. Ignoring these hierarchical elements limits the expressive power of generative models for audio, as they do not fully capture the layered complexity of music. We believe that incorporating this hierarchical nature offers a more meaningful and sophisticated approach to controlling music generation.
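To make the notion of time-varying conditioning concrete, the sketch below shows, in a heavily simplified PyTorch-style form, how a frame-wise control signal (e.g., a melody or rhythm curve) can be injected into a diffusion denoiser alongside the noisy signal and the diffusion step. All names (Denoiser, training_step, the toy noise schedule) are illustrative assumptions and do not correspond to the implementation of [1], [2], or our model.

```python
# Minimal sketch of conditional diffusion training with a time-varying control.
# Everything here (architecture, schedule, channel counts) is a toy assumption.
import math
import torch
import torch.nn as nn

class Denoiser(nn.Module):
    """Toy 1D denoiser that receives a per-frame control signal."""
    def __init__(self, audio_channels=1, control_channels=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(audio_channels + control_channels + 1, hidden, 3, padding=1),
            nn.GELU(),
            nn.Conv1d(hidden, audio_channels, 3, padding=1),
        )

    def forward(self, x_t, t, control):
        # Broadcast the (normalised) diffusion step over time and stack it
        # with the noisy signal and the time-varying control.
        t_map = t.view(-1, 1, 1).expand(-1, 1, x_t.shape[-1])
        return self.net(torch.cat([x_t, control, t_map], dim=1))

def training_step(model, x0, control, n_steps=1000):
    """One DDPM-style noise-prediction step, conditioned on `control`."""
    t = torch.randint(0, n_steps, (x0.shape[0],))
    alpha_bar = torch.cos(0.5 * math.pi * t / n_steps) ** 2  # toy cosine schedule
    noise = torch.randn_like(x0)
    a = alpha_bar.view(-1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise
    pred = model(x_t, t.float() / n_steps, control)
    return ((pred - noise) ** 2).mean()
```

At inference, the same control channel can be varied freely over time, which is precisely the kind of manipulation that a static text prompt cannot express.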
This research aims to create a tool for music composition by enabling finer control over audio synthesis. We argue that current architectures for image manipulation, such as Hierarchical Diffusion Auto-Encoders (HDAE) [3], are suboptimal for audio applications. Consequently, we propose a novel hierarchical diffusion model that leverages a specific architecture and training scheme based on these considerations. This model incorporates multiple encoders to capture distinct levels of musical abstraction and employs a progressive training approach. Finally, we assess the effectiveness of our model within a custom evaluation framework covering multiple temporal ranges, through feature-manipulation and interpretability experiments.
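As a purely illustrative sketch of the kind of hierarchy we target (and not our actual architecture or training scheme), the following PyTorch-style code shows multiple encoders operating at different temporal resolutions, whose embeddings are fused into a single conditioning signal, with a flag that mimics progressively enabling coarser levels during training. Module names, window sizes, and the fusion strategy are assumptions made for exposition.

```python
# Illustrative multi-scale conditioning; not the proposed model itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleEncoder(nn.Module):
    """Encodes input features into one embedding per `window` frames."""
    def __init__(self, in_channels, dim, window):
        super().__init__()
        self.window = window
        self.proj = nn.Conv1d(in_channels, dim, kernel_size=1)

    def forward(self, feats):                      # feats: (B, C, T)
        z = self.proj(feats)                       # (B, dim, T)
        return F.avg_pool1d(z, self.window)        # (B, dim, T // window)

class HierarchicalConditioner(nn.Module):
    """Stacks encoders for short-, mid- and long-term musical information."""
    def __init__(self, in_channels=128, dim=64, windows=(1, 16, 256)):
        super().__init__()
        self.encoders = nn.ModuleList(
            [ScaleEncoder(in_channels, dim, w) for w in windows]
        )

    def forward(self, feats, active_levels):
        # Progressive training: only the first `active_levels` encoders are
        # used early on; coarser levels are switched on in later phases.
        conds = []
        for i, enc in enumerate(self.encoders):
            z = enc(feats)
            if i >= active_levels:
                z = torch.zeros_like(z)            # level not yet trained
            # Bring every level back to the frame rate before fusion.
            conds.append(F.interpolate(z, size=feats.shape[-1]))
        return torch.cat(conds, dim=1)             # (B, dim * n_levels, T)
```

In such a scheme, active_levels would start at 1 (frame-level only) and grow as training proceeds, encouraging coarser encoders to specialize in longer-range attributes such as structure or mood.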
Figure: Complete workflow of the proposed model.
[1] Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, and Jordi Pons. Long-form music generation with latent diffusion. arXiv preprint arXiv:2404.10301, 2024.
[2] Shih-Lun Wu, Chris Donahue, Shinji Watanabe, and Nicholas J. Bryan. Music ControlNet: Multiple time-varying controls for music generation, 2023.
[3] Zeyu Lu, Chengyue Wu, Xinyuan Chen, Yaohui Wang, Lei Bai, Yu Qiao, and Xihui Liu. Hierarchical diffusion autoencoders and disentangled image manipulation, 2023.