On Thursday, a pair of tech hobbyists released Riffusion, an AI model that generates music from text prompts by creating a visual representation of the sound and converting it to audio for playback. It uses a fine-tuned version of the Stable Diffusion 1.5 image synthesis model, which applies visual latent diffusion to sound processing in a new way.
Created as a hobby project by Seth Forsgren and Hayk Martiros, Riffusion works by generating sonograms (also known as spectrograms) that store audio in a two-dimensional image. In a sonogram, the X-axis represents time (the order in which the frequencies are played, from left to right), and the Y-axis represents the frequency of the sounds. Meanwhile, the color of each pixel in the image represents the amplitude of the sound at that frequency and moment in time.
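To make that layout concrete, here is a minimal sketch that builds such a grid from a pure tone using only the Python standard library. It is an illustration, not Riffusion's code: the sample rate, frame size, hop size, and the `spectrogram` helper are all assumptions chosen to keep the numbers easy to check.

```python
import cmath
import math

FRAME_SIZE = 64     # samples per analysis frame (hypothetical)
HOP = 32            # samples between successive frames (hypothetical)
SAMPLE_RATE = 8000  # Hz (hypothetical)
TONE_HZ = 1000      # a pure 1 kHz test tone

def spectrogram(samples, frame_size=FRAME_SIZE, hop=HOP):
    """Return magnitudes indexed as grid[freq_bin][frame].

    Rows are frequency (Y-axis), columns are time (X-axis), and each
    value plays the role of a pixel's brightness (amplitude).
    """
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # Hann window to reduce spectral leakage at the frame edges.
        windowed = [
            s * 0.5 * (1 - math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, s in enumerate(frame)
        ]
        # Naive DFT, keeping only the non-negative frequency bins.
        column = []
        for k in range(frame_size // 2 + 1):
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            column.append(abs(acc))
        frames.append(column)
    # Transpose so rows are frequency bins and columns are time frames.
    return [list(row) for row in zip(*frames)]

samples = [math.sin(2 * math.pi * TONE_HZ * n / SAMPLE_RATE) for n in range(512)]
grid = spectrogram(samples)

# Find the brightest row in the first time column and convert it to Hz.
first_column = [row[0] for row in grid]
peak_bin = max(range(len(first_column)), key=first_column.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / FRAME_SIZE
```

With these settings each row spans 125 Hz, so the brightest row lands on the bin for the 1 kHz tone. A production pipeline would use an FFT library rather than this slow direct transform.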
Since a sonogram is a type of image, Stable Diffusion can process it. Forsgren and Martiros fine-tuned Stable Diffusion on example sonograms paired with descriptions of the sounds or musical genres they represented. With that knowledge, Riffusion can generate new music on the fly based on text prompts that describe the type of music or sound you want to hear, such as “jazz,” “rock,” or even “typing on a keyboard.”
After the sonogram image has been generated, Riffusion uses Torchaudio to convert the sonogram to sound, playing it as audio.
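The reverse direction can be sketched in a simplified form. A magnitude image records how strong each frequency is but not its phase, so a real converter has to estimate phase before it can produce a waveform. The sketch below sidesteps that by assuming zero phase for every bin and additively summing cosines for a single column. It is a stand-in for illustration only, not the Torchaudio routine Riffusion uses, and the `column_to_audio` helper and its parameters are hypothetical.

```python
import math

def column_to_audio(magnitudes, frame_size, sample_rate, n_samples):
    """Additively resynthesize audio from one column of magnitudes.

    magnitudes[k] is the strength of frequency k * sample_rate / frame_size.
    Every bin is given zero phase, which is a deliberate simplification.
    """
    audio = []
    for n in range(n_samples):
        t = n / sample_rate
        value = sum(
            mag * math.cos(2 * math.pi * (k * sample_rate / frame_size) * t)
            for k, mag in enumerate(magnitudes)
        )
        audio.append(value)
    # Normalize to [-1, 1] so the result could be written out as PCM audio.
    peak = max(abs(v) for v in audio) or 1.0
    return [v / peak for v in audio]

# Hypothetical column: silent except bin 8, which is 1 kHz at 8 kHz / 64.
column = [0.0] * 33
column[8] = 1.0
audio = column_to_audio(column, frame_size=64, sample_rate=8000, n_samples=64)
```

Here the output is a 1 kHz cosine that repeats every eight samples. Stitching many columns together without audible clicks is exactly where phase estimation matters, which this sketch does not attempt.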
“This is the v1.5 stable diffusion model with no changes, just tuned to images of spectrograms paired with text,” the creators of Riffusion write on its explanation page. “It can generate infinite variations of a prompt by varying the seed. All the same web UIs and techniques like img2img, inpainting, negative prompts and interpolation work out of the box.”
Visitors to the Riffusion website can experiment with the AI model thanks to an interactive web app that generates interpolated sonograms (smoothly stitched together for uninterrupted playback) in real time while continuously visualizing the spectrogram on the left side of the page.
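The article does not describe how the app interpolates between clips. One common choice when blending diffusion model inputs is spherical linear interpolation (slerp), sketched below with plain Python lists. Treat it as an assumption about the general technique rather than a description of Riffusion's implementation; the two-element vectors are toy stand-ins for much larger latent tensors.

```python
import math

def slerp(a, b, t):
    """Spherical linear interpolation between vectors a and b at fraction t."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    # Clamp to guard against rounding slightly outside acos's domain.
    cos_omega = max(-1.0, min(1.0, cos_omega))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < 1e-6:
        # Nearly parallel or anti-parallel: fall back to linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    weight_a = math.sin((1 - t) * omega) / sin_omega
    weight_b = math.sin(t * omega) / sin_omega
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]

# Toy "latents" standing in for two generated clips.
start = [1.0, 0.0]
end = [0.0, 1.0]
steps = [slerp(start, end, i / 4) for i in range(5)]
```

Unlike a straight-line blend, each intermediate step keeps the same length as the endpoints when they share a norm, which is one reason this approach is often preferred for noise-like vectors.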
It can also merge styles. For example, typing in “smooth tropical dance jazz” brings in elements of different genres for a novel result, which encourages experimentation with blended styles.
Of course, Riffusion isn’t the first AI-powered music generator. Earlier this year, Harmonai released Dance Diffusion, an AI-powered generative music model. OpenAI’s Jukebox, announced in 2020, also generates new music using a neural network. And websites like Soundraw create music non-stop on the fly.
Compared to the more streamlined AI music efforts, Riffusion feels more like the hobby project it is. The music it generates ranges from interesting to incomprehensible, but it remains a remarkable application of latent diffusion technology that manipulates audio in a visual space.
The Riffusion model checkpoint and code are available on GitHub.