Riffusion’s AI generates music from text using visual sonograms

An AI-generated image of musical notes exploding from a computer monitor.

On Thursday, a pair of tech hobbyists released Riffusion, an AI model that generates music from text prompts by creating a visual representation of the sound and converting it to audio for playback. It uses a fine-tuned version of the Stable Diffusion 1.5 image synthesis model, which applies visual latent diffusion to sound processing in a new way.

Created as a hobby project by Seth Forsgren and Hayk Martiros, Riffusion works by generating sonograms that store audio in a two-dimensional image. In a sonogram, the X-axis represents time (the order in which the frequencies are played, from left to right), and the Y-axis represents the frequency of the sounds. Meanwhile, the color of each pixel in the image represents the amplitude of the sound at that particular moment in time.
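To make that layout concrete, here is a minimal sketch (not Riffusion's published code) of how an audio clip could be turned into such an image with torchaudio; the file name and spectrogram parameters are illustrative assumptions.

```python
import torchaudio

waveform, sample_rate = torchaudio.load("clip.wav")  # hypothetical input file

to_spec = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=2048,      # short-time Fourier transform window size
    hop_length=512,  # step between windows; sets the horizontal (time) resolution
    n_mels=512,      # number of frequency bins; sets the vertical resolution
)
spec = to_spec(waveform.mean(dim=0))  # mono; shape (n_mels, time_frames)

# Compress the dynamic range and scale to 0-255 so the result can be stored as an
# image: x position = time, y position = frequency, pixel value = amplitude.
img = torchaudio.transforms.AmplitudeToDB(top_db=80)(spec)
img = (img - img.min()) / (img.max() - img.min()) * 255.0
print(img.shape)  # torch.Size([512, time_frames])
```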

Since a sonogram is a type of image, Stable Diffusion can process it. Forsgren and Martiros trained a custom Stable Diffusion model with example sonograms associated with descriptions of the sounds or musical genres they represented. With that knowledge, Riffusion can generate new music on the fly based on text prompts that describe the type of music or sound you want to hear, such as “jazz,” “rock,” or even typing on a keyboard.
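In practice, generating one of these spectrogram images from a text prompt can be done with a standard Stable Diffusion pipeline pointed at the fine-tuned weights. The sketch below uses Hugging Face's diffusers library; the checkpoint name riffusion/riffusion-model-v1 and the sampling settings are assumptions for illustration, not details taken from the article.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "riffusion/riffusion-model-v1",   # assumed checkpoint name
    torch_dtype=torch.float16,
).to("cuda")

# The prompt describes the music; the output is a 512x512 spectrogram image, not audio.
image = pipe("jazz", num_inference_steps=50).images[0]
image.save("jazz_spectrogram.png")
```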

After the sonogram image has been generated, Riffusion uses Torchaudio to convert the sonogram to sound, playing it as audio.
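The article doesn't spell out the reconstruction steps, but a common way to recover audio from a magnitude spectrogram with torchaudio is the Griffin-Lim algorithm, which estimates the missing phase. The sketch below is a rough approximation of that image-to-audio step; the scaling constants and parameters are assumptions, not Riffusion's published values.

```python
import numpy as np
import torch
import torchaudio
from PIL import Image

# Load the generated spectrogram image as a grayscale array (values 0-255).
img = np.array(Image.open("jazz_spectrogram.png").convert("L"), dtype=np.float32)

# Undo the 0-255 / decibel scaling used when the image was drawn.
# The 80 dB range is an assumption, not Riffusion's published value.
spec = torch.from_numpy(10.0 ** ((img / 255.0 * 80.0 - 80.0) / 10.0))

# Map mel bins back to linear frequency, then estimate phase with Griffin-Lim.
inverse_mel = torchaudio.transforms.InverseMelScale(
    n_stft=1025, n_mels=spec.shape[0], sample_rate=44100
)
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=2048, hop_length=512)
waveform = griffin_lim(inverse_mel(spec))

torchaudio.save("jazz.wav", waveform.unsqueeze(0), 44100)
```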

A sonogram represents time, frequency, and amplitude in a two-dimensional image.

“This is the v1.5 stable diffusion model with no changes, just tuned to images of spectrograms paired with text,” the creators of Riffusion write on its explanation page. “It can generate infinite variations of a prompt by varying the seed. All the same web UIs and techniques like img2img, inpainting, negative prompts and interpolation work out of the box.”

Visitors to the Riffusion website can experiment with the AI model thanks to an interactive web app that generates interpolated sonograms (smoothly stitched together for uninterrupted playback) in real time while continuously visualizing the spectrogram on the left side of the page.
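One way to get that kind of smooth transition, sketched roughly below, is to spherically interpolate between the initial latent noise tensors of two generations and render a spectrogram for each intermediate point; the slerp helper here is illustrative and not Riffusion's actual implementation.

```python
import torch

def slerp(t: float, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Spherical linear interpolation between two latent tensors."""
    a_flat, b_flat = a.flatten(), b.flatten()
    cos_omega = torch.dot(a_flat, b_flat) / (a_flat.norm() * b_flat.norm())
    omega = torch.acos(torch.clamp(cos_omega, -1.0, 1.0))
    return (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)

# Two different seeds give two different "riffs"; interpolation morphs between them.
latent_a = torch.randn(1, 4, 64, 64, generator=torch.Generator().manual_seed(0))
latent_b = torch.randn(1, 4, 64, 64, generator=torch.Generator().manual_seed(1))

# Each interpolated latent would be passed to the diffusion pipeline to render one
# spectrogram frame; decoding the frames in sequence yields a smooth transition.
frames = [slerp(t, latent_a, latent_b) for t in torch.linspace(0.0, 1.0, steps=8)]
```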

A screenshot of the Riffusion website, which allows you to type in queries and hear the resulting sonograms.

It can also merge styles. Typing in "smooth tropical dance jazz," for example, brings in elements of different genres for a novel result, encouraging experimentation with mixed styles.

Of course, Riffusion isn’t the first AI-powered music generator. Earlier this year, Harmonai released Dance Diffusion, an AI-powered generative music model. OpenAI’s Jukebox, announced in 2020, also generates new music using a neural network. And websites like Soundraw create music non-stop on the fly.

Compared to the more streamlined AI music efforts, Riffusion feels more like the hobby project it is. The music it generates ranges from interesting to incomprehensible, but it remains a remarkable application of latent diffusion technology that manipulates audio in a visual space.

The Riffusion model checkpoint and code are available on GitHub.
