Google researchers have created an AI that can generate minutes-long musical pieces from text prompts, much as systems like DALL-E generate images from written descriptions. The company has released a number of samples created with the model, dubbed MusicLM, though you can’t experiment with it yourself. By casting conditional music generation as a hierarchical sequence-to-sequence modeling task, MusicLM produces music at 24 kHz that remains consistent over several minutes.
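To make the “hierarchical sequence-to-sequence” framing concrete, here is a toy Python sketch of the pipeline the paper describes: the text prompt is embedded (the paper uses a joint music/text model called MuLan), a first autoregressive stage produces coarse “semantic” tokens capturing long-term structure, a second stage produces fine-grained acoustic codec tokens, and a codec decoder (SoundStream in the paper) turns those into a 24 kHz waveform. The stages below are random stand-ins rather than real models, and every function name and token rate here is an illustrative assumption, not Google’s actual API.

```python
import random

SAMPLE_RATE = 24_000   # MusicLM outputs 24 kHz audio
SEMANTIC_RATE = 25     # assumed coarse tokens per second (illustrative)
ACOUSTIC_RATE = 75     # assumed fine tokens per second (illustrative)

def embed_text(prompt: str) -> int:
    """Stand-in for a joint music/text embedding model (MuLan in the paper)."""
    return hash(prompt) % 2**32

def semantic_stage(cond: int, seconds: float) -> list[int]:
    """Stage 1: map the prompt embedding to coarse 'semantic' tokens that
    capture melody and long-term structure (random stand-in here)."""
    rng = random.Random(cond)
    return [rng.randrange(1024) for _ in range(int(seconds * SEMANTIC_RATE))]

def acoustic_stage(cond: int, semantic: list[int], seconds: float) -> list[int]:
    """Stage 2: map (prompt, semantic tokens) to fine-grained acoustic codec
    tokens (random stand-in here)."""
    rng = random.Random(hash((cond, tuple(semantic))))
    return [rng.randrange(1024) for _ in range(int(seconds * ACOUSTIC_RATE))]

def decode(acoustic: list[int], seconds: float) -> list[float]:
    """Stand-in for a neural codec decoder (SoundStream in the paper) that
    turns acoustic tokens into waveform samples in [-1, 1)."""
    n = int(seconds * SAMPLE_RATE)
    return [acoustic[i * len(acoustic) // n] / 512.0 - 1.0 for i in range(n)]

prompt = "melodic techno with a driving bassline"
cond = embed_text(prompt)
semantic = semantic_stage(cond, seconds=5.0)
acoustic = acoustic_stage(cond, semantic, seconds=5.0)
waveform = decode(acoustic, seconds=5.0)
print(f"{len(waveform)} samples at {SAMPLE_RATE} Hz")
```

The appeal of the hierarchy is that the coarse stage only has to stay coherent over a short token sequence spanning minutes, while the fine stage fills in audio detail, which is how the model keeps long pieces consistent.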
According to the research paper, MusicLM was trained on a dataset of 280,000 hours of music so that it could create songs that make sense for complex descriptions. The samples range from 30-second clips that sound like complete songs, generated from paragraph-long descriptions specifying a genre, atmosphere, and even particular instruments, to five-minute pieces generated from just one or two words, such as “melodic techno”. The model can also build on pre-existing melodies, whether they are whistled, hummed, sung, or played on an instrument, and it can turn a series of consecutively written descriptions into a musical story or narrative.
The AI can even simulate human vocals, though the quality seems a bit grainy. The lyrics also make no sense: they sound as if they were sung in a non-existent language. Maybe it’s just Simlish, the fictional language used in The Sims and other Electronic Arts games.
Unfortunately, Google is unlikely to release MusicLM to the public. The most obvious reason is the risk of reproducing copyrighted music: when Google tested the system, the researchers found that about 1% of the music it generated was directly replicated from the songs it was trained on. That may not sound like much, but the company decided to be more cautious with this project than some of its competitors that have developed similar tools.
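The article doesn’t spell out how Google measured that overlap, but since MusicLM works on discrete audio tokens, a check of this kind can be sketched as searching for long token runs shared between generated clips and the training data. Everything below, including the window length, function names, and the flagging rule, is a hypothetical illustration, not Google’s actual methodology.

```python
def ngrams(tokens: list[int], n: int) -> set[tuple[int, ...]]:
    """All length-n windows of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def audit_memorization(generated_clips: list[list[int]],
                       training_tokens: list[int],
                       window: int = 125) -> float:
    """Fraction of generated clips that share at least one `window`-token
    run (a few seconds of audio at typical token rates) with the training
    data. The window length is an illustrative assumption."""
    train_windows = ngrams(training_tokens, window)
    flagged = sum(1 for clip in generated_clips
                  if ngrams(clip, window) & train_windows)
    return flagged / max(len(generated_clips), 1)

# Toy demo: 99 original clips plus one copied verbatim from the training data.
train = list(range(10_000))
clips = [[i * 7 % 997 for i in range(500)] for _ in range(99)]
clips.append(train[2_000:2_500])
print(f"{audit_memorization(clips, train):.0%} of clips flagged")  # 1%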