Lyria 3 - Google’s latest AI Music Generator

AIMusic

Feb 19

Google just released Lyria 3, its latest music generation model, and it’s impressive. The model generates 30-second clips of original music in a wide range of styles and delivers each one as an MP3 audio file plus an MP4 video with AI-generated artwork.

Prompts are flexible: you can provide a genre, tempo, lyrics, instruments, vocal style, or even a photo, and the clip is ready within moments. A prompt can be as simple as selecting a predefined style and adding a short description like “adventure game 8-bit style with explosions and fun.” More complex prompts can specify lyrics, toggle vocals on or off, or choose between male and female voices.
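To keep the prompt options above organized while experimenting, here is a minimal sketch of a helper that collects them and flattens them into one text prompt. Everything in it is my own assumption: `ClipPrompt`, its field names, and the output format are hypothetical and are not part of any Lyria or Google API. It only builds a string you could paste into the prompt box.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ClipPrompt:
    """Hypothetical container for the prompt options; not a Lyria API."""
    description: str
    genre: Optional[str] = None
    tempo_bpm: Optional[int] = None
    instruments: List[str] = field(default_factory=list)
    vocals: Optional[str] = None  # e.g. "male", "female", or "none"
    lyrics: Optional[str] = None

    def to_text(self) -> str:
        """Flatten the populated fields into one plain-text prompt string."""
        parts = [self.description]
        if self.genre:
            parts.append(f"Genre: {self.genre}")
        if self.tempo_bpm:
            parts.append(f"Tempo: {self.tempo_bpm} BPM")
        if self.instruments:
            parts.append("Instruments: " + ", ".join(self.instruments))
        if self.vocals:
            parts.append(f"Vocals: {self.vocals}")
        if self.lyrics:
            parts.append(f"Lyrics: {self.lyrics}")
        return ". ".join(parts)


# The simple case: a short description only.
simple = ClipPrompt(description="adventure game 8-bit style with explosions and fun")

# A more detailed case with made-up values for illustration.
detailed = ClipPrompt(
    description="upbeat road-trip anthem",
    genre="rock",
    tempo_bpm=120,
    instruments=["electric guitar", "drums"],
    vocals="female",
)

print(simple.to_text())
print(detailed.to_text())
```

Keeping the fields separate makes it easy to change one thing at a time (say, only the vocal style) and compare how the generated clips differ.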

It isn’t perfect yet. Sometimes it goes off the rails and ignores instructions, and the 30-second length limit is currently a dealbreaker for releasing full songs. However, I’m sure both will improve in future versions. You can generate samples across many genres, including the ones Google suggests. I tried rock, rap, reggae, “cinematic”, “emo”, and “8-bit”.

Below are a few samples I generated.

The Technical Architecture of Lyria 3

Lyria 3 moves generative audio from simple loop-based generation to a cross-modal latent diffusion architecture. Unlike earlier models that required manual lyric inputs, Lyria 3 uses a joint embedding space in which text, visual (image/video), and audio data are mapped to shared representations. This allows the model to "understand" the emotional context of a sunset video or a text prompt and translate it into a coherent musical structure (intro, verse, chorus).
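To make the idea of a joint embedding space more concrete, here is a toy sketch. It is not Lyria 3's implementation: the three-dimensional vectors are made up by hand to stand in for what separate text, image, and audio encoders might output, and `cosine_similarity` and `closest` are illustrative helper names. The point is only that once every modality lives in the same vector space, "does this audio match this sunset photo?" becomes a distance comparison.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings (assumption): real encoders would produce vectors
# with hundreds or thousands of dimensions.
text_prompt = [0.9, 0.1, 0.2]       # e.g. the text "calm sunset"
sunset_image = [0.8, 0.2, 0.1]      # e.g. a photo of a sunset
candidates = {
    "calm_audio": [0.85, 0.15, 0.2],
    "aggressive_audio": [0.1, 0.9, 0.3],
}


def closest(query):
    """Return the name of the candidate audio embedding nearest to the query."""
    return max(candidates, key=lambda name: cosine_similarity(query, candidates[name]))


# Both the text and the image query land nearest the same audio embedding.
print(closest(text_prompt))
print(closest(sunset_image))
```

In a generative model the shared representation would condition the audio generator rather than pick from a fixed list, but the comparison above captures why a photo and a text prompt can steer the music in the same way.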

The model outputs high-fidelity audio at 48 kHz (24-bit PCM) and employs chunk-based autoregression for real-time control. It generates audio in 2-second segments while maintaining "long-range coherence," so the rhythm and melody remain consistent across the 30-second duration. For safety, Google integrates SynthID directly into the audio waveform at the point of generation, creating a robust, machine-readable signal that survives compression and editing without affecting the listener's experience. Additionally, it works in tandem with the Nano Banana image model to simultaneously generate matching cover art for every track.
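The figures above imply some simple arithmetic worth spelling out. The sketch below uses only the numbers quoted in this post (48 kHz, 24-bit, 2-second chunks, 30-second clips); the stereo channel count is my own assumption, since the channel layout isn't stated.

```python
SAMPLE_RATE_HZ = 48_000   # from the post
BIT_DEPTH = 24            # from the post
CLIP_SECONDS = 30         # from the post
CHUNK_SECONDS = 2         # from the post
CHANNELS = 2              # assumption: stereo output

samples_per_chunk = SAMPLE_RATE_HZ * CHUNK_SECONDS   # samples per channel in one chunk
chunks_per_clip = CLIP_SECONDS // CHUNK_SECONDS      # segments generated per clip
samples_per_clip = SAMPLE_RATE_HZ * CLIP_SECONDS     # samples per channel in a full clip
bytes_per_sample = BIT_DEPTH // 8
raw_pcm_bytes = samples_per_clip * bytes_per_sample * CHANNELS

print(f"{samples_per_chunk:,} samples per chunk")        # 96,000
print(f"{chunks_per_clip} chunks per clip")              # 15
print(f"{samples_per_clip:,} samples per clip")          # 1,440,000
print(f"{raw_pcm_bytes / 1_000_000:.2f} MB raw PCM")     # 8.64
```

So each 30-second clip is 15 chunks of 96,000 samples per channel, and the model has to keep rhythm and melody consistent across all 15. Uncompressed, a stereo clip would be about 8.64 MB, which is why the download arrives as a much smaller MP3.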