The instrumental background music in the video below is AI-generated using Lyria 3 by Google DeepMind. Lyria 3 lets anyone generate AI music from text and image prompts. The music demos in this article take it further by adding another input prompt modality: your voice. Let’s proceed to generate your first music with Lyria 3 via the Gemini API.
Requirements and Environment Setup
To quickly generate music with Lyria 3, you can try it in its playground. The Gemini API docs also provide code examples you can run to generate your music. The music demos in this article require the following tech stack.
- Vision Agents: To build and orchestrate the agentic system. It provides a network transport for low-latency conversational audio and allows the Gemini Live and Lyria 3 APIs to be used as Python plugins. Log in or register for a Stream dashboard account and use your API key and secret to run the sample codes in this article.
- The lyria-3-clip-preview model: For generating short clips (mp3), loops, and previews at 30 seconds.
- The lyria-3-pro-preview model: For generating full-length songs (mp3) with verses, choruses, and bridges.
- Lyria 3 Python plugin for Vision Agents.
- Gemini Live API: To access a Gemini audio generation model for realtime agentic voice output. Go to Google AI Studio and generate a new API key.
- Twilio: To provide a telephony service and allow the voice agent to be called with an Android or iOS device’s native phone call UI. You should purchase a Twilio phone number to run and test this demo.
- NGROK: To convert the URL of your localhost to a public URL for inbound and outbound phone calling with a Twilio phone number.
Music Generation With Gemini and Lyria 3
Generating AI music with Lyria 3 can be done using two models, depending on your use case. You can create a short-form clip of 30 seconds or a full-length song of about 3-4 minutes. The models are available in the Gemini API. Both support multimodal prompts, including text, images, and voice. Once you send a prompt, the Gemini API forwards the request to the Lyria 3 model for music generation.
Gemini API: How To Generate Music From Text Prompts
As shown in the diagram in the previous section, you can use three different modalities (text, image, voice) to generate AI music with Lyria 3 via the Gemini API. Depending on what you want to use the generated music for, it can be a short clip or a full-length track.
How To Generate a Short-Form Music Clip
A 30-second (short) music clip or loop works well as a soundtrack or background music for videos that need to grab people’s attention. Generating one via the Gemini API requires installing the Google GenAI SDK by running this command.
uv add google-genai
If you already have the Google GenAI SDK installed in your uv project, ensure you upgrade it.
uv add -U google-genai
We can now generate a 30-second cheerful acoustic folk song with guitar and harmonica with this Python script.
"""Generate a short (30-second) music clip with Lyria 3 Clip.

Docs: https://ai.google.dev/gemini-api/docs/music-generation
"""

import os

from dotenv import load_dotenv
from google import genai

load_dotenv()

client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

response = client.models.generate_content(
    model="lyria-3-clip-preview",
    contents=(
        "Create a 30-second cheerful acoustic folk song with "
        "guitar and harmonica."
    ),
)

for part in response.parts:
    if part.text is not None:
        print(part.text)
    elif part.inline_data is not None:
        with open("clip.mp3", "wb") as f:
            f.write(part.inline_data.data)
        print("Audio saved to clip.mp3")
In this sample code, once the music is generated, we save it to clip.mp3. To run this basic example, you should set your Gemini API key in the project’s .env.
GEMINI_API_KEY=...
When you run the script above, you should hear generated music similar to this.
How To Generate a Full-Length Song
Lyria 3 Pro understands musical structures, such as verses, choruses, and bridges. You can use it to produce songs that last a couple of minutes. The model also lets you control the duration via the prompt, for example by asking for a 2-minute song or by using timestamps to define the music structure.
import os

from dotenv import load_dotenv
from google import genai

load_dotenv()

client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

response = client.models.generate_content(
    model="lyria-3-pro-preview",
    contents=(
        "An epic cinematic orchestral piece about a journey home. "
        "Starts with a solo piano intro, builds through sweeping "
        "strings, and climaxes with a massive wall of sound."
    ),
)

lyrics: list[str] = []
audio_data: bytes | None = None

for part in response.parts:
    if part.text is not None:
        lyrics.append(part.text)
    elif part.inline_data is not None:
        audio_data = part.inline_data.data

if lyrics:
    print("Lyrics / structure:\n" + "\n".join(lyrics))

if audio_data:
    with open("full_length_song.mp3", "wb") as f:
        f.write(audio_data)
    print("Audio saved to full_length_song.mp3")
Without specifying a custom duration, running this sample code will generate a track of about 3 minutes. To steer the length and structure instead, you can extend the prompt as in the sketch below.
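For example, here is an illustrative way to ask Lyria 3 Pro for a shorter song with a rough structure, reusing the client from the script above. The timestamp phrasing is only an assumption for illustration; check the Lyria 3 docs for the supported prompt patterns.

# Illustrative prompt only: steer duration and structure through the text prompt.
response = client.models.generate_content(
    model="lyria-3-pro-preview",
    contents=(
        "Create a 2-minute upbeat indie folk song. "
        "0:00-0:20 soft piano intro, 0:20-1:00 first verse with drums, "
        "1:00-1:40 big chorus, 1:40-2:00 stripped-back outro."
    ),
)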
Gemini API: How To Generate Music From Input Images
Using Lyria 3 Pro, you can generate short and full-length music from a reference image. Once you supply an input image, the model uses its vision capabilities to analyze the scene and mood of the image and create a song out of it. To test this example, create a new Python file music_from_image.py and fill it with this code. In the same directory where your music_from_image.py is located, add an image of your choice, for example, birds_landscape.jpeg.
import os

from dotenv import load_dotenv
from google import genai
from PIL import Image

load_dotenv()

client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

IMAGE_PATH = "birds_landscape.jpeg"
image = Image.open(IMAGE_PATH)

response = client.models.generate_content(
    model="lyria-3-pro-preview",
    contents=[
        (
            "An atmospheric ambient track inspired by the birds and landscape mood and "
            "colors in this image."
        ),
        image,
    ],
)

lyrics: list[str] = []
audio_data: bytes | None = None

for part in response.parts:
    if part.text is not None:
        lyrics.append(part.text)
    elif part.inline_data is not None:
        audio_data = part.inline_data.data

if lyrics:
    print("Lyrics / structure:\n" + "\n".join(lyrics))

if audio_data:
    with open("music_from_image.mp3", "wb") as f:
        f.write(audio_data)
    print("Audio saved to music_from_image.mp3")
To generate the music with this Python script, you should specify a path to the reference image and open it. You should also specify where you want to save the output song.
IMAGE_PATH = "birds_landscape.jpeg"

image = Image.open(IMAGE_PATH)
The example here uses this prompt.
An atmospheric ambient track inspired by the birds and landscape mood and colors in this image
Gemini Live API and Vision Agents: Generate Music Via a Video Call
In the previous sections, we generated AI music with Lyria 3 via the Gemini API. This section will show you how to generate music with your voice by interacting with an agent during a video call in Vision Agents. Vision Agents helps developers bring voice, video, and vision to AI applications. You can, for example, use it to build a group video-calling and live-meeting service with AI integration for meeting notes, summaries, and more.
Start With a New Python Project
Begin with a uv-based Python project, install Vision Agents, Gemini, and Lyria 3 plugins, and configure your environment variables.
# New project with uv
uv init

# Install Vision Agents and the Gemini plugin
uv add vision-agents
uv add "vision-agents[getstream, gemini]"

# In your .env
STREAM_API_KEY=...
STREAM_API_SECRET=...
EXAMPLE_BASE_URL=https://demo.visionagents.ai
GOOGLE_API_KEY=...
Running uv add "vision-agents[getstream, gemini]" installs getstream as the default audio and video transport, along with the gemini plugin for Vision Agents. What is still missing is the Lyria 3 model. The gemini (Gemini Live API) plugin endpoint differs from that of Lyria 3: the Gemini Live (audio generation) models are supported in Vision Agents by default, but the Lyria 3 models are not.
To bring the Lyria 3 plugin into your Python project, clone its repo to the root of the project and run uv sync to install all the Lyria dependencies. Alternatively, you can add the Lyria 3 music generation models to Vision Agents by creating Lyria 3 as a custom plugin.
Let’s now create a Python script in Vision Agents for the music generation AI.
"""
Environment variables needed:
- GOOGLE_API_KEY (for Gemini Realtime + Lyria RealTime)
- STREAM_API_KEY and STREAM_API_SECRET (for Stream Video)
"""

import asyncio

from vision_agents.core import Agent, AgentLauncher, Runner, User
from vision_agents.plugins import gemini, getstream
from vision_agents.plugins.lyria import MusicProcessor


async def create_agent(**kwargs) -> Agent:
    processor = MusicProcessor(
        initial_prompt="Ambient chill music",
        bpm=90,
        density=0.5,
        brightness=0.5,
        duration_seconds=30,
    )

    llm = gemini.Realtime(fps=3)

    @llm.register_function(
        description="Generate a 30-second music track based on the user's description. "
        "Use descriptive terms like genre, instruments, mood, and tempo. "
        "Returns immediately while music generates in the background."
    )
    async def generate_music(prompt: str) -> str:
        await processor.generate_music_async(prompt=prompt)
        return (
            f"Music generation started for: {prompt}. It will be saved to the "
            "generated_music folder when complete (~30 seconds)."
        )

    @llm.register_function(
        description="Change the music style/genre for future generations."
    )
    async def change_music_style(prompt: str) -> str:
        await processor.update_prompt(prompt)
        return f"Music style changed to: {prompt}"

    @llm.register_function(
        description="Adjust the tempo (BPM) of the music. Range: 40-180."
    )
    async def set_tempo(bpm: int) -> str:
        await processor.set_config(bpm=bpm)
        return f"Tempo set to {bpm} BPM"

    return Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Music Generator", id="lyria-music-agent"),
        instructions=(
            "You are a music-generating AI assistant powered by Google's Lyria 3. "
            "When users describe the kind of music they want, use the generate_music "
            "function to create a 30-second instrumental track. You can also adjust "
            "the tempo and style. Keep your responses friendly and musical. "
            "Describe what you're generating before starting."
        ),
        llm=llm,
        processors=[processor],
    )


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    call = await agent.create_call(call_type, call_id)
    async with agent.join(call):
        await agent.finish()


Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
The Lyria 3 music generator agent, powered by Vision Agents, creates 30-second instrumental music tracks from voice prompts during video calls. This example uses Gemini Live for native speech-to-speech. When you run the above sample code, you should be able to have live video conversations with the AI agent to generate any music you could imagine.
Gemini Live API and Vision Agents: Generate Music Via a Phone Call
With the simple Python script in the previous section, we were able to create music in Vision Agents via a video call using Lyria 3 and the Gemini Live API. Is that also possible with actual phone calls? Let’s find out in this section.
Similar to how we generated music with Lyria 3 in Vision Agents in the previous section, we use the same setup here, with some additional installation and configuration.
# New project with uv
uv init

# Install Vision Agents and the Gemini plugin
uv add vision-agents
uv add "vision-agents[getstream, gemini]"

# In your .env
STREAM_API_KEY=...
STREAM_API_SECRET=...
NGROK_URL=https://90c0-176-72-38-94.ngrok-free.app
GOOGLE_API_KEY=...
TWILIO_ACCOUNT_SID=...
TWILIO_AUTH_TOKEN=...

# Run
NGROK_URL=https://90c0-176-72-38-94.ngrok-free.app uv run music_gen_via_phone_call.py --from +19810902211 --to +328458519934
The following are the additional requirements on top of the previous section’s setup.
Set NGROK URL
In the previous section, we used EXAMPLE_BASE_URL=https://demo.visionagents.ai to launch Stream Video and interact with the voice agent after starting the Python script. This is not needed for phone calling. Instead, we create an agent, run it on localhost, and use NGROK to expose it to the public. Install NGROK with this command.
brew install ngrok
After installing NGROK with Brew, launch the Terminal and run this.
ngrok http 8000
That will generate a public NGROK URL similar to the one highlighted in this image.
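If you prefer not to copy the URL by hand, NGROK also serves a local inspection API (http://127.0.0.1:4040 by default) that lists the active tunnels. A small sketch, assuming the default inspection port:

import json
import urllib.request

# Ask ngrok's local inspection API for the active tunnels (default port 4040).
with urllib.request.urlopen("http://127.0.0.1:4040/api/tunnels") as resp:
    tunnels = json.load(resp)["tunnels"]

# Use the HTTPS tunnel's public URL as NGROK_URL.
public_url = next(t["public_url"] for t in tunnels if t["public_url"].startswith("https"))
print(public_url)  # e.g. https://90c0-176-72-38-94.ngrok-free.app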
Obtain a Twilio Phone Number
The public NGROK URL you just generated is required to run the agent via phone, using an active Twilio number. If you do not have one yet, buy a new Twilio phone number, go to your dashboard, and set the Call comes in webhook URL to the generated NGROK URL. For our demo, it's https://90c0-176-72-38-94.ngrok-free.app but yours may be different.
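If you would rather script this step than click through the dashboard, the Twilio Python SDK can point the number’s voice webhook at your NGROK URL. This is a minimal sketch; the phone number and the /twilio/inbound path are placeholders you would adapt to your own setup.

import os

from twilio.rest import Client

client = Client(os.environ["TWILIO_ACCOUNT_SID"], os.environ["TWILIO_AUTH_TOKEN"])

ngrok_url = "https://90c0-176-72-38-94.ngrok-free.app"  # your current NGROK URL

# Look up the purchased number and update its "A call comes in" webhook.
numbers = client.incoming_phone_numbers.list(phone_number="+19810902211", limit=1)
if numbers:
    numbers[0].update(voice_url=f"{ngrok_url}/twilio/inbound", voice_method="POST")
    print(f"Webhook set for {numbers[0].phone_number}")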
Configure the Phone Call Agent
import asyncio
import logging
import os
import uuid

import click
import uvicorn
from dotenv import load_dotenv
from fastapi import FastAPI, WebSocket
from twilio.rest import Client

from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, getstream, twilio
from plugins.lyria.vision_agents.plugins.lyria import MusicProcessor

load_dotenv()

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

NGROK_URL = os.environ["NGROK_URL"].replace("https://", "").replace("http://", "").rstrip("/")

app = FastAPI()
call_registry = twilio.TwilioCallRegistry()


async def create_agent() -> Agent:
    processor = MusicProcessor(
        initial_prompt="Ambient chill music",
        bpm=90,
        density=0.5,
        brightness=0.5,
        duration_seconds=30,
    )

    llm = gemini.Realtime()

    @llm.register_function(
        description="Generate a 30-second instrumental music track. "
        "Accepts a voice prompt describing the desired genre, instruments, mood, or style. "
        "Returns immediately while music generates in the background."
    )
    async def generate_music(prompt: str) -> str:
        await processor.generate_music_async(prompt=prompt)
        return (
            f"Music generation started for: {prompt}. "
            "The track will be ready in about 30 seconds and will play automatically."
        )

    @llm.register_function(
        description="Change the music style for the next generation."
    )
    async def change_music_style(prompt: str) -> str:
        await processor.update_prompt(prompt)
        return f"Music style changed to: {prompt}"

    @llm.register_function(
        description="Set the tempo (beats per minute) for music generation. Range: 40-180."
    )
    async def set_tempo(bpm: int) -> str:
        await processor.set_config(bpm=bpm)
        return f"Tempo set to {bpm} BPM"

    @llm.register_function(
        description="Blend two music styles with weights (0.0-1.0). "
        "Example: style1='Jazz', weight1=0.7, style2='Electronic', weight2=0.3"
    )
    async def blend_styles(
        style1: str, weight1: float, style2: str, weight2: float
    ) -> str:
        prompts = [
            {"text": style1, "weight": weight1},
            {"text": style2, "weight": weight2},
        ]
        await processor.set_weighted_prompts(prompts)
        return f"Blending styles: {style1}:{weight1}, {style2}:{weight2}"

    return Agent(
        edge=getstream.Edge(),
        agent_user=User(id="ai-agent", name="Lyria Music Agent"),
        instructions=(
            "You are a music-generating AI assistant on a phone call, powered by "
            "Google's Lyria 3. When the user describes the kind of music they want, "
            "use the generate_music function to create a 30-second instrumental track. "
            "You can also adjust the tempo with set_tempo, change the style with "
            "change_music_style, or blend two styles with blend_styles. "
            "Start by greeting the caller and asking what kind of music they'd like. "
            "Keep your responses concise and friendly — this is a phone call."
        ),
        llm=llm,
        processors=[processor],
    )


async def initiate_outbound_call(from_number: str, to_number: str) -> str:
    """Initiate an outbound call via Twilio. Returns the call_id."""
    twilio_client = Client(
        os.environ["TWILIO_ACCOUNT_SID"], os.environ["TWILIO_AUTH_TOKEN"]
    )
    call_id = str(uuid.uuid4())

    async def prepare_call():
        agent = await create_agent()
        phone_user = User(name=f"Outbound call {call_id[:8]}", id=f"phone-{call_id}")
        await agent.edge.create_users([agent.agent_user, phone_user])
        agent.edge.agent_user_id = agent.agent_user.id
        stream_call = await agent.create_call("default", call_id)
        logger.info("prepared the call, ready to start")
        return agent, phone_user, stream_call

    twilio_call = call_registry.create(call_id, prepare=prepare_call)
    url = f"wss://{NGROK_URL}/twilio/media/{call_id}/{twilio_call.token}"
    logger.info(
        f"Forwarding to media url: {url} \n %s", twilio.create_media_stream_twiml(url)
    )

    twilio_client.calls.create(
        twiml=twilio.create_media_stream_twiml(url),
        to=to_number,
        from_=from_number,
    )
    logger.info(f"📞 Initiated call {call_id} from {from_number} to {to_number}")
    return call_id


@app.websocket("/twilio/media/{call_sid}/{token}")
async def media_stream(websocket: WebSocket, call_sid: str, token: str):
    twilio_call = call_registry.validate(call_sid, token)
    logger.info(f"🔗 Media stream connected for call {call_sid}")

    twilio_stream = twilio.TwilioMediaStream(websocket)
    await twilio_stream.accept()
    twilio_call.twilio_stream = twilio_stream

    try:
        (
            agent,
            phone_user,
            stream_call,
        ) = await twilio_call.await_prepare()
        twilio_call.stream_call = stream_call

        await twilio.attach_phone_to_call(stream_call, twilio_stream, phone_user.id)

        async with agent.join(stream_call, participant_wait_timeout=0):
            await agent.llm.simple_response(
                text="Greet the caller and ask what kind of music they'd like you to generate."
            )
            await twilio_stream.run()
    finally:
        call_registry.remove(call_sid)


async def run_with_server(from_number: str, to_number: str):
    """Start the server and initiate the outbound call once ready."""
    config = uvicorn.Config(app, host="localhost", port=8000, log_level="info")
    server = uvicorn.Server(config)
    server_task = asyncio.create_task(server.serve())

    while not server.started:
        await asyncio.sleep(0.1)

    logger.info("🚀 Server ready, initiating outbound call...")
    await initiate_outbound_call(from_number, to_number)
    await server_task


@click.command()
@click.option(
    "--from",
    "from_number",
    required=True,
    help="The Twilio phone number to call from (must be active in your Twilio account)",
)
@click.option("--to", "to_number", required=True, help="The phone number to call")
def main(from_number: str, to_number: str):
    logger.info(
        "Starting Lyria music generator outbound call. "
        "Note: latency is higher in dev. Deploy to US east for low latency."
    )
    asyncio.run(run_with_server(from_number, to_number))


if __name__ == "__main__":
    main()
Here is how it works. Using the above sample code, we create a Twilio-powered outbound phone-call agent that initiates a real Public Switched Telephone Network (PSTN) call, bridges the audio into a Stream call, and lets you generate Lyria 3-driven music by talking to a Gemini Live audio model in realtime.
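Under the hood, twilio.create_media_stream_twiml(url) returns TwiML that tells Twilio to answer the call and open a bidirectional media stream to our WebSocket endpoint. Conceptually it amounts to something like this sketch built with the Twilio helper library (the exact TwiML the plugin emits may differ):

from twilio.twiml.voice_response import Connect, VoiceResponse

# Roughly what the media-stream TwiML looks like: bridge the call audio
# to the /twilio/media WebSocket over wss://.
response = VoiceResponse()
connect = Connect()
connect.stream(url="wss://<your-ngrok-host>/twilio/media/<call_id>/<token>")
response.append(connect)
print(str(response))  # <Response><Connect><Stream url="wss://..." /></Connect></Response>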
To run this example in your Terminal, you should first execute the command ngrok http 8000 and keep the NGROK server running. Copy the app URL and run the following in the Terminal.
NGROK_URL=https://90c0-176-72-38-94.ngrok-free.app uv run music_gen_via_phone_call.py --from +19810902211 --to +328458519934
From the above:
- https://90c0-176-72-38-94.ngrok-free.app is your currently running NGROK app’s URL.
- --from +19810902211 is your purchased Twilio phone number.
- --to +328458519934 is the phone number you want to call.
Note: For an inbound phone call (call initiation from your phone to a Twilio number), the positions of the two phone numbers above must be switched.
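For the inbound case, you also need an HTTP route at the webhook URL you configured on the Twilio dashboard that answers with the same media-stream TwiML. Here is a hypothetical sketch that reuses app, call_registry, create_agent, NGROK_URL, and the media WebSocket from the script above; the /twilio/inbound path is an assumption and must match your Twilio webhook.

import uuid

from fastapi import Request
from fastapi.responses import Response


@app.post("/twilio/inbound")
async def inbound_call(request: Request):
    # Twilio posts the call details (including CallSid) as form data.
    form = await request.form()
    call_id = str(form.get("CallSid", uuid.uuid4()))

    async def prepare_call():
        agent = await create_agent()
        phone_user = User(name=f"Inbound call {call_id[:8]}", id=f"phone-{call_id}")
        await agent.edge.create_users([agent.agent_user, phone_user])
        agent.edge.agent_user_id = agent.agent_user.id
        stream_call = await agent.create_call("default", call_id)
        return agent, phone_user, stream_call

    twilio_call = call_registry.create(call_id, prepare=prepare_call)
    url = f"wss://{NGROK_URL}/twilio/media/{call_id}/{twilio_call.token}"

    # Answer with TwiML that bridges the caller into the media WebSocket above.
    return Response(
        content=twilio.create_media_stream_twiml(url),
        media_type="application/xml",
    )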
If everything goes smoothly after running NGROK_URL=https://90c0-176-72-38-94.ngrok-free.app uv run music_gen_via_phone_call.py --from +19810902211 --to +328458519934, the familiar phone calling UI will be launched on your mobile device. On iPhone and iPad, it will open the built-in Phone app. The same is true for Android devices.
Troubleshooting Guide
As you just noticed, configuring and generating music via phone calls with Lyria 3 in Vision Agents involves several steps, and any misstep can result in an unsuccessful run. Here are some known issues and how to fix them when you get errors in the Terminal.
- No Phone Call Interface: This may occur when there are spaces in the phone numbers (inbound or outbound). Both the specified numbers for inbound and outbound calls must not contain spaces.
- Twilio Phone Number Not Active: To call the agent from your phone, the Twilio number must be activated and registered.
- NGROK URL Not Specified: Ensure the NGROK server URL you specify on the Twilio dashboard matches the one currently in use. Running ngrok http 8000 will generate a different URL each time.
Where To Go From Here
Congratulations 🎉 👏. You have now discovered fun and different ways to make short and full-length AI music with Lyria 3. We began with AI music generation using the Lyria 3 models via the Gemini API, and then you created AI sounds in Vision Agents during video and phone calls.
Music generation with Lyria goes beyond what we covered in this article. The models allow you to steer their output in many ways, for example, starting with guitar and mixing in piano somewhere in the middle. Check out the Lyria 3 docs to learn more about customizing your AI music generation output. Additionally, creating effective music with Lyria 3 involves mastering your text and voice prompts. Refer to the Lyria 3 prompt guide for details.
We focused on one of the many ways to use speech AI for fun: generating AI music with nothing but your voice over the phone. The Vision Agents repo has several other examples you can explore, build on with the open-source framework, and deploy anywhere you want.

