The instrumental background music in the video below is AI-generated using Lyria 3 by Google DeepMind. Lyria 3 lets anyone generate AI music from text and image prompts. The music demos in this article take it further by adding another input prompt modality: your voice. Let’s proceed to generate your first music with Lyria 3 via the Gemini API.
Requirements and Environment Setup
To quickly generate music with Lyria 3, you can try it in its playground. The Gemini API docs also provide code examples you can run to generate your music. The music demos in this article require the following tech stack.
- Vision Agents: To build and orchestrate the agentic system. It provides a network transport for low-latency conversational audio and allows the Gemini Live and Lyria 3 APIs to be used as Python plugins. Log in or register for a Stream dashboard account and use your API key and secret to run the sample codes in this article.
- The lyria-3-clip-preview model: For generating short clips (mp3), loops, and previews at 30 seconds.
- The lyria-3-pro-preview model: For generating full-length songs (mp3) with verses, choruses, and bridges.
- Lyria 3 Python plugin for Vision Agents.
- Gemini Live API: To access a Gemini audio generation model for realtime agentic voice output. Go to Google AI Studio and generate a new API key.
- Twilio: To provide a telephony service and allow the voice agent to be called with an Android or iOS device’s native phone call UI. You should purchase a Twilio phone number to run and test this demo.
- NGROK: To convert the URL of your localhost to a public URL for inbound and outbound phone calling with a Twilio phone number.
Music Generation With Gemini and Lyria 3
Generating AI music with Lyria 3 can be done using two models, depending on your use case. You can create a short-form clip of 30 seconds or a full-length song of about 3-4 minutes. The models are available in the Gemini API. Both support multimodal prompts, including text, images, and voice. Once you send a prompt, the Gemini API forwards the request to the Lyria 3 model for music generation.
Gemini API: How To Generate Music From Text Prompts
As shown in the diagram in the previous section, you can use three different modalities (text, image, voice) to generate AI music with Lyria 3 via the Gemini API. Depending on what you want to use the generated music for, it can be a short clip or a full-length track.
How To Generate a Short-Form Music Clip
A 30-second (short) music clip or loop works well as a soundtrack or background music for videos that need to grab people’s attention. Generating one via the Gemini API requires installing the Google GenAI SDK by running this command.
uv add google-genai
If you already have the Google GenAI SDK installed in your uv project, ensure you upgrade it.
uv add -U google-genai
We can now generate a 30-second cheerful acoustic folk song with guitar and harmonica with this Python script.
"""Generate a short (30-second) music clip with Lyria 3 Clip.

Docs: https://ai.google.dev/gemini-api/docs/music-generation
"""

import os

from dotenv import load_dotenv
from google import genai

load_dotenv()

client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

response = client.models.generate_content(
    model="lyria-3-clip-preview",
    contents=(
        "Create a 30-second cheerful acoustic folk song with "
        "guitar and harmonica."
    ),
)

for part in response.parts:
    if part.text is not None:
        print(part.text)
    elif part.inline_data is not None:
        with open("clip.mp3", "wb") as f:
            f.write(part.inline_data.data)
        print("Audio saved to clip.mp3")
In this sample code, once the music is generated, we save it to clip.mp3. To run this basic example, you should set your Gemini API key in the project’s .env.
GEMINI_API_KEY=...
When you run the script above, you should hear generated music similar to this.
How To Generate a Full-Length Song
Lyria 3 Pro understands musical structures, such as verses, choruses, and bridges. You can use it to produce songs that last a couple of minutes. The model also lets you control the duration via the prompt, for example by asking for a 2-minute song or by using timestamps to define the music structure.
import os

from dotenv import load_dotenv
from google import genai

load_dotenv()

client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

response = client.models.generate_content(
    model="lyria-3-pro-preview",
    contents=(
        "An epic cinematic orchestral piece about a journey home. "
        "Starts with a solo piano intro, builds through sweeping "
        "strings, and climaxes with a massive wall of sound."
    ),
)

lyrics: list[str] = []
audio_data: bytes | None = None

for part in response.parts:
    if part.text is not None:
        lyrics.append(part.text)
    elif part.inline_data is not None:
        audio_data = part.inline_data.data

if lyrics:
    print("Lyrics / structure:\n" + "\n".join(lyrics))

if audio_data:
    with open("full_length_song.mp3", "wb") as f:
        f.write(audio_data)
    print("Audio saved to full_length_song.mp3")
Without specifying a custom duration, running this sample code will generate a track of about 3 minutes. To steer the length and structure instead, you can extend the prompt as in the sketch below.
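For example, here is an illustrative way to ask Lyria 3 Pro for a shorter song with a rough structure, reusing the client from the script above. The timestamp phrasing is only an assumption for illustration; check the Lyria 3 docs for the supported prompt patterns.

# Illustrative prompt only: steer duration and structure through the text prompt.
response = client.models.generate_content(
    model="lyria-3-pro-preview",
    contents=(
        "Create a 2-minute upbeat indie folk song. "
        "0:00-0:20 soft piano intro, 0:20-1:00 first verse with drums, "
        "1:00-1:40 big chorus, 1:40-2:00 stripped-back outro."
    ),
)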
Gemini API: How To Generate Music From Input Images
Using Lyria 3 Pro, you can generate short and full-length music from a reference image. Once you supply an input image, the model uses its vision capabilities to analyze the scene and mood of the image and create a song out of it. To test this example, create a new Python file music_from_image.py and fill it with this code. In the same directory where your music_from_image.py is located, add an image of your choice, for example, birds_landscape.jpeg.
import os

from dotenv import load_dotenv
from google import genai
from PIL import Image

load_dotenv()

client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

IMAGE_PATH = "birds_landscape.jpeg"
image = Image.open(IMAGE_PATH)

response = client.models.generate_content(
    model="lyria-3-pro-preview",
    contents=[
        (
            "An atmospheric ambient track inspired by the birds and landscape mood and "
            "colors in this image."
        ),
        image,
    ],
)

lyrics: list[str] = []
audio_data: bytes | None = None

for part in response.parts:
    if part.text is not None:
        lyrics.append(part.text)
    elif part.inline_data is not None:
        audio_data = part.inline_data.data

if lyrics:
    print("Lyrics / structure:\n" + "\n".join(lyrics))

if audio_data:
    with open("music_from_image.mp3", "wb") as f:
        f.write(audio_data)
    print("Audio saved to music_from_image.mp3")
To generate the music with this Python script, you should specify a path to the reference image and open it. You should also specify where you want to save the output song.
IMAGE_PATH = "birds_landscape.jpeg"

image = Image.open(IMAGE_PATH)
The example here uses this prompt.
An atmospheric ambient track inspired by the birds and landscape mood and colors in this image
Gemini Live API and Vision Agents: Generate Music Via a Video Call
In the previous sections, we generated AI music with Lyria 3 via the Gemini API. This section will show you how to generate music with your voice by interacting with an agent during a video call in Vision Agents. Vision Agents helps developers bring voice, video, and vision to AI applications. You can, for example, use it to build a group video-calling and live-meeting service with AI integration for meeting notes, summaries, and more.
Start With a New Python Project
Begin with a uv-based Python project, install Vision Agents, Gemini, and Lyria 3 plugins, and configure your environment variables.
# New project with uv
uv init

# Install Vision Agents and the Gemini plugin
uv add vision-agents
uv add "vision-agents[getstream, gemini]"

# In your .env
STREAM_API_KEY=...
STREAM_API_SECRET=...
EXAMPLE_BASE_URL=https://demo.visionagents.ai
GOOGLE_API_KEY=...
Running uv add "vision-agents[getstream, gemini]" installs getstream as the default audio and video transport, along with the gemini plugin for Vision Agents. What is still missing is the Lyria 3 model. The gemini (Gemini Live API) plugin endpoint differs from that of Lyria 3: the Gemini Live (audio generation) models are supported in Vision Agents by default, but the Lyria 3 models are not.
To bring the Lyria 3 plugin into your Python project, clone its repo to the root of the project and run uv sync to install all the Lyria dependencies. Alternatively, you can add the Lyria 3 music generation models to Vision Agents by creating Lyria 3 as a custom plugin.
Let’s now create a Python script in Vision Agents for the music generation AI.
"""
Environment variables needed:
- GOOGLE_API_KEY (for Gemini Realtime + Lyria RealTime)
- STREAM_API_KEY and STREAM_API_SECRET (for Stream Video)
"""

import asyncio

from vision_agents.core import Agent, AgentLauncher, Runner, User
from vision_agents.plugins import gemini, getstream
from vision_agents.plugins.lyria import MusicProcessor


async def create_agent(**kwargs) -> Agent:
    processor = MusicProcessor(
        initial_prompt="Ambient chill music",
        bpm=90,
        density=0.5,
        brightness=0.5,
        duration_seconds=30,
    )

    llm = gemini.Realtime(fps=3)

    @llm.register_function(
        description="Generate a 30-second music track based on the user's description. "
        "Use descriptive terms like genre, instruments, mood, and tempo. "
        "Returns immediately while music generates in the background."
    )
    async def generate_music(prompt: str) -> str:
        await processor.generate_music_async(prompt=prompt)
        return (
            f"Music generation started for: {prompt}. It will be saved to the "
            "generated_music folder when complete (~30 seconds)."
        )

    @llm.register_function(
        description="Change the music style/genre for future generations."
    )
    async def change_music_style(prompt: str) -> str:
        await processor.update_prompt(prompt)
        return f"Music style changed to: {prompt}"

    @llm.register_function(
        description="Adjust the tempo (BPM) of the music. Range: 40-180."
    )
    async def set_tempo(bpm: int) -> str:
        await processor.set_config(bpm=bpm)
        return f"Tempo set to {bpm} BPM"

    return Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Music Generator", id="lyria-music-agent"),
        instructions=(
            "You are a music-generating AI assistant powered by Google's Lyria 3. "
            "When users describe the kind of music they want, use the generate_music "
            "function to create a 30-second instrumental track. You can also adjust "
            "the tempo and style. Keep your responses friendly and musical. "
            "Describe what you're generating before starting."
        ),
        llm=llm,
        processors=[processor],
    )


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    call = await agent.create_call(call_type, call_id)
    async with agent.join(call):
        await agent.finish()


Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
The Lyria 3 music generator agent, powered by Vision Agents, creates 30-second instrumental music tracks from voice prompts during video calls. This example uses Gemini Live for native speech-to-speech. When you run the above sample code, you should be able to have live video conversations with the AI agent to generate any music you could imagine.
Gemini Live API and Vision Agents: Generate Music Via a Phone Call
With the simple Python script in the previous section, we were able to create music in Vision Agents via a video call using Lyria 3 and the Gemini Live API. Is that also possible with actual phone calls? Let’s find out in this section.
Similar to how we generated music with Lyria 3 in Vision Agents in the previous section, we use the same setup here, with some additional installation and configuration.
# New project with uv
uv init

# Install Vision Agents and the Gemini plugin
uv add vision-agents
uv add "vision-agents[getstream, gemini]"

# In your .env
STREAM_API_KEY=...
STREAM_API_SECRET=...
NGROK_URL=https://90c0-176-72-38-94.ngrok-free.app
GOOGLE_API_KEY=...
TWILIO_ACCOUNT_SID=...
TWILIO_AUTH_TOKEN=...

# Run
NGROK_URL=https://90c0-176-72-38-94.ngrok-free.app uv run music_gen_via_phone_call.py --from +19810902211 --to +328458519934
The following are the additional requirements on top of the previous section’s setup.
Set NGROK URL
In the previous section, we used EXAMPLE_BASE_URL=https://demo.visionagents.ai to launch Stream Video and interact with the voice agent after starting the Python script. This is not needed for phone calling. Instead, we create an agent, run it on localhost, and use NGROK to expose it to the public. Install NGROK with this command.
brew install ngrok
After installing NGROK with Brew, launch the Terminal and run this.
ngrok http 8000
That will generate a public NGROK URL similar to the one highlighted in this image.
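If you prefer not to copy the URL by hand, NGROK also serves a local inspection API (http://127.0.0.1:4040 by default) that lists the active tunnels. A small sketch, assuming the default inspection port:

import json
import urllib.request

# Ask ngrok's local inspection API for the active tunnels (default port 4040).
with urllib.request.urlopen("http://127.0.0.1:4040/api/tunnels") as resp:
    tunnels = json.load(resp)["tunnels"]

# Use the HTTPS tunnel's public URL as NGROK_URL.
public_url = next(t["public_url"] for t in tunnels if t["public_url"].startswith("https"))
print(public_url)  # e.g. https://90c0-176-72-38-94.ngrok-free.app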
Obtain a Twilio Phone Number
The public NGROK URL you just generated is required to run the agent via phone, using an active Twilio number. If you do not have one yet, buy a new Twilio phone number, go to your dashboard, and set the Call comes in webhook URL to the generated NGROK URL. For our demo, it's https://90c0-176-72-38-94.ngrok-free.app but yours may be different.
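If you would rather script this step than click through the dashboard, the Twilio Python SDK can point the number’s voice webhook at your NGROK URL. This is a minimal sketch; the phone number and the /twilio/inbound path are placeholders you would adapt to your own setup.

import os

from twilio.rest import Client

client = Client(os.environ["TWILIO_ACCOUNT_SID"], os.environ["TWILIO_AUTH_TOKEN"])

ngrok_url = "https://90c0-176-72-38-94.ngrok-free.app"  # your current NGROK URL

# Look up the purchased number and update its "A call comes in" webhook.
numbers = client.incoming_phone_numbers.list(phone_number="+19810902211", limit=1)
if numbers:
    numbers[0].update(voice_url=f"{ngrok_url}/twilio/inbound", voice_method="POST")
    print(f"Webhook set for {numbers[0].phone_number}")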
Configure the Phone Call Agent
import asyncio
import logging
import os
import uuid

import click
import uvicorn
from dotenv import load_dotenv
from fastapi import FastAPI, WebSocket
from twilio.rest import Client

from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, getstream, twilio
from plugins.lyria.vision_agents.plugins.lyria import MusicProcessor

load_dotenv()

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

NGROK_URL = os.environ["NGROK_URL"].replace("https://", "").replace("http://", "").rstrip("/")

app = FastAPI()
call_registry = twilio.TwilioCallRegistry()


async def create_agent() -> Agent:
    processor = MusicProcessor(
        initial_prompt="Ambient chill music",
        bpm=90,
        density=0.5,
        brightness=0.5,
        duration_seconds=30,
    )

    llm = gemini.Realtime()

    @llm.register_function(
        description="Generate a 30-second instrumental music track. "
        "Accepts a voice prompt describing the desired genre, instruments, mood, or style. "
        "Returns immediately while music generates in the background."
    )
    async def generate_music(prompt: str) -> str:
        await processor.generate_music_async(prompt=prompt)
        return (
            f"Music generation started for: {prompt}. "
            "The track will be ready in about 30 seconds and will play automatically."
        )

    @llm.register_function(
        description="Change the music style for the next generation."
    )
    async def change_music_style(prompt: str) -> str:
        await processor.update_prompt(prompt)
        return f"Music style changed to: {prompt}"

    @llm.register_function(
        description="Set the tempo (beats per minute) for music generation. Range: 40-180."
    )
    async def set_tempo(bpm: int) -> str:
        await processor.set_config(bpm=bpm)
        return f"Tempo set to {bpm} BPM"

    @llm.register_function(
        description="Blend two music styles with weights (0.0-1.0). "
        "Example: style1='Jazz', weight1=0.7, style2='Electronic', weight2=0.3"
    )
    async def blend_styles(
        style1: str, weight1: float, style2: str, weight2: float
    ) -> str:
        prompts = [
            {"text": style1, "weight": weight1},
            {"text": style2, "weight": weight2},
        ]
        await processor.set_weighted_prompts(prompts)
        return f"Blending styles: {style1}:{weight1}, {style2}:{weight2}"

    return Agent(
        edge=getstream.Edge(),
        agent_user=User(id="ai-agent", name="Lyria Music Agent"),
        instructions=(
            "You are a music-generating AI assistant on a phone call, powered by "
            "Google's Lyria 3. When the user describes the kind of music they want, "
            "use the generate_music function to create a 30-second instrumental track. "
            "You can also adjust the tempo with set_tempo, change the style with "
            "change_music_style, or blend two styles with blend_styles. "
            "Start by greeting the caller and asking what kind of music they'd like. "
            "Keep your responses concise and friendly — this is a phone call."
        ),
        llm=llm,
        processors=[processor],
    )


async def initiate_outbound_call(from_number: str, to_number: str) -> str:
    """Initiate an outbound call via Twilio. Returns the call_id."""
    twilio_client = Client(
        os.environ["TWILIO_ACCOUNT_SID"], os.environ["TWILIO_AUTH_TOKEN"]
    )
    call_id = str(uuid.uuid4())

    async def prepare_call():
        agent = await create_agent()
        phone_user = User(name=f"Outbound call {call_id[:8]}", id=f"phone-{call_id}")
        await agent.edge.create_users([agent.agent_user, phone_user])
        agent.edge.agent_user_id = agent.agent_user.id
        stream_call = await agent.create_call("default", call_id)
        logger.info("prepared the call, ready to start")
        return agent, phone_user, stream_call

    twilio_call = call_registry.create(call_id, prepare=prepare_call)
    url = f"wss://{NGROK_URL}/twilio/media/{call_id}/{twilio_call.token}"
    logger.info(
        f"Forwarding to media url: {url} \n %s", twilio.create_media_stream_twiml(url)
    )

    twilio_client.calls.create(
        twiml=twilio.create_media_stream_twiml(url),
        to=to_number,
        from_=from_number,
    )
    logger.info(f"📞 Initiated call {call_id} from {from_number} to {to_number}")
    return call_id


@app.websocket("/twilio/media/{call_sid}/{token}")
async def media_stream(websocket: WebSocket, call_sid: str, token: str):
    twilio_call = call_registry.validate(call_sid, token)
    logger.info(f"🔗 Media stream connected for call {call_sid}")

    twilio_stream = twilio.TwilioMediaStream(websocket)
    await twilio_stream.accept()
    twilio_call.twilio_stream = twilio_stream

    try:
        (
            agent,
            phone_user,
            stream_call,
        ) = await twilio_call.await_prepare()
        twilio_call.stream_call = stream_call

        await twilio.attach_phone_to_call(stream_call, twilio_stream, phone_user.id)

        async with agent.join(stream_call, participant_wait_timeout=0):
            await agent.llm.simple_response(
                text="Greet the caller and ask what kind of music they'd like you to generate."
            )
            await twilio_stream.run()
    finally:
        call_registry.remove(call_sid)


async def run_with_server(from_number: str, to_number: str):
    """Start the server and initiate the outbound call once ready."""
    config = uvicorn.Config(app, host="localhost", port=8000, log_level="info")
    server = uvicorn.Server(config)
    server_task = asyncio.create_task(server.serve())

    while not server.started:
        await asyncio.sleep(0.1)

    logger.info("🚀 Server ready, initiating outbound call...")
    await initiate_outbound_call(from_number, to_number)
    await server_task


@click.command()
@click.option(
    "--from",
    "from_number",
    required=True,
    help="The Twilio phone number to call from (must be active in your Twilio account)",
)
@click.option("--to", "to_number", required=True, help="The phone number to call")
def main(from_number: str, to_number: str):
    logger.info(
        "Starting Lyria music generator outbound call. "
        "Note: latency is higher in dev. Deploy to US east for low latency."
    )
    asyncio.run(run_with_server(from_number, to_number))


if __name__ == "__main__":
    main()
Here is how it works. Using the above sample code, we create a Twilio-powered outbound phone-call agent that initiates a real Public Switched Telephone Network (PSTN) call, bridges the audio into a Stream call, and lets you generate Lyria 3-driven music by talking to a Gemini Live audio model in realtime.
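Under the hood, twilio.create_media_stream_twiml(url) returns TwiML that tells Twilio to answer the call and open a bidirectional media stream to our WebSocket endpoint. Conceptually it amounts to something like this sketch built with the Twilio helper library (the exact TwiML the plugin emits may differ):

from twilio.twiml.voice_response import Connect, VoiceResponse

# Roughly what the media-stream TwiML looks like: bridge the call audio
# to the /twilio/media WebSocket over wss://.
response = VoiceResponse()
connect = Connect()
connect.stream(url="wss://<your-ngrok-host>/twilio/media/<call_id>/<token>")
response.append(connect)
print(str(response))  # <Response><Connect><Stream url="wss://..." /></Connect></Response>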
To run this example in your Terminal, you should first execute the command ngrok http 8000 and keep the NGROK server running. Copy the app URL and run the following in the Terminal.
NGROK_URL=https://90c0-176-72-38-94.ngrok-free.app uv run music_gen_via_phone_call.py --from +19810902211 --to +328458519934
From the above:
- https://90c0-176-72-38-94.ngrok-free.app is your currently running NGROK app’s URL.
- --from +19810902211 is your purchased Twilio phone number.
- --to +328458519934 is the phone number you want to call.
Note: For an inbound phone call (call initiation from your phone to a Twilio number), the positions of the two phone numbers above must be switched.
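For the inbound case, you also need an HTTP route at the webhook URL you configured on the Twilio dashboard that answers with the same media-stream TwiML. Here is a hypothetical sketch that reuses app, call_registry, create_agent, NGROK_URL, and the media WebSocket from the script above; the /twilio/inbound path is an assumption and must match your Twilio webhook.

import uuid

from fastapi import Request
from fastapi.responses import Response


@app.post("/twilio/inbound")
async def inbound_call(request: Request):
    # Twilio posts the call details (including CallSid) as form data.
    form = await request.form()
    call_id = str(form.get("CallSid", uuid.uuid4()))

    async def prepare_call():
        agent = await create_agent()
        phone_user = User(name=f"Inbound call {call_id[:8]}", id=f"phone-{call_id}")
        await agent.edge.create_users([agent.agent_user, phone_user])
        agent.edge.agent_user_id = agent.agent_user.id
        stream_call = await agent.create_call("default", call_id)
        return agent, phone_user, stream_call

    twilio_call = call_registry.create(call_id, prepare=prepare_call)
    url = f"wss://{NGROK_URL}/twilio/media/{call_id}/{twilio_call.token}"

    # Answer with TwiML that bridges the caller into the media WebSocket above.
    return Response(
        content=twilio.create_media_stream_twiml(url),
        media_type="application/xml",
    )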
If everything goes smoothly after running NGROK_URL=https://90c0-176-72-38-94.ngrok-free.app uv run music_gen_via_phone_call.py --from +19810902211 --to +328458519934, the familiar phone calling UI will be launched on your mobile device. On iPhone and iPad, it will open the built-in Phone app. The same is true for Android devices.
Troubleshooting Guide
As you just noticed, configuring and generating music via phone calls with Lyria 3 in Vision Agents involves several steps, and any misstep can result in an unsuccessful run. Here are some known issues and how to fix them when you get errors in the Terminal.
- No Phone Call Interface: This may occur when there are spaces in the phone numbers (inbound or outbound). Both the specified numbers for inbound and outbound calls must not contain spaces.
- Twilio Phone Number Not Active: To call the agent from your phone, the Twilio number must be activated and registered.
- NGROK URL Not Specified: Ensure the NGROK server URL you specify on the Twilio dashboard matches the one currently in use. Running ngrok http 8000 will generate a different URL each time.
Where To Go From Here
Congratulations 🎉 👏. You have now discovered fun and different ways to make short and full-length AI music with Lyria 3. We began with AI music generation using the Lyria 3 models via the Gemini API, and then you created AI sounds in Vision Agents during video and phone calls.
Music generation with Lyria goes beyond what we covered in this article. The models allow you to steer their output in many ways, for example, starting with guitar and mixing in piano somewhere in the middle. Check out the Lyria 3 docs to learn more about customizing your AI music generation output. Additionally, creating effective music with Lyria 3 involves mastering your text and voice prompts. Refer to the Lyria 3 prompt guide for details.
We focused on one of the many ways to use speech AI for fun: generating AI music with nothing but your voice over the phone. The Vision Agents repo has several other examples you can explore, build on with the open-source framework, and deploy anywhere you want.

