Did you know? All Video & Audio API plans include a $100 free usage credit each month so you can build and test risk-free. View Plans ->

Android AI Voice Assistant Tutorial

This tutorial teaches you how to quickly build a production-ready voice AI agent with the OpenAI Realtime API, using Stream’s video edge network, Kotlin, and Node.

  • The instructions to the agent are sent server-side (Node), so you can do function calling or RAG
  • The integration uses Stream’s video edge network (for low latency) and WebRTC (so it works under slow/unreliable network conditions)
  • You have full control over the AI setup and visualization

The result will look something like this:

While this tutorial uses Node + Kotlin, you could achieve something similar with any other backend language and Stream SDK (Swift, Kotlin, React, JS, Flutter, React Native, Unity, etc.).

Step 1 - Connect AI agent to Stream from your backend

First, we will build a small agent app that adds an AI agent to a call in a demo application. Along the way, we will introduce the basic concepts behind our AI voice assistant offering.

Step 1.1 - OpenAI and Stream credentials

To get started, you first need an OpenAI account and an API key. Please note that the OpenAI credentials are never shared client-side; they are only exchanged between your servers and Stream’s servers.

Additionally, you will need a Stream account and use the API key and secret from the Stream dashboard.

Step 1.2 - Create the Node.js project

Make sure you are using a recent version of Node.js, such as 22 or later; you can check with node -v.
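For example:

bash
node -v
# should print v22.x.x or newer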

First, let’s create a new folder called “openai-audio-tutorial”. From the terminal, go to the folder, and run the following command:

bash
npm init -y

This command generates a package.json file with default settings.

Step 1.3 - Installing the dependencies

Next, let’s update the generated package.json with the following content:

package.json (json)
{
  "name": "@stream-io/video-ai-demo-server",
  "type": "module",
  "dependencies": {
    "@hono/node-server": "^1.13.8",
    "@stream-io/node-sdk": "^0.4.14",
    "@stream-io/openai-realtime-api": "^0.1.0",
    "dotenv": "^16.3.1",
    "hono": "^4.7.4",
    "open": "^10.1.0"
  },
  "scripts": {
    "server": "node ./server.mjs",
    "standalone-ui": "node ./standalone.mjs"
  }
}

Then, run the following command to install the dependencies:

bash
npm install

Step 1.4 - Setup the credentials

Create a .env file in the project root with the following variables:

.env (text)
# Stream API credentials
STREAM_API_KEY=REPLACE_WITH_API_KEY
STREAM_API_SECRET=REPLACE_WITH_API_SECRET

# OpenAI API key
OPENAI_API_KEY=your_openai_api_key

Then edit the .env file and replace the placeholders with your actual API keys from Step 1.1.

Step 1.5 - Implement the standalone-ui script

Before diving into the Android integration, we will build a simple server-side script that connects an AI agent to a call and lets you talk to it from a simple web app.

Create a file called standalone.mjs and paste this content.

standalone.mjs (js)
import { config } from 'dotenv';
import { StreamClient } from '@stream-io/node-sdk';
import open from 'open';
import crypto from 'crypto';

// load config from dotenv
config();

async function main() {
  // Get environment variables
  const streamApiKey = process.env.STREAM_API_KEY;
  const streamApiSecret = process.env.STREAM_API_SECRET;
  const openAiApiKey = process.env.OPENAI_API_KEY;

  // Check if all required environment variables are set
  if (!streamApiKey || !streamApiSecret || !openAiApiKey) {
    console.error("Error: Missing required environment variables, make sure to have a .env file in the project root, check .env.example for reference");
    process.exit(1);
  }

  const streamClient = new StreamClient(streamApiKey, streamApiSecret);
  const call = streamClient.video.call("default", crypto.randomUUID());

  // realtimeClient is an instance of https://github.com/openai/openai-realtime-api-beta
  const realtimeClient = await streamClient.video.connectOpenAi({
    call,
    openAiApiKey,
    agentUserId: "lucy",
  });

  // Set up event handling, all events from the OpenAI Realtime API are available here, see:
  // https://platform.openai.com/docs/api-reference/realtime-server-events
  realtimeClient.on('realtime.event', ({ time, source, event }) => {
    console.log(`got an event from OpenAI ${event.type}`);
    if (event.type === 'response.audio_transcript.done') {
      console.log(`got a transcript from OpenAI ${event.transcript}`);
    }
  });

  realtimeClient.updateSession({
    instructions: "You are a helpful assistant that can answer questions and help with tasks.",
  });

  // Get token for the call
  const token = streamClient.generateUserToken({ user_id: "theodore" });

  // Construct the URL, TODO: replace this with
  const callUrl = `https://pronto.getstream.io/join/${call.id}?type=default&api_key=${streamClient.apiKey}&token=${token}`;

  // Open the browser
  console.log(`Opening browser to join the call... ${callUrl}`);
  await open(callUrl);
}

main().catch(error => {
  console.error("Error:", error);
  process.exit(1);
});

Step 1.6 - Running the sample

At this point, we can run the script with this command:

bash
npm run standalone-ui

This will open your browser and connect you to a call where you can talk to the OpenAI agent. As you talk to the agent, you will notice that your shell prints a log line for each event OpenAI sends.
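The exact transcript depends on your conversation, but based on the console.log calls above you should see log lines similar to these (the reply text is just a placeholder):

text
got an event from OpenAI response.audio_transcript.done
got a transcript from OpenAI <the agent's spoken reply>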

Let’s take a quick look at what is happening in the server-side code we just added:

  1. Here we instantiate the Stream Node SDK with the API credentials and then use it to create a new call object. That call will host the conversation between the user and the AI agent.
js
const streamClient = new StreamClient(streamApiKey, streamApiSecret);
const call = streamClient.video.call("default", crypto.randomUUID());
  2. The next step is to have the agent connect to the call and obtain an OpenAI Realtime API client. The connectOpenAi function instantiates the Realtime API client and then uses the Stream API to connect the agent to the call. The agent joins the call as a user with ID "lucy".
js
const realtimeClient = await streamClient.video.connectOpenAi({
  call,
  openAiApiKey,
  agentUserId: "lucy",
});
  3. We then use the realtimeClient object to pass instructions to OpenAI and to listen to events emitted by OpenAI. The interesting bit is that realtimeClient is an instance of OpenAI’s official API client, which gives you full control over what you can do with OpenAI.
js
realtimeClient.on('realtime.event', ({ time, source, event }) => {
  console.log(`got an event from OpenAI ${event.type}`);
  if (event.type === 'response.audio_transcript.done') {
    console.log(`got a transcript from OpenAI ${event.transcript}`);
  }
});

realtimeClient.updateSession({
  instructions: "You are a helpful assistant that can answer questions and help with tasks.",
});

Step 2 - Setup your server-side integration

This example was pretty simple to set up and showcases how easy it is to add an AI bot to a Stream call. When building a real application, you will need your backend to handle authentication for your clients, as well as send instructions to OpenAI (RAG and function calling, which most applications need, have to run on your backend).

So the backend we are going to build will take care of two things:

  1. Generate a valid token for the Android app to join the call running on Stream
  2. Use Stream APIs to join the same call with the AI agent and set it up with instructions

Step 2.1 - Implement the server.mjs

Create a new file in the same project, called server.mjs, and add the following code:

server.mjs (js)
import { serve } from "@hono/node-server";
import { StreamClient } from "@stream-io/node-sdk";
import { Hono } from "hono";
import crypto from 'crypto';
import { config } from 'dotenv';

// load config from dotenv
config();

// Get environment variables
const streamApiKey = process.env.STREAM_API_KEY;
const streamApiSecret = process.env.STREAM_API_SECRET;
const openAiApiKey = process.env.OPENAI_API_KEY;

// Check if all required environment variables are set
if (!streamApiKey || !streamApiSecret || !openAiApiKey) {
  console.error("Error: Missing required environment variables, make sure to have a .env file in the project root, check .env.example for reference");
  process.exit(1);
}

const app = new Hono();
const streamClient = new StreamClient(streamApiKey, streamApiSecret);

/**
 * Endpoint to generate credentials for a new video call.
 * Creates a unique call ID, generates a token, and returns necessary connection details.
 */
app.get("/credentials", (c) => {
  console.log("got a request for credentials");
  // Generate a shorter UUID for callId (first 12 chars)
  const callId = crypto.randomUUID().replace(/-/g, '').substring(0, 12);
  // Generate a shorter UUID for userId (first 8 chars with prefix)
  const userId = `user-${crypto.randomUUID().replace(/-/g, '').substring(0, 8)}`;
  const callType = "default";
  const token = streamClient.generateUserToken({
    user_id: userId,
  });
  return c.json({ apiKey: streamApiKey, token, callType, callId, userId });
});

/**
 * Endpoint to connect an AI agent to an existing video call.
 * Takes call type and ID parameters, connects the OpenAI agent to the call,
 * sets up the realtime client with event handlers and tools,
 * and returns a success response when complete.
 */
app.post("/:callType/:callId/connect", async (c) => {
  console.log("got a request for connect");
  const callType = c.req.param("callType");
  const callId = c.req.param("callId");
  const call = streamClient.video.call(callType, callId);
  const realtimeClient = await streamClient.video.connectOpenAi({
    call,
    openAiApiKey,
    agentUserId: "lucy",
  });
  await setupRealtimeClient(realtimeClient);
  console.log("agent is connected now");
  return c.json({ ok: true });
});

async function setupRealtimeClient(realtimeClient) {
  realtimeClient.on("error", (event) => {
    console.error("Error:", event);
  });

  realtimeClient.on("session.update", (event) => {
    console.log("Realtime session update:", event);
  });

  realtimeClient.updateSession({
    instructions: "You are a helpful assistant that can answer questions and help with tasks.",
  });

  realtimeClient.addTool(
    {
      name: "get_weather",
      description:
        "Call this function to retrieve current weather information for a specific location. Provide the city name.",
      parameters: {
        type: "object",
        properties: {
          city: {
            type: "string",
            description: "The name of the city to get weather information for",
          },
        },
        required: ["city"],
      },
    },
    async ({ city, country, units = "metric" }) => {
      console.log("get_weather request", { city, country, units });
      try {
        // This is a placeholder for actual weather API implementation
        // In a real implementation, you would call a weather API service here
        const weatherData = {
          location: country ? `${city}, ${country}` : city,
          temperature: 22,
          units: units === "imperial" ? "°F" : "°C",
          condition: "Partly Cloudy",
          humidity: 65,
          windSpeed: 10,
        };
        return weatherData;
      } catch (error) {
        console.error("Error fetching weather data:", error);
        return { error: "Failed to retrieve weather information" };
      }
    },
  );

  return realtimeClient;
}

// Start the server
serve({
  fetch: app.fetch,
  hostname: "0.0.0.0",
  port: 3000,
});
console.log(`Server started on :3000`);

In the code above, we set up two endpoints: /credentials, which generates a unique call ID and an authentication token, and /:callType/:callId/connect, which connects the AI agent (which we call “lucy”) to a specific video call. The assistant follows predefined instructions, in this case trying to be helpful with tasks. Based on the purpose of your AI bot, you should update these instructions accordingly. We also show an example of function calling, using the get_weather tool.

Step 2.2 - Running the server

We can run the server now; it will listen on port 3000:

bash
npm run server

To make sure everything is working as expected, you can run a curl GET request from your terminal.

bash
curl -X GET http://localhost:3000/credentials

As a result, you should see the credentials required to join the call. With that, we’re all set up server-side!
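If you also want to exercise the connect endpoint from the terminal, the /credentials response is a small JSON object whose field names mirror what the handler returns (the values below are placeholders and will differ on every request); feed its callType and callId into the connect route, for example:

bash
# Example /credentials response (placeholder values):
# {"apiKey":"...","token":"...","callType":"default","callId":"b3f2a1c4d5e6","userId":"user-1a2b3c4d"}

# Connect the AI agent to that call, using the values from the response above
curl -X POST http://localhost:3000/default/b3f2a1c4d5e6/connect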

Step 3 - Setting up the Android project

Now, let’s switch to the Android app, which will connect to this API and provide visualizations of the AI’s audio levels.

Step 3.1 - Adding the Stream Video dependency

Let’s create a new project, such as AIVideoDemo, and add the Stream Video Android SDK.

Follow the steps here to add the SDK as a dependency to your project.

Step 3.2 - Add other dependencies

Since you interact with the AI using your voice, the app needs the microphone permission; we will use the Accompanist permissions library to request it at runtime.

You also need to make network calls, so include Retrofit.

You can copy the content from the following libs.versions.toml and build.gradle and adjust them as needed.

libs.versions.toml (toml)
[versions]
agp = "8.8.1"
kotlin = "2.0.0"
coreKtx = "1.15.0"
lifecycleRuntimeKtx = "2.8.7"
activityCompose = "1.10.0"
composeBom = "2024.04.01"
stream = "1.3.1"
retrofit = "2.11.0"
accompanist = "0.33.1-alpha"
kotlinxSerializationJson = "1.6.0"

[libraries]
androidx-core-ktx = { group = "androidx.core", name = "core-ktx", version.ref = "coreKtx" }
androidx-lifecycle-runtime-ktx = { group = "androidx.lifecycle", name = "lifecycle-runtime-ktx", version.ref = "lifecycleRuntimeKtx" }
androidx-lifecycle-runtime-compose = { group = "androidx.lifecycle", name = "lifecycle-runtime-compose", version.ref = "lifecycleRuntimeKtx" }
androidx-lifecycle-viewmodel-compose = { group = "androidx.lifecycle", name = "lifecycle-viewmodel-compose", version.ref = "lifecycleRuntimeKtx" }
androidx-activity-compose = { group = "androidx.activity", name = "activity-compose", version.ref = "activityCompose" }
androidx-activity-ktx = { group = "androidx.activity", name = "activity-ktx", version.ref = "activityCompose" }
androidx-compose-bom = { group = "androidx.compose", name = "compose-bom", version.ref = "composeBom" }
androidx-ui = { group = "androidx.compose.ui", name = "ui" }
androidx-ui-graphics = { group = "androidx.compose.ui", name = "ui-graphics" }
androidx-ui-tooling-preview = { group = "androidx.compose.ui", name = "ui-tooling-preview" }
androidx-material3 = { group = "androidx.compose.material3", name = "material3" }
getstream-video-android-ui-core = { group = "io.getstream", name = "stream-video-android-ui-core", version.ref = "stream" }
retrofit = { group = "com.squareup.retrofit2", name = "retrofit", version.ref = "retrofit" }
retrofit-gson = { group = "com.squareup.retrofit2", name = "converter-gson", version.ref = "retrofit" }
accompanist-permissions = { group = "com.google.accompanist", name = "accompanist-permissions", version.ref = "accompanist" }
kotlinx-serialization-json = { group = "org.jetbrains.kotlinx", name = "kotlinx-serialization-json", version.ref = "kotlinxSerializationJson" }
build.gradle (gradle)
dependencies {
    implementation(libs.androidx.core.ktx)
    implementation(libs.androidx.lifecycle.runtime.ktx)
    implementation(libs.androidx.lifecycle.runtime.compose)
    implementation(libs.androidx.lifecycle.viewmodel.compose)
    implementation(libs.androidx.activity.compose)
    implementation(libs.androidx.activity.ktx)
    implementation(platform(libs.androidx.compose.bom))
    implementation(libs.androidx.ui)
    implementation(libs.androidx.ui.graphics)
    implementation(libs.androidx.ui.tooling.preview)
    implementation(libs.androidx.material3)
    implementation(libs.getstream.video.android.ui.core)
    implementation(libs.retrofit)
    implementation(libs.retrofit.gson)
    implementation(libs.kotlinx.serialization.json)
    implementation(libs.accompanist.permissions)
}

Step 4 - Stream Video Setup

Step 4.1 - Declaring the required properties

It’s time to write some Android code.

Create a new file named ApiService.kt and add the following code. This class will act as our network layer.

ApiService.kt (kotlin)
package io.getstream.ai.audiodemo

import io.getstream.ai.audiodemo.data.Credentials
import okhttp3.ResponseBody
import retrofit2.Retrofit
import retrofit2.converter.gson.GsonConverterFactory
import retrofit2.http.GET
import retrofit2.http.POST
import retrofit2.http.Path

interface ApiService {
    @GET("credentials")
    suspend fun getCredentials(): Credentials

    @POST("{callType}/{channelId}/connect")
    suspend fun connectAI(
        @Path("callType", encoded = true) callType: String,
        @Path("channelId", encoded = true) channelId: String
    ): ResponseBody
}

object RetrofitInstance {
    private val BASE_URL = "http://10.0.2.2:3000/"

    val api: ApiService by lazy {
        Retrofit.Builder()
            .baseUrl(BASE_URL)
            .addConverterFactory(GsonConverterFactory.create())
            .build()
            .create(ApiService::class.java)
    }
}

Create a file named MainViewModel.kt. This class will be responsible for:

  1. Making the network calls
  2. Holding UI States
  3. Holding the StreamVideo and Call object
MainViewModel.kt (kotlin)
class MainViewModel(val app: Application) : AndroidViewModel(app) {

    var callUiState = MutableStateFlow(CallUiState.IDLE)
    val credentials = MutableStateFlow<DataLayerResponse<Credentials>>(DataLayerResponse.Initial("Initial State"))

    var client: StreamVideo? = null
    var call: Call? = null

    fun initCredentials() {
        viewModelScope.launch {
            try {
                credentials.emit(DataLayerResponse.Initial("Getting Credentials"))
                val response = RetrofitInstance.api.getCredentials()
                createStreamClient(response)
                credentials.emit(DataLayerResponse.Success(response))
            } catch (ex: Exception) {
                credentials.emit(DataLayerResponse.Error(ex.message))
            }
        }
    }

    private fun createStreamClient(credentials: Credentials) {
        val userId = credentials.userId
        val user = User(
            id = userId,
            name = "Tutorial",
            image = "https://bit.ly/2TIt8NR",
        )
        client = StreamVideoBuilder(
            context = app.applicationContext,
            apiKey = credentials.apiKey,
            geo = GEO.GlobalEdgeNetwork,
            user = user,
            token = credentials.token,
        ).build()
    }

    fun joinCall() {
        client?.let { client ->
            val credentials = (credentials.value as DataLayerResponse.Success).data
            call = client.call(credentials.callType, credentials.callId)
            connectAi(call!!, credentials.callId, credentials.callType)
        }
    }

    private fun connectAi(call: Call, channelId: String, callType: String) {
        viewModelScope.launch {
            try {
                callUiState.emit(CallUiState.JOINING)
                val encodedChannelId = URLEncoder.encode(channelId, StandardCharsets.UTF_8.toString())
                val encodedCallType = URLEncoder.encode(callType, StandardCharsets.UTF_8.toString())
                val response = RetrofitInstance.api.connectAI(encodedCallType, encodedChannelId)
                call.join(create = true)
                callUiState.emit(CallUiState.ACTIVE)
            } catch (ex: Exception) {
                ex.printStackTrace()
                callUiState.emit(CallUiState.ERROR)
            }
        }
    }

    fun disconnect() {
        viewModelScope.launch {
            call?.end()
            callUiState.emit(CallUiState.IDLE)
        }
    }
}

sealed class DataLayerResponse<T> {
    class Success<T>(val data: T) : DataLayerResponse<T>()
    class Error<T>(val message: String?) : DataLayerResponse<T>()
    class Initial<T>(val message: String) : DataLayerResponse<T>()
}

Create a file named MicrophonePermissionScreen.kt and copy this content. It contains the boilerplate for requesting the microphone permission.

MicrophonePermissionScreen.kt (kotlin)
package io.getstream.ai.audiodemo

@OptIn(ExperimentalPermissionsApi::class)
@Composable
fun MicrophonePermissionScreen(onPermissionGranted: () -> Unit) {
    val microphonePermissionState = rememberPermissionState(android.Manifest.permission.RECORD_AUDIO)

    LaunchedEffect(Unit) {
        if (!microphonePermissionState.status.isGranted) {
            microphonePermissionState.launchPermissionRequest()
        }
    }

    LaunchedEffect(microphonePermissionState.status.isGranted) {
        if (microphonePermissionState.status.isGranted) {
            onPermissionGranted()
        }
    }

    Column(
        modifier = Modifier
            .fillMaxSize()
            .padding(16.dp),
        horizontalAlignment = Alignment.CenterHorizontally,
        verticalArrangement = Arrangement.Center
    ) {
        when {
            microphonePermissionState.status.isGranted -> {}

            microphonePermissionState.status.shouldShowRationale -> {
                Column(horizontalAlignment = Alignment.CenterHorizontally) {
                    Text(
                        text = "Microphone access is required to proceed.",
                        fontSize = 18.sp,
                        color = Color.White
                    )
                    Spacer(modifier = Modifier.height(8.dp))
                    Button(onClick = { microphonePermissionState.launchPermissionRequest() }) {
                        Text(text = "Grant Permission", color = Color.White)
                    }
                }
            }

            else -> {
                Column(horizontalAlignment = Alignment.CenterHorizontally) {
                    Text(
                        text = "Microphone permission is needed to proceed.",
                        fontSize = 18.sp,
                        color = Color.White
                    )
                    Spacer(modifier = Modifier.height(8.dp))
                    Button(onClick = { microphonePermissionState.launchPermissionRequest() }) {
                        Text(text = "Request Permission", color = Color.White)
                    }
                }
            }
        }
    }
}

Create a theme file named themes.xml and place it under res/values such that your application’s background is black.

themes.xml (xml)
<?xml version="1.0" encoding="utf-8"?>
<resources>
    <style name="Theme.AiVoiceDemo" parent="android:Theme.Material.Light.NoActionBar">
        <item name="android:windowBackground">@color/black</item>
    </style>
</resources>

Create an XML file named call_end.xml and place it under res/drawable. You will use this vector icon for the button that ends the call with the AI.

call_end.xml (xml)
<vector xmlns:android="http://schemas.android.com/apk/res/android"
    android:width="24dp"
    android:height="24dp"
    android:tint="#000000"
    android:viewportWidth="24"
    android:viewportHeight="24">
    <path
        android:fillColor="@android:color/white"
        android:pathData="M12,9c-1.6,0 -3.15,0.25 -4.6,0.72v3.1c0,0.39 -0.23,0.74 -0.56,0.9 -0.98,0.49 -1.87,1.12 -2.66,1.85 -0.18,0.18 -0.43,0.28 -0.7,0.28 -0.28,0 -0.53,-0.11 -0.71,-0.29L0.29,13.08c-0.18,-0.17 -0.29,-0.42 -0.29,-0.7 0,-0.28 0.11,-0.53 0.29,-0.71C3.34,8.78 7.46,7 12,7s8.66,1.78 11.71,4.67c0.18,0.18 0.29,0.43 0.29,0.71 0,0.28 -0.11,0.53 -0.29,0.71l-2.48,2.48c-0.18,0.18 -0.43,0.29 -0.71,0.29 -0.27,0 -0.52,-0.11 -0.7,-0.28 -0.79,-0.74 -1.69,-1.36 -2.67,-1.85 -0.33,-0.16 -0.56,-0.5 -0.56,-0.9v-3.1C15.15,9.25 13.6,9 12,9z" />
</vector>

Because we are talking to a local development server over plain HTTP, we also need a network security configuration.

Create a file named network_security_config.xml and place it under res/xml

network_security_config.xml (xml)
<?xml version="1.0" encoding="utf-8"?>
<network-security-config>
    <domain-config cleartextTrafficPermitted="true">
        <domain includeSubdomains="true">10.0.2.2</domain>
    </domain-config>
</network-security-config>

Then update the AndroidManifest.xml with the following:

AndroidManifest.xml (xml)
<!-- INTERNET is needed for the API calls, RECORD_AUDIO for the microphone permission request -->
<uses-permission android:name="android.permission.INTERNET" />
<uses-permission android:name="android.permission.RECORD_AUDIO" />

<application
    android:networkSecurityConfig="@xml/network_security_config"
    ... />

After declaring these properties, we need to set them up. We will use the client (StreamVideo) object to communicate with Stream’s Video API and store the relevant call information in the call object.

We will fetch the credentials required to set up the StreamVideo object and the call from the Node.js server we created above, and populate the credentials state with its response.

Let’s add the Credentials model that reflects this response:

kotlin
package io.getstream.ai.audiodemo

data class Credentials(
    val apiKey: String,
    val token: String,
    val callId: String,
    val callType: String,
    val userId: String
)

We also declare a callUiState property of type CallUiState, which allows us to show different UI depending on the state of the app. Here are the possible values of this enum:

kotlin
enum class CallUiState {
    IDLE, JOINING, ACTIVE, ERROR
}

Step 4.2 - Fetching the credentials

Now, we will create a method to fetch the credentials from our server. Add the following code to the MainViewModel.kt file (it should already be there if you followed the steps above):

MainViewModel.kt (kotlin)
fun initCredentials() {
    viewModelScope.launch {
        try {
            credentials.emit(DataLayerResponse.Initial("Getting Credentials"))
            val response = RetrofitInstance.api.getCredentials()
            createStreamClient(response)
            credentials.emit(DataLayerResponse.Success(response))
        } catch (ex: Exception) {
            credentials.emit(DataLayerResponse.Error(ex.message))
        }
    }
}

This method sends a GET request to fetch the credentials used to set up the StreamVideo object and the call data.

Note: We are pointing at the local server here (as defined in the BASE_URL property; 10.0.2.2 is the Android emulator’s alias for your computer’s localhost), so the simplest way to test this is to run on an emulator. You can also test on a real device: set BASE_URL to your computer’s local network IP address, make sure your device and your computer are on the same WiFi network, and add that IP address to the cleartext domain list in network_security_config.xml (the local server uses HTTP and not HTTPS).
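For instance, a hypothetical BASE_URL for testing on a physical device might look like this (the IP address is just an example; use your own machine’s address):

kotlin
// Hypothetical value for a physical device: replace 192.168.1.42 with your computer's
// LAN IP address (and add that IP to network_security_config.xml as well).
private val BASE_URL = "http://192.168.1.42:3000/"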

Step 4.3 - Connecting to Stream Video

We now have the credentials, and we can connect to Stream Video. To do this, add the following code:

kotlin
private fun createStreamClient(credentials: Credentials) {
    val userId = credentials.userId
    val user = User(
        id = userId,
        name = "Tutorial",
        image = "https://bit.ly/2TIt8NR",
    )
    client = StreamVideoBuilder(
        context = app.applicationContext,
        apiKey = credentials.apiKey,
        geo = GEO.GlobalEdgeNetwork,
        user = user,
        token = credentials.token,
    ).build()
}

This method takes the fetched credentials, creates a StreamVideo object, and connects it to Stream.

It is invoked from initCredentials, which we will trigger when the root view appears in the next step.

Step 5 - Building the UI

We can now start building the UI for our app. Replace the contents of MainActivity.kt with the following code:

MainActivity.kt (kotlin)
class MainActivity : ComponentActivity() {
    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        enableEdgeToEdge()
        setContent {
            var isPermissionGranted by remember { mutableStateOf(false) }

            if (isPermissionGranted) {
                Body()
            } else {
                MicrophonePermissionScreen(onPermissionGranted = {
                    isPermissionGranted = true
                })
            }
        }
    }
}

@Composable
fun Body() {
    val viewModel: MainViewModel = viewModel(key = "MainViewModel")

    LaunchedEffect(Unit) {
        viewModel.initCredentials()
    }

    val creds = viewModel.credentials.collectAsStateWithLifecycle()

    when (creds.value) {
        is DataLayerResponse.Initial -> {
            val message = (creds.value as DataLayerResponse.Initial<Credentials>).message
            Box(
                Modifier
                    .fillMaxSize()
                    .background(Color.Black),
                contentAlignment = Alignment.Center
            ) {
                Text(message, color = Color.White)
            }
        }

        is DataLayerResponse.Success -> {
            Box(
                modifier = Modifier
                    .fillMaxSize()
                    .background(Color.Black)
            ) {
                AiContentUi()
            }
        }

        is DataLayerResponse.Error -> {
            Box(
                Modifier
                    .fillMaxSize()
                    .background(Color.Black),
                contentAlignment = Alignment.Center
            ) {
                Text("Getting Credentials Failed", color = Color.White)
            }
        }
    }
}

@Composable
fun AiContentUi() {
    val viewModel: MainViewModel = viewModel(key = "MainViewModel")
    val callUiState by viewModel.callUiState.collectAsStateWithLifecycle()

    when (callUiState) {
        CallUiState.IDLE -> {
            Box(Modifier.fillMaxSize(), contentAlignment = Alignment.Center) {
                Button(
                    colors = ButtonDefaults.buttonColors()
                        .copy(contentColor = Color.White, containerColor = Color.Black),
                    onClick = { viewModel.joinCall() }) {
                    Text("Click to talk to AI", color = Color.White)
                }
            }
        }

        CallUiState.JOINING -> {
            Box(Modifier.fillMaxSize(), contentAlignment = Alignment.Center) {
                Row(
                    horizontalArrangement = Arrangement.Center,
                    verticalAlignment = Alignment.CenterVertically
                ) {
                    Text(
                        "Waiting for AI Agent to Join...",
                        color = Color.White,
                        modifier = Modifier.padding(bottom = 12.dp)
                    )
                    Spacer(Modifier.width(8.dp))
                    CircularProgressIndicator(
                        modifier = Modifier.width(24.dp),
                        color = MaterialTheme.colorScheme.secondary,
                        trackColor = MaterialTheme.colorScheme.surfaceVariant,
                    )
                }
            }
        }

        CallUiState.ACTIVE -> {
            Box(Modifier.fillMaxSize()) {
                AISpeakingView(viewModel.call?.state!!)
                CallEndButton(
                    modifier = Modifier
                        .align(Alignment.BottomEnd)
                        .padding(12.dp),
                    onClick = { viewModel.disconnect() })
            }
        }

        CallUiState.ERROR -> {
            Box(Modifier.fillMaxSize(), contentAlignment = Alignment.Center) {
                Column(horizontalAlignment = Alignment.CenterHorizontally) {
                    Text("Something went wrong, please re-try", color = Color.White)
                    Button(
                        colors = ButtonDefaults.buttonColors()
                            .copy(contentColor = Color.White, containerColor = Color.Black),
                        onClick = { viewModel.joinCall() }) {
                        Text("Connect to AI Agent", color = Color.White)
                    }
                }
            }
        }
    }
}

@Composable
fun CallEndButton(modifier: Modifier, onClick: () -> Unit) {
    IconButton(
        onClick = onClick,
        modifier = modifier
            .size(56.dp)
            .clip(CircleShape)
            .background(Color.Red)
    ) {
        Icon(
            painter = painterResource(id = R.drawable.call_end),
            contentDescription = "End call",
            tint = Color.White
        )
    }
}

enum class CallUiState {
    IDLE, JOINING, ACTIVE, ERROR
}

As mentioned above, we first connect to StreamVideo via the viewModel.initCredentials() method. The connection flow lives in MainViewModel.

Next, we show a different UI depending on the value of callUiState.

When the state is ACTIVE, we show a new AISpeakingView. This view will display a nice audio visualization while the current user or the AI speaks. We will provide more details on this view in the next section; for now, it’s enough to declare it with a simple TODO.

kotlin
@Composable
fun AISpeakingView(callState: CallState) {
    // TODO
}

Additionally, we add an overlay that shows a button for leaving the call. For this, we use the CallEndButton composable.

Next, when the state is JOINING, we show appropriate text and a progress view.

When the state is IDLE, we show a button with the text “Click to talk to AI”. When the button is tapped, we call the joinCall method, which joins the call with the AI bot:

kotlin
fun joinCall() {
    client?.let { client ->
        val credentials = (credentials.value as DataLayerResponse.Success).data
        call = client.call(credentials.callType, credentials.callId)
        connectAi(call!!, credentials.callId, credentials.callType)
    }
}

private fun connectAi(call: Call, channelId: String, callType: String) {
    viewModelScope.launch {
        try {
            callUiState.emit(CallUiState.JOINING)
            val encodedChannelId = URLEncoder.encode(channelId, StandardCharsets.UTF_8.toString())
            val encodedCallType = URLEncoder.encode(callType, StandardCharsets.UTF_8.toString())
            val response = RetrofitInstance.api.connectAI(encodedCallType, encodedChannelId)
            call.join(create = true)
            callUiState.emit(CallUiState.ACTIVE)
        } catch (ex: Exception) {
            ex.printStackTrace()
            callUiState.emit(CallUiState.ERROR)
        }
    }
}

At this point, you can run the app, join a call, and converse with the AI agent. However, we can take this a step further and show nice visualizations based on the audio levels of the participants.

Step 6 - Visualizing the audio levels

Let’s implement the AISpeakingView next. This view will listen to the audio levels provided by the call state, for each of its participants. It will then visualize them, with a nice glowing animation, which expands and contracts based on the user’s voice amplitude. Additionally, it subtly rotates and changes shape.

Step 6.1 - AISpeakingView

Replace the existing TODO content with the following implementation:

kotlin
enum class SpeakerState {
    AI_SPEAKING, USER_SPEAKING, IDLE
}

fun SpeakerState.gradientColors(): List<Color> {
    return when (this) {
        SpeakerState.USER_SPEAKING -> listOf(
            Color.Red,
            Color.Red.copy(alpha = 0f)
        )
        else -> listOf(
            Color(0f, 0.976f, 1f),
            Color(0f, 0.227f, 1f, 0f)
        )
    }
}

@Composable
fun AISpeakingView(callState: CallState) {
    val agentId = "lucy"
    var amplitude by remember { mutableFloatStateOf(0f) }
    var audioLevels by remember { mutableStateOf(listOf<Float>()) }
    var speakerState by remember { mutableStateOf(SpeakerState.IDLE) }

    GlowView(amplitude, speakerState)

    LaunchedEffect(Unit) {
        callState.activeSpeakers.collectLatest { speakers ->
            speakers.forEach { speaker ->
                speaker.audioLevels.collectLatest { audioLevel ->
                    if (speaker.userId.value.contains("lucy")) {
                        if (speakerState != SpeakerState.AI_SPEAKING) {
                            speakerState = SpeakerState.AI_SPEAKING
                        }
                    } else {
                        if (speakerState != SpeakerState.USER_SPEAKING) {
                            speakerState = SpeakerState.USER_SPEAKING
                        }
                    }
                    audioLevels = audioLevel
                    amplitude = computeSingleAmplitude(audioLevels) * getRandomFloatInRange(1f, 2f)
                }
            }
        }
    }

    LaunchedEffect(Unit) {
        callState.activeSpeakers.collectLatest { speaker ->
            val aiSpeaker = speaker.find { it.userId.value.contains(agentId) }
            // Find the local user speaking
            val localSpeaker = speaker.find { it.userId.value == callState.me.value!!.userId.value }
            if (aiSpeaker == null && localSpeaker == null) {
                speakerState = SpeakerState.IDLE
                audioLevels = emptyList()
                amplitude = 0f
            }
        }
    }
}

fun computeSingleAmplitude(levels: List<Float>): Float {
    val normalized = normalizePeak(levels)
    if (normalized.isEmpty()) return 0f
    return normalized.average().toFloat()
}

fun normalizePeak(levels: List<Float>): List<Float> {
    val maxLevel = levels.maxOfOrNull { abs(it) } ?: return levels
    return if (maxLevel > 0) levels.map { it / maxLevel } else levels
}

Before we dive into the GlowView implementation, let’s discuss the callState.activeSpeakers.collectLatest flow, which emits whenever the active speakers in the call change.

We call our agent “lucy”, and we use that ID to tell the agent apart from the current user. We show a different color depending on who is speaking: when the AI speaks, we show a blue gradient; when the current user speaks, we use a red one instead.

This is represented by the enum SpeakerState:

kotlin
enum class SpeakerState {
    AI_SPEAKING, USER_SPEAKING, IDLE
}

fun SpeakerState.gradientColors(): List<Color> {
    return when (this) {
        SpeakerState.USER_SPEAKING -> listOf(
            Color.Red,
            Color.Red.copy(alpha = 0f)
        )
        else -> listOf(
            Color(0f, 0.976f, 1f),
            Color(0f, 0.227f, 1f, 0f)
        )
    }
}

The audioLevels array is the data source for these visualizations. It’s filled with the values of each participant’s audioLevels array, which are available in each participant’s state.

From these values we compute the current amplitude, which we pass to the glow view.
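As a quick sanity check of the two helpers above, here is a small, hypothetical example of how a list of audio levels turns into a single amplitude:

kotlin
fun main() {
    // Hypothetical audio levels reported for one participant
    val levels = listOf(0.2f, 0.4f, 0.8f)

    // normalizePeak divides every value by the peak (0.8) -> [0.25, 0.5, 1.0]
    println(normalizePeak(levels))

    // computeSingleAmplitude averages the normalized values -> roughly 0.58
    println(computeSingleAmplitude(levels))
}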

Step 6.2 - GlowView

Let’s implement the GlowView next. This view will contain three glow layers, which will be animated with a subtle rotating animation.

For each layer, we will define the minimum and maximum values of the radius size, brightness, blurring, opacity, and wave length.

Feel free to adjust these values to customize the animation.

kotlin
@Composable
fun GlowView(
    amplitude: Float,
    speakerState: SpeakerState = SpeakerState.AI_SPEAKING
) {
    // Animated amplitude for smooth transitions
    val animatedAmplitude = remember { Animatable(0f) }

    // Update animated amplitude when input amplitude changes
    LaunchedEffect(amplitude) {
        animatedAmplitude.animateTo(
            targetValue = amplitude,
            animationSpec = tween(600, easing = EaseInOut)
        )
    }

    // Continuous time value for wave animation
    var time by remember { mutableStateOf(0f) }

    // Rotation animation
    val infiniteRotation = rememberInfiniteTransition()
    val rotationAngle by infiniteRotation.animateFloat(
        initialValue = 0f,
        targetValue = 360f,
        animationSpec = infiniteRepeatable(
            animation = tween(10000, easing = LinearEasing),
            repeatMode = RepeatMode.Restart
        )
    )

    // Time progression for wave effect
    LaunchedEffect(Unit) {
        while (true) {
            time = (time + 0.005f) % 1.0f
            delay(16L)
        }
    }

    val gradientColors = speakerState.gradientColors()

    Box(
        modifier = Modifier
            .fillMaxSize()
            .background(Color.Black)
    ) {
        Canvas(modifier = Modifier.fillMaxSize()) {
            val center = Offset(size.width / 2, size.height / 2)

            // Outer Layer
            drawGlowLayer(
                center = center,
                baseRadiusMin = size.minDimension * 0.30f,
                baseRadiusMax = size.minDimension * 0.50f,
                blurRadius = 60f,
                baseOpacity = 0.35f,
                scaleRange = 0.3f,
                waveRangeMin = 0.2f,
                waveRangeMax = 0.02f,
                time = time,
                amplitude = animatedAmplitude.value,
                rotationAngle = rotationAngle,
                gradientColors = gradientColors
            )

            // Middle Layer
            drawGlowLayer(
                center = center,
                baseRadiusMin = size.minDimension * 0.20f,
                baseRadiusMax = size.minDimension * 0.30f,
                blurRadius = 40f,
                baseOpacity = 0.55f,
                scaleRange = 0.3f,
                waveRangeMin = 0.15f,
                waveRangeMax = 0.03f,
                time = time,
                amplitude = animatedAmplitude.value,
                rotationAngle = rotationAngle,
                gradientColors = gradientColors
            )

            // Inner Core Layer
            drawGlowLayer(
                center = center,
                baseRadiusMin = size.minDimension * 0.10f,
                baseRadiusMax = size.minDimension * 0.20f,
                blurRadius = 20f,
                baseOpacity = 0.9f,
                scaleRange = 0.5f,
                waveRangeMin = 0.35f,
                waveRangeMax = 0.05f,
                time = time,
                amplitude = animatedAmplitude.value,
                rotationAngle = rotationAngle,
                gradientColors = gradientColors
            )
        }
    }
}

private fun androidx.compose.ui.graphics.drawscope.DrawScope.drawGlowLayer(
    center: Offset,
    baseRadiusMin: Float,
    baseRadiusMax: Float,
    blurRadius: Float,
    baseOpacity: Float,
    scaleRange: Float,
    waveRangeMin: Float,
    waveRangeMax: Float,
    time: Float,
    amplitude: Float,
    rotationAngle: Float,
    gradientColors: List<Color>
) {
    // Calculate the actual radius based on amplitude
    val baseRadius = lerp(baseRadiusMin, baseRadiusMax, amplitude)

    // Calculate wave range (inverse relationship to amplitude)
    val waveRange = lerp(waveRangeMax, waveRangeMin, 1 - amplitude)

    // Calculate the scale factors
    val shapeWaveSin = sin(2 * PI * time).toFloat()
    val shapeWaveCos = cos(2 * PI * time).toFloat()

    // Scale from amplitude
    val amplitudeScale = 1.0f + scaleRange * amplitude

    // Final x/y scale = amplitude scale + wave
    val xScale = (amplitudeScale + waveRange * shapeWaveSin)
    val yScale = (amplitudeScale + waveRange * shapeWaveCos)

    // Draw the oval with gradient
    drawIntoCanvas { canvas ->
        val paint = androidx.compose.ui.graphics.Paint().asFrameworkPaint().apply {
            shader = RadialGradient(
                center.x,
                center.y,
                baseRadius,
                intArrayOf(
                    gradientColors[0].copy(alpha = 0.9f).toArgb(),
                    gradientColors[1].toArgb()
                ),
                floatArrayOf(0f, 1f),
                Shader.TileMode.CLAMP
            )
            alpha = (baseOpacity * 255).toInt()
        }

        // Apply blur
        paint.maskFilter = BlurMaskFilter(blurRadius, BlurMaskFilter.Blur.NORMAL)

        // Save the current state, rotate, draw, and restore
        canvas.save()
        canvas.rotate(rotationAngle, center.x, center.y)

        // Draw the oval with calculated dimensions
        canvas.nativeCanvas.drawOval(
            center.x - baseRadius * xScale,
            center.y - baseRadius * yScale,
            center.x + baseRadius * xScale,
            center.y + baseRadius * yScale,
            paint
        )
        canvas.restore()
    }
}

fun getRandomFloatInRange(min: Float, max: Float): Float {
    return Random.nextFloat() * (max - min) + min
}

Now, you can run the app, talk to the AI, and see beautiful visualizations while the participants speak.

You can find the source code of the Node.js backend here, while the completed Android tutorial can be found on the following page.

Recap

In this tutorial, we have built an example app that lets you talk with an AI bot in an audio call.

We have shown how to use Stream’s wrapper, which handles the WebSocket communication with OpenAI’s Realtime API, to create an AI agent in a few simple steps.

Additionally, we have shown you how to join a call from an Android app and presented the audio levels in a nice visualization.

Both the video SDK for Android and the API have plenty more features available to support more advanced use cases.
