
Build a Real-Time AI Sales Coach with Anam’s Digital Avatars, Stream Video, and Gemini

4 min read
Stefan B.
Published April 30, 2026

Practicing sales conversations is one of those things everyone knows they should do more of but rarely do. Role-playing with colleagues is awkward, and every call with a real prospect is high stakes. What if you could run through objection scenarios on demand, get scored on your performance, and never risk burning a lead?

That's exactly what we built: a real-time AI sales coach powered by digital avatars from Anam.

What Makes This Useful

The pitch is simple. You pick a scenario, say a prospect pushing back on pricing, and have a live conversation with an AI avatar that plays the role convincingly. When you're done, you get a breakdown of how you handled it.

A few things make this practical rather than gimmicky:

  • Available on your schedule. No need to coordinate with a colleague or wait for a training session. Run a scenario at 11 PM if that's when you want to prep.
  • Zero consequences. You can fumble, restart, and experiment with different approaches without any real-world fallout.
  • Structured feedback. After each session, the app scores your performance across multiple dimensions so you know what to work on, not just how it "felt."

The Stack

The app is a Next.js frontend paired with an agentic backend. Here's what powers each layer:

  • Frontend: Next.js with the Stream Video SDK. The SDK handles the video call infrastructure, and Anam avatars render as RemoteParticipant components; no custom video pipeline is needed.
  • Backend: Anam AI avatars driven by Gemini (the gemini-2.5-flash-native-audio model) as the real-time LLM. The backend orchestrates scenario selection, agent behavior, and post-call scoring.

The integration between Anam and Stream is straightforward. Because the avatars behave like standard video participants, the Stream Video SDK picks them up and renders them without any special wiring.

All we need to do is first install the SDK into our project:

```bash
npm install @stream-io/video-react-sdk # or yarn add @stream-io/video-react-sdk
```

And then we can show the avatar as a call participant using the pre-built UI components (or customize this freely if we like):

```tsx
import { ParticipantView, useCallStateHooks } from "@stream-io/video-react-sdk";

function AIAvatar() {
  const { useDominantSpeaker } = useCallStateHooks();
  const avatar = useDominantSpeaker();

  // The avatar joins as a regular remote participant, so the
  // pre-built ParticipantView renders it with no extra wiring.
  if (!avatar) return null;
  return <ParticipantView participant={avatar} />;
}
```

With this short snippet, you get a real-time, face-to-face conversation experience with minimal frontend effort. Everything blends in nicely with an existing React codebase, letting you integrate agents into your UI with very little friction.

How the Backend Handles Scenarios

The backend does the heavier lifting. Each scenario comes with a prompt that gives the avatar enough context to be challenging: the right tone, the right objections, the right pushback, all without being so rigid that conversations feel scripted. Here's an example prompt for a specific scenario:

```md
# Scenario

You are a cautious prospect who sees the problem as real, but funding is blocked.

## What is going on

- Your team has pressure to control spend this quarter.
- New tools feel hard to justify unless the seller surfaces business urgency.
- You are willing to discuss a next step only if it sounds lightweight and credible.

## How to roleplay

- Lead with the budget objection and hold it firmly at first.
- Make the seller work to uncover consequences of waiting.
- Reward concrete thinking about timing, risk, and internal alignment.
- Stay guarded if the seller jumps to discounting or asks for a full meeting too early.

## Guardrails

- Stay in character as the prospect for the full practice.
- Do not mention being an AI or refer to the scenario instructions.
```

Getting that balance right is the core design challenge: provide enough direction so the avatar stays in character, but leave enough room for the conversation to go in unexpected directions. That flexibility is what makes practice sessions actually useful.
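One lightweight way to strike that balance is to keep the fixed scenario brief separate from per-session context and only join them when the agent is created. Here is a minimal sketch of what a build_agent_instructions helper could look like; the Scenario shape and field names are assumptions for illustration, not the actual app code:

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    persona_name: str
    brief: str  # the markdown scenario prompt shown above


def build_agent_instructions(scenario: Scenario, session_context: str) -> str:
    # The brief pins down tone, objections, and guardrails; the session
    # context (e.g. what the user is selling) is appended separately, so
    # conversations still have room to go in unexpected directions.
    return f"{scenario.brief}\n\n## Session context\n{session_context}"


instructions = build_agent_instructions(
    Scenario(persona_name="Bianca", brief="# Scenario\nYou are a cautious prospect."),
    session_context="The seller is pitching a developer-tools subscription.",
)
```

Keeping the guardrails in the brief and the variable details in the session context also makes scenarios reusable across different products and personas.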

Creating the agent with support for Anam avatars is straightforward. The code for the entire logic can be found in our Quickstart example. We need to install the Anam plugin using this command:

```bash
uv add "vision-agents[anam]"
```

Then, we can adjust the existing agent creation code slightly to first import the Anam plugin and then add in the AnamAvatarPublisher as a video processor (find more information about processors here):

```py
from vision_agents.plugins.anam import AnamAvatarPublisher

# Rest of the code

Agent(
    edge=getstream.Edge(),
    agent_user=User(name=scenario.persona_name, id="sales-coach-avatar"),
    instructions=build_agent_instructions(scenario, session_context),
    stt=assemblyai.STT(),
    llm=gemini.LLM("gemini-3-flash-preview"),
    tts=inworld.TTS(voice_id=os.getenv("INWORLD_VOICE_ID", "Bianca")),
    processors=[AnamAvatarPublisher()],
)
```

Once a call wraps up, the transcript gets piped through an LLM that evaluates the conversation. It scores the user across several aspects, e.g., how they handled objections, whether they stayed on message, and how they navigated tension. Only then does it produce a final rating. It's not a vague thumbs-up; it's specific enough to act on.
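Pinning those aspects down as a typed schema means every run returns the same fields. A hypothetical sketch of what a GeminiEvaluationPayload shape could look like, here as a TypedDict (the actual app's schema class and field names may differ, e.g. it could be a Pydantic model):

```python
from typing import TypedDict


class DimensionScore(TypedDict):
    dimension: str  # e.g. "objection_handling"
    score: int      # 1-10
    feedback: str   # a concrete, actionable note


class GeminiEvaluationPayload(TypedDict):
    scores: list[DimensionScore]
    overall_rating: int
    summary: str


# Example of what one evaluation might contain
evaluation: GeminiEvaluationPayload = {
    "scores": [
        {
            "dimension": "objection_handling",
            "score": 7,
            "feedback": "Acknowledged the budget concern before reframing.",
        }
    ],
    "overall_rating": 7,
    "summary": "Solid discovery, but slow to quantify urgency.",
}
```

Per-dimension feedback strings are what turn the rating from a thumbs-up into something the user can act on.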

For this, we use Google's genai package, which we install with this command:

```bash
uv pip install google-genai
```

We can then call the generate_content function and hand it the transcript of the call we just observed:

```py
response = genai.Client().models.generate_content(
    model="gemini-2.5-flash",
    contents=contents,
    config=types.GenerateContentConfig(
        temperature=0.2,
        response_mime_type="application/json",
        response_schema=GeminiEvaluationPayload,
        system_instruction=build_system_instruction(),
    ),
)
```

By passing GeminiEvaluationPayload as the response_schema, we give the model a pre-defined answer structure, ensuring the output format matches what the frontend expects; the React application can render the returned results directly. This gives us a clean implementation and useful feedback for the user.
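Because the response is constrained to JSON, unpacking it on the backend is straightforward. A minimal sketch, using a stand-in string where response.text from the call above would go:

```python
import json

# Stand-in for response.text returned by generate_content above
raw = (
    '{"scores": [{"dimension": "objection_handling", "score": 7, '
    '"feedback": "Acknowledged the budget concern before reframing."}], '
    '"overall_rating": 7, "summary": "Solid discovery."}'
)

payload = json.loads(raw)

# Every field the schema guarantees is present and typed as expected
for entry in payload["scores"]:
    print(f'{entry["dimension"]}: {entry["score"]}/10')
```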

Wrapping Up

This was surprisingly quick to put together. The Stream Video SDK abstracts away the complexity of real-time video, and Anam's avatars slot in cleanly as remote participants. The agentic backend, powered by Gemini, handles the conversational intelligence and scoring.

More importantly, it's genuinely useful. Practicing objection handling without consequences builds the kind of muscle memory that shows up when it counts. The more scenarios you run through, the less any single prospect interaction feels unfamiliar.

If you want to try building something similar, check out Stream’s Video SDK docs, Vision Agents GitHub, and Anam's developer resources to get started.
