TLDR: Agents these days are blind and not very engaging, so we decided to team up with Anam and Inworld to build an agent using Vision Agents that feels personal and aware of the world around you.
Most voice agents today are blind. They hear words, convert them to text, run that text through an LLM, and read the response back in a flat, even tone. It doesn't matter if you're laughing, frustrated, or about to cry. The voice stays the same.
And visually, most of them still feel like software. You talk to a glowing orb, a waveform, or a pulsing circle sitting in the middle of the screen. There's no eye contact, no facial expression, no sense that something is actually present with you in the conversation.
That gap between "can talk" and "can actually connect" is the next thing worth closing. This kind of emotionally-aware agent has implications far beyond a demo conversation.
The same architecture can power interview coaches that adapt to nervous candidates in real time, tutors that notice confusion before a student asks for help, support agents that respond differently when frustration builds, and companion experiences that feel genuinely present instead of transactional.
We built a demo that closes it.
It's an open-source conversational agent that watches your face and hears what you’re saying in real time, classifies your emotion, gaze, and engagement from the video feed, and uses all of that to shape not just what it says but how it says it.
The voice softens when you look sad.
The pacing picks up when you're excited.
If you drift off-camera for a while, it gently re-engages instead of sitting there in silence.
The stack: Vision Agents for orchestration, Inworld TTS-2 for expressive voice, Anam for a lip-synced avatar, MediaPipe for face tracking, Gemini for the LLM, and Deepgram for speech-to-text. All running in real time over Stream's edge network. Here's how it works.
The Stack
Before we go deep on any single piece, here's the full agent setup. Everything wires together in create_agent:
```python
from vision_agents.core import Agent, User
from vision_agents.plugins import deepgram, gemini, getstream, inworld
from vision_agents.plugins.anam import AnamAvatarPublisher


async def create_agent(**kwargs) -> Agent:
    face_processor = MediaPipeFaceProcessor(
        model_path="models/face_landmarker.task",
        fps=8.0,
    )
    avatar = AnamAvatarPublisher()

    agent = FacialAwareAgent(
        edge=getstream.Edge(),
        agent_user=User(name="Sarah", id="agent"),
        instructions=load_instructions(),
        tts=inworld.TTS(model_id="inworld-tts-2", voice_id="Sarah"),
        stt=deepgram.STT(),
        llm=gemini.LLM(model="gemini-3.1-flash-lite-preview"),
        processors=[face_processor, avatar],
        face_processor=face_processor,
    )
    return agent
```
That's one agent, seven capabilities. STT transcribes the user. The LLM reasons over the transcript plus the facial-state context. Inworld Realtime TTS-2 renders the response with expressive steering written the way you'd direct an actor: [act like you just got home from a long day, tired but warm]. MediaPipe watches the user's face at 8 fps. Anam renders a lip-synced avatar. Stream handles the real-time video and audio transport.
Each piece is a plugin. Swap Gemini for OpenAI, swap Deepgram for Sarvam, and change the avatar. Vision Agents doesn't care. It's designed to be composable.
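As a rough sketch of what a swap looks like in practice, only the constructor arguments change. (The `openai` plugin module and model name below are assumptions for illustration, not part of this project; check the Vision Agents plugin catalog for what's actually available.)

```python
# Illustrative swap: replace the Gemini LLM with an OpenAI one by changing a
# single plugin. The openai module and model name are assumptions; everything
# else in the agent definition stays exactly as in create_agent above.
from vision_agents.plugins import deepgram, getstream, inworld, openai

agent = FacialAwareAgent(
    edge=getstream.Edge(),
    agent_user=User(name="Sarah", id="agent"),
    instructions=load_instructions(),
    tts=inworld.TTS(model_id="inworld-tts-2", voice_id="Sarah"),
    stt=deepgram.STT(),
    llm=openai.LLM(model="gpt-4o-mini"),  # was gemini.LLM(...)
    processors=[face_processor, avatar],
    face_processor=face_processor,
)
```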
Making the Voice Feel Alive: Inworld Realtime TTS-2
The biggest difference between this agent and a standard voice bot is what happens after the LLM produces text. Instead of piping that text through a flat TTS model, it goes through Inworld's Realtime TTS-2, which supports something called natural-language steering in 100+ languages.
Steering works like a director's note. You write bracketed instructions in natural language, and the model adjusts its delivery accordingly. Not from a dropdown of five emotions. In natural language, as descriptive as you want:
[say sadly with deliberate pauses in a low voice and hushed style]
yeah. yeah, okay — that's a lot.
[say warmly with light, easy energy]
Hey — your friendly neighborhood crashout buddy, reporting in. How's the day been?
[say with quiet curiosity, gently rising at the end]
What's coming up for you?
The richer the direction, the better the output. A bare [sad] gives the model one dimension. A fuller note that layers mood, rhythm, pitch, and vocal style produces something you'd believe came from a person on a video call.
TTS-2 also supports non-verbal sounds inline. [laugh], [sigh], [breathe], [cough] render as actual audio, not spoken words. You can place them anywhere in the text:
wait — you actually [laugh] said that to her?
And emphasis through capitalization. AbsoLUTEly. I told you NOT to do that. The engine handles stress placement without you needing phoneme-level control.
What makes this relevant to this project specifically: the LLM doesn't pick these steering tags randomly. It picks them based on what it knows about the user's current emotional state, which comes from the face tracker. That's the loop.
Giving the Agent Eyes: MediaPipe as a Vision Agents Processor
Here's where Vision Agents earns its name.
Processors are a core concept in the framework. They're lightweight components that run alongside the LLM, each processing video or audio frames at their own rate, independently. You can run YOLO at 20 fps for object detection, a depth model at 15 fps, and MediaPipe at 8 fps in the same agent. They don't block each other. You don't write threading code. The framework handles frame distribution.
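In configuration terms that's just a longer processors list. Here's a sketch; the detector classes are placeholders, and only MediaPipeFaceProcessor is part of this project:

```python
# Sketch: several perception processors in one agent, each sampling the shared
# video feed at its own rate. YoloDetector and DepthEstimator are placeholders
# for whatever plugins you'd actually use.
processors = [
    YoloDetector(fps=20.0),        # placeholder: object detection at 20 fps
    DepthEstimator(fps=15.0),      # placeholder: depth model at 15 fps
    MediaPipeFaceProcessor(model_path="models/face_landmarker.task", fps=8.0),
]
# Pass the list into the agent exactly as in create_agent above:
# FacialAwareAgent(..., processors=processors)
```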
In this project, the face tracker is a processor. It extends VideoProcessor from Vision Agents, hooks into the shared video forwarder, and runs MediaPipe's FaceLandmarker on every frame it receives:
```python
import asyncio

from vision_agents.core.processors import VideoProcessor
from vision_agents.core.utils.video_forwarder import VideoForwarder


class MediaPipeFaceProcessor(VideoProcessor):
    name = "mediapipe_face"

    def __init__(self, model_path: str, fps: float = 8.0):
        super().__init__()
        self._model_path = model_path
        self._fps = fps
        self._smoothed = SmoothedBlendshapes(alpha=0.3)
        self._emotion_classifier = EmotionClassifier()
        self._gaze_classifier = GazeClassifier()
        self.current_state = FacialState()
        self._landmarker = None  # built lazily when the first video track arrives
        self._forwarder = None

    async def process_video(self, track, participant_id, shared_forwarder=None):
        # Build the MediaPipe FaceLandmarker off the event loop the first time.
        if self._landmarker is None:
            self._landmarker = await asyncio.to_thread(self._build_landmarker)
        # Subscribe to the shared forwarder at this processor's own frame rate.
        self._forwarder = shared_forwarder
        self._forwarder.add_frame_handler(
            self._on_frame, fps=self._fps, name="mediapipe_face"
        )
```
Each frame goes through a pipeline: convert to RGB, run FaceLandmarker, extract 52 blendshape coefficients, then classify into coarse labels.
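A simplified sketch of that per-frame path is below. The smoother and classifiers come from the constructor above; the frame conversion assumes a PyAV-style frame, and `_update_state` plus `_engagement_from` are hypothetical helpers standing in for the repo's exact methods.

```python
import mediapipe as mp

# Simplified sketch of the per-frame pipeline; helper names are assumptions.
def _on_frame(self, frame) -> None:
    # 1. Convert the incoming video frame to RGB for MediaPipe.
    rgb = frame.to_ndarray(format="rgb24")
    mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb)

    # 2. Run the FaceLandmarker.
    result = self._landmarker.detect(mp_image)
    if not result.face_blendshapes:
        self._update_state(FacialState(face_present=False))
        return

    # 3. Extract the 52 blendshape coefficients and smooth them.
    scores = {b.category_name: b.score for b in result.face_blendshapes[0]}
    scores = self._smoothed.update(scores)

    # 4. Classify into the coarse labels the LLM can act on.
    emotion = self._emotion_classifier.classify(scores)
    gaze = self._gaze_classifier.classify(scores, result.facial_transformation_matrixes)
    self._update_state(
        FacialState(emotion=emotion, gaze=gaze,
                    engagement=self._engagement_from(gaze), face_present=True)
    )
```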
The classification is simple on purpose. We don't need fine-grained emotion recognition. We need labels the LLM can act on. The output is a FacialState with four fields:
```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class FacialState:
    emotion: Literal["neutral", "happy", "sad", "surprised", "thoughtful"] = "neutral"
    gaze: Literal["at_camera", "off_left", "off_right", "up", "down", "absent"] = "absent"
    engagement: Literal["engaged", "distracted", "absent"] = "absent"
    face_present: bool = False
```
Emotion classification uses blendshape thresholds. A smile is (mouthSmileLeft + mouthSmileRight) / 2 above 0.45. Surprise fires when browInnerUp or jawOpen crosses 0.55. Gaze combines head-pose yaw/pitch from the facial transformation matrix with eye-look blendshapes. Engagement is derived from gaze: if the user has been looking at the camera continuously for 3+ seconds, they're "engaged." Otherwise, "distracted."
The system uses two-tier thresholds with hysteresis to prevent flicker. A smile needs to cross 0.45 to enter "happy" but only needs to drop below 0.30 to leave it. A new emotion also has to persist for 4 consecutive frames (~0.5 seconds at 8 fps) before it commits. Without this, the label would toggle every few frames as blendshapes hovered near boundaries, and the LLM would swing its TTS steering tag from turn to turn.
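A minimal version of that two-tier threshold plus debounce, for the smile signal only, looks like this. It's an illustrative helper using the numbers above, not the repo's exact classifier:

```python
# Illustrative hysteresis + debounce for the "happy" label: enter above 0.45,
# exit below 0.30, and hold a new label for 4 consecutive frames before committing.
class EmotionDebouncer:
    ENTER, EXIT, HOLD_FRAMES = 0.45, 0.30, 4

    def __init__(self):
        self.current = "neutral"
        self._candidate = "neutral"
        self._streak = 0

    def update(self, scores: dict[str, float]) -> str:
        smile = (scores.get("mouthSmileLeft", 0.0) + scores.get("mouthSmileRight", 0.0)) / 2
        if self.current == "happy":
            raw = "happy" if smile > self.EXIT else "neutral"   # only leave below 0.30
        else:
            raw = "happy" if smile > self.ENTER else "neutral"  # only enter above 0.45

        if raw == self._candidate:
            self._streak += 1
        else:
            self._candidate, self._streak = raw, 1

        # Commit only after the candidate label has persisted long enough.
        if self._candidate != self.current and self._streak >= self.HOLD_FRAMES:
            self.current = self._candidate
        return self.current
```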
When the state changes, the processor emits a FacialStateChangedEvent into Vision Agents' event system. The agent subscribes to it and uses it to drive proactive behavior.
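In sketch form, the state update compares against the last committed state and only emits on change. The event dataclass mirrors what the project describes; the `self._emit(...)` call is a stand-in for whatever the framework's event bus actually exposes.

```python
# Sketch of change-detection before emitting. FacialStateChangedEvent is this
# demo's own event; self._emit(...) is a placeholder, not a Vision Agents API.
@dataclass
class FacialStateChangedEvent:
    old_state: FacialState
    new_state: FacialState

def _update_state(self, new_state: FacialState) -> None:
    old_state = self.current_state
    if new_state == old_state:
        return  # nothing changed, stay quiet
    self.current_state = new_state
    self._emit(FacialStateChangedEvent(old_state, new_state))
```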
Closing the Loop: Face → LLM → Voice
This is where everything connects. The FacialAwareAgent is a subclass of Vision Agents' Agent that overrides simple_response to inject facial context into every LLM turn:
```python
class FacialAwareAgent(Agent):
    async def simple_response(self, text, participant=None):
        # System-originated turns (opening line, proactive nudges) skip the state tag.
        is_system_turn = text.startswith("[opening") or text.startswith("[proactive")
        if not is_system_turn:
            state = self._face_processor.current_state.to_prompt_string()
            if state:
                text = f"[user state: {state}] {text}"
        await super().simple_response(text, participant)
```
When the user says "my day was rough," the LLM doesn't just see that transcript. It sees:
[user state: sad, looking down] my day was rough
The system prompt teaches the LLM how to respond to each state. If the user looks sad and is looking down, the model picks a delivery tag like [say sadly with deliberate pauses in a low tone] and responds with something short and warm. If they're happy and looking at the camera, it matches their energy with [say playfully with a bright, warm tone].
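That mapping lives in the system prompt. A paraphrased excerpt, illustrative rather than the verbatim text returned by load_instructions():

```python
# Paraphrased excerpt of the system prompt; illustrative only.
INSTRUCTIONS_EXCERPT = """
You may see a [user state: ...] tag at the start of the user's message.
Use it to choose a delivery tag for your reply. Never narrate what the camera sees.

- sad, looking down        -> [say sadly with deliberate pauses in a low tone]; keep it short and warm.
- happy, looking at camera -> [say playfully with a bright, warm tone]; match their energy.
- seems distracted         -> slow down, ask one light question.
- user is not visible      -> leave space; one gentle sentence at most.
"""
```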
The FacialState converts itself to a natural-language prompt fragment:
```python
def to_prompt_string(self) -> Optional[str]:
    if not self.face_present:
        return "user is not visible on camera"
    bits = []
    if self.emotion != "neutral":
        bits.append(self.emotion)
    if self.gaze == "at_camera":
        bits.append("looking at camera")
    elif self.gaze in ("off_left", "off_right"):
        bits.append("looking off to the side")
    if self.engagement == "distracted":
        bits.append("seems distracted")
    elif self.engagement == "engaged" and self.emotion == "neutral":
        bits.append("attentive")
    return ", ".join(bits) if bits else None
```
Proactive re-engagement
The agent doesn't just react to what you say. It reacts to what you don't say.
If you've been distracted or off-camera for 5 seconds and the agent isn't currently speaking, it fires a proactive nudge. The nudge includes a live snapshot of the facial state so the LLM can craft something contextual instead of a generic "you still there?":
```python
def _build_proactive_cue(state: FacialState) -> str:
    if not state.face_present:
        return (
            "[proactive cue: the user has been off-camera for a while. "
            "Gently open the door for them to come back in one short sentence.]"
        )
    bits = []
    if state.emotion != "neutral":
        bits.append(state.emotion)
    if state.gaze in ("off_left", "off_right"):
        bits.append("looking off to the side")
    snapshot = ", ".join(bits) if bits else "drifting from the camera"
    return f"[proactive cue: the user is currently {snapshot} and hasn't spoken for a while. Gently re-engage in one short sentence.]"
```
If you're looking down and seem sad, the agent might say [very quiet, gentle pace] Take your time. No rush. If you're looking off to the side, it might try [say with quiet curiosity] Something catching your eye? The key constraint: the agent never says "I notice you looking away." That feels like surveillance. The camera signal is guidance for the model, not narration it repeats.
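A stripped-down version of the watcher that triggers those nudges might look like this. The 5-second threshold comes from above; attribute names like `_last_activity` and `_is_speaking` are assumptions for the sketch, standing in for however the real agent tracks turns and playback.

```python
import asyncio
import time

# Sketch of the proactive nudge loop; bookkeeping attributes are assumed.
IDLE_SECONDS = 5.0

async def _watch_engagement(self) -> None:
    while True:
        await asyncio.sleep(1.0)
        state = self._face_processor.current_state
        idle_for = time.monotonic() - self._last_activity
        disengaged = (not state.face_present) or state.engagement == "distracted"

        if disengaged and idle_for >= IDLE_SECONDS and not self._is_speaking:
            cue = _build_proactive_cue(state)   # snapshot-aware nudge
            await self.simple_response(cue)     # same path as a normal turn
            self._last_activity = time.monotonic()  # don't nudge again immediately
```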
Giving the Agent a Face: Anam
A voice agent that can see you and adapt is already a step beyond what most people have experienced. But there's still a disconnect when you're talking to a disembodied voice.
Anam closes that gap. It provides photorealistic, lip-synced avatars that render in real time, powered by their CARA model. The avatar doesn't loop a pre-recorded video or do basic mouth dubbing. It generates natural motion, micro-expressions, and synchronized lip movement from the TTS audio output.
In Vision Agents, adding an avatar is adding a processor:
```python
from vision_agents.plugins.anam import AnamAvatarPublisher

avatar = AnamAvatarPublisher()

agent = FacialAwareAgent(
    # ... other config
    processors=[face_processor, avatar],
)
```
That's it. The avatar receives the TTS-2 audio stream and renders a synchronized video track that gets published back into the call. The user sees a face that moves with the words, reacts with appropriate expressions, and makes the interaction feel like a video call with a person, not a chat window with a speaker attached.
This matters more than it might seem at first. Research consistently shows that people engage more with faces. Anam cites a 70% user preference for its avatars over regular voice agents, and a 44% increase in engagement when agents have a visual presence. For use cases like coaching, education, or companionship, that visual layer is the difference between something people try once and something they come back to.
Where This Goes
This demo is a specific implementation: a "crashout buddy" character that lets you vent about your day. But the underlying pattern is general. An agent that sees, listens, understands emotional context, speaks expressively, and has a face can be adapted for a lot of things:
Sales coaching. An avatar that watches your pitch delivery, reads your facial expressions, and gives feedback that matches your emotional state. If you're nervous, it doesn't hammer you with corrections. It softens the delivery and focuses on what you did well first.
Recruitment and mock interviews. Emotionally-aware interview practice where the agent adapts its tone when candidates are nervous, asks follow-ups that feel natural, and provides feedback grounded in what it actually observed during the session.
Companion apps. Agents that feel present because they respond to what they see, not just what they hear. The difference between a chatbot that says "how are you?" and one that notices you look tired before you say anything.
Education and tutoring. A tutor that catches the furrowed brow and the gaze drift before the student says "I'm confused." It adjusts the explanation, slows down, tries a different angle, all without the student needing to ask.
Customer support. An agent that hears a frustrated caller and actually responds differently. Not because someone scripted a "frustrated customer" pathway, but because the model registered the shift in real time.
The pattern is the same in every case: take the user's video feed, run a lightweight perception model on it, inject the results as context for the LLM, and let an expressive voice model render the response with appropriate delivery. Vision Agents makes the orchestration simple. Inworld Realtime TTS-2 makes the voice believable. Anam makes the agent visible.
The Bigger Picture
The next phase of AI isn't just models that handle text, or even voice. It's models that see and interact with the physical world through video. Most frameworks started with voice and bolted video on later. Vision Agents started with video as a first-class primitive.
This demo is one example of what that enables. Face tracking is one processor. You could run pose detection, object recognition, scene understanding, or any other perception model in the same agent, at the same time, each at their own frame rate.
The whole project is open-source. Clone the repo, plug in your API keys, and run it.
Links:
- Vision Agents — the framework
- Inworld TTS-2 — expressive voice
- Anam — real-time avatars
- GitHub repo — the code
