
What Are the Best Platforms To Develop an AI Voice Chatbot?

Raymond F
Published April 22, 2026

"Hello, and thank you for calling Finsbury Bank. For English, press 1. Para español, oprima el dos." *beep* "Please say or enter your 16-digit account number, followed by the pound key." "Four, seven, two, …" "I'm sorry, I didn't catch that."

For about 20 years, that was voice AI. A decision tree, a radio voice, and a script you'd better get right.

In 2026, you can build an agent that understands what someone said, asks clarifying questions, and calls your APIs mid-conversation. The tooling is genuinely good now, and the "roll your own WebRTC pipeline" era is over.

What Are the Main Types of Voice Chatbot Platforms in 2026?

The landscape sorts into four tiers, from most-managed to most-raw.

Managed platforms like Vapi, Retell, Bland, Synthflow, and Voiceflow give you a dashboard, bundled telephony, and a working agent in a few hours. You configure the assistant in a web UI and trigger calls with a thin API:

```py
# Vapi: the pipeline is configured in the dashboard
vapi.calls.create(
    assistant_id="asst_xyz",
    customer={"number": "+15551234567"},
)
```

Most support bring-your-own-LLM at this point, though the level of control varies. The tradeoff is a black-box debugging story and configuration ceilings you'll hit at scale.

Unified voice agent services like ElevenAgents and Deepgram's Voice Agent API sit one layer deeper. Still a single API, but each stage of the pipeline is explicitly configurable:

```py
# Deepgram Voice Agent: one connection, configured providers
agent.configure({
    "listen": {"provider": "deepgram", "model": "nova-3"},
    "think": {"provider": "openai", "model": "gpt-4o"},
    "speak": {"provider": "deepgram", "model": "aura-2"},
})
```

You get real control over STT, LLM, and TTS without having to build the orchestration yourself. Deepgram's Voice Agent hit GA in November 2025.

SDK frameworks such as LiveKit Agents (Apache-2.0), Pipecat (BSD-2, maintained by Daily), and Vision Agents (Apache-2.0, maintained by Stream) are open-source orchestration layers. You compose the pipeline in code, with whatever providers you want, running wherever you want:

```py
# Pipecat: you build the pipeline yourself
pipeline = Pipeline([
    transport.input(),
    DeepgramSTTService(api_key=DG_KEY),
    OpenAILLMService(api_key=OAI_KEY, model="gpt-4o"),
    CartesiaTTSService(api_key=CART_KEY, voice_id=VOICE),
    transport.output(),
])
```

Vision Agents follows the same pattern but extends the pipeline to handle video frames natively - pairing STT/LLM/TTS with computer vision processors like YOLO and Roboflow, or passing video directly to a VLM:

```py
# Vision Agents: voice + video in one pipeline
agent = Agent(
    edge=getstream.Edge(),
    llm=gemini.Realtime(fps=3),
    processors=[ultralytics.YOLOPoseProcessor(model_path="yolo11n-pose.pt")],
)
```

Any STT, any LLM, any TTS, any transport. All three frameworks matured enormously in 2025 and 2026, and they're what most teams move to once they outgrow a managed platform.

Speech-to-speech APIs are the newest category: OpenAI's gpt-realtime, Google's Gemini 2.5 and 3.1 Flash Live, and open models like Kyutai's Moshi and Sesame's CSM-1B. A single multimodal model ingests raw audio and emits raw audio, skipping STT and TTS entirely:

```py
# OpenAI Realtime: audio in, audio out, one model
async with openai.realtime.connect(model="gpt-realtime") as conn:
    await conn.send_audio(mic_chunk)
    async for event in conn:
        if event.type == "response.audio.delta":
            play(event.audio)
```

Fast and emotionally expressive, but you're locked to the provider's model and its handful of built-in voices.

Should I Use a Cascaded Pipeline or a Speech-To-Speech Model?

A cascaded pipeline chains voice activity detection, then speech-to-text, then your LLM, then text-to-speech. Speech-to-speech collapses all of that into one model that takes in audio and produces audio.
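The cascaded flow can be sketched as a simple per-turn loop. Here `transcribe`, `respond`, and `synthesize` are hypothetical stand-ins for your STT, LLM, and TTS providers, just to show the data flow:

```py
# Minimal cascaded turn loop (hypothetical provider functions).
def run_turn(audio_chunk, transcribe, respond, synthesize):
    text = transcribe(audio_chunk)   # STT: audio -> text
    if not text:
        return None                  # endpointing found no speech
    reply = respond(text)            # LLM: text -> text
    return synthesize(reply)         # TTS: text -> audio

# Toy providers to demonstrate the flow end to end:
audio_out = run_turn(
    b"...pcm...",
    transcribe=lambda a: "what's my balance?",
    respond=lambda t: f"You asked: {t}",
    synthesize=lambda t: t.encode(),
)
```

In a real agent, each stage streams into the next rather than running sequentially, which is exactly the orchestration work the SDK frameworks above handle for you.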

The consensus is that cascaded is still the production default. Most production voice agents today still run GPT-4o or Gemini 2.5 Flash, which are 18-month-old models, because the intelligence-to-latency sweet spot hasn't meaningfully shifted, and switching models in a voice pipeline is uniquely painful. That inertia applies even more to speech-to-speech, which locks you to a single vendor's entire stack at once.

The reasons cascaded still wins in production:

  • You get a text transcript at every stage, which compliance reviews require.
  • Function-calling reliability remains high during long sessions. Speech-to-speech degrades noticeably past about 20 turns.
  • Cost is several times lower. A cascaded pipeline runs roughly $0.05 to $0.15 per minute all-in, while OpenAI's gpt-realtime lands somewhere between $0.15 and $0.60 per minute, depending on how big your system prompt is. S2S APIs also re-bill the growing audio context each turn, though OpenAI's cached-input pricing has softened that somewhat.
  • You can pick from thousands of voices in ElevenLabs' library, rather than the 10 to 30 built-in voices a speech-to-speech model offers.
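To see how those per-minute rates compound, here's a rough back-of-envelope calculation using the figures cited above (illustrative only; actual pricing varies by provider, prompt size, and caching):

```py
# Rough monthly cost comparison at the per-minute rates cited above.
minutes_per_month = 10_000

cascaded_low, cascaded_high = 0.05, 0.15  # $/min, cascaded pipeline all-in
s2s_low, s2s_high = 0.15, 0.60            # $/min, gpt-realtime range

cascaded = (cascaded_low * minutes_per_month, cascaded_high * minutes_per_month)
s2s = (s2s_low * minutes_per_month, s2s_high * minutes_per_month)

print(f"Cascaded: ${cascaded[0]:,.0f}-${cascaded[1]:,.0f}/month")
print(f"S2S:      ${s2s[0]:,.0f}-${s2s[1]:,.0f}/month")
```

At 10K minutes a month, the gap between the two pricing bands is the difference between hundreds and thousands of dollars, which is why cost shows up so often as the migration trigger.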

Speech-to-speech wins cleanly in narrower cases: companion apps, coaching, greeters, anywhere emotional prosody matters more than audit trails or cost. A common hybrid pattern is to use a speech-to-speech model for the opening acknowledgment, then hand off to a cascaded pipeline for the transactional turns.
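The hybrid pattern amounts to a simple dispatcher: the first turn goes to a speech-to-speech model for a fast, expressive acknowledgment, and later turns go through the cascaded pipeline. Both handlers below are hypothetical stand-ins:

```py
# Hybrid routing sketch: S2S for the opening turn, cascaded after that.
def make_hybrid_agent(s2s_respond, cascaded_respond):
    turn = 0
    def respond(audio):
        nonlocal turn
        turn += 1
        return s2s_respond(audio) if turn == 1 else cascaded_respond(audio)
    return respond

agent = make_hybrid_agent(
    s2s_respond=lambda a: "s2s:ack",          # expressive greeting
    cascaded_respond=lambda a: "cascaded:reply",  # auditable transactions
)
```

In practice the handoff is a transport-level switch rather than a function call, but the routing decision is this simple.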


Managed Platform or SDK Framework?

Speed or control is the tradeoff.

| | Managed platform | SDK framework |
|---|---|---|
| Time to working agent | Hours | Days to weeks |
| Pipeline control | Configure what the vendor exposes | Any component, any provider |
| Debugging | Black-box, support tickets | Full visibility, your code |
| Ceiling | You'll hit one eventually | Your infra is the ceiling |
| Examples | Vapi, Retell, Bland, Synthflow | LiveKit Agents, Pipecat, Vision Agents |

Managed platforms like Vapi and Retell are the right call when you need to be live this week. Configure an assistant, point it at your LLM, attach a phone number, and done. You're working inside someone else's abstraction though, so edge cases hit configuration ceilings and debugging means support tickets instead of your own code.

SDK frameworks are the right call when voice is core to your product.

  • LiveKit Agents bundles a WebRTC media server, a production SIP stack, and client SDKs for every major platform, including ESP32.
  • Pipecat, maintained by Daily, is more vendor-neutral with 100+ integrations, the best embedded story, and the widest LLM support, including local Ollama.
  • Vision Agents, maintained by Stream, is the pick when your agent needs to see as well as hear. It adds first-class video processing - VLMs, YOLO, Roboflow, pose detection - to a standard voice pipeline, and runs on Stream's global WebRTC edge network with sub-500ms latency out of the box.

All are open source and free to run. Hamming's analysis of 4M+ production calls shows about half of managed-platform teams migrate to an SDK framework within 12 months of crossing ~10K minutes/month.

One thing that gets missed: if your agent lives inside a web or mobile app rather than over PSTN, the real-time transport layer is its own decision, separate from the orchestrator. Audio has to reach your agent and back with sub-second reliability under packet loss and jitter.

LiveKit, Daily, Agora, Twilio, and Stream all offer WebRTC SDKs for this - Stream's edge network is also the transport layer Vision Agents runs on, so teams using that framework get it bundled.

Which STT and TTS Should I Pair With My Orchestrator?

Whichever scores best on your traffic. Vendor benchmarks are measured on clean, curated audio that sounds nothing like your users on a bad cell connection. That said, here's how to narrow the shortlist before you start benchmarking.

STT providers sort into three camps:

  • Voice-AI specialists (Deepgram, AssemblyAI). Purpose-built for real-time streaming with fast partial results, tuned on conversational audio, good at dealing with crosstalk and overlapping speech. The default starting point for most voice agents.
  • Generalist labs (OpenAI gpt-4o-transcribe, Google Chirp). Strong multilingual accuracy, but less optimized for the specific latency profile voice agents need. OpenAI's streaming STT is only available through the Realtime API.
  • Multilingual-first (Gladia, Speechmatics, Soniox). The pick if you're genuinely global (60-100+ languages) or need code-switching within a single utterance.

TTS providers sort similarly:

  • Voice-AI specialists (ElevenLabs Flash v2.5, Cartesia Sonic-3). The default picks. ElevenLabs has the biggest voice library and the most natural English; Cartesia is the latency and quality leader and ships emotion tags.
  • Unified-stack players (Deepgram Aura-2, OpenAI gpt-4o-mini-tts). Worth considering if you want a single vendor for STT, TTS, and LLM, and you don't need voice cloning.
  • Value challengers (Inworld, Rime, Hume). Significantly cheaper than ElevenLabs with competitive quality on specific use cases. Worth benchmarking on your exact voices.

The criteria that actually determine your pick:

  • Time-to-first-byte over total latency. A slow first byte kills perceived responsiveness, even if the full utterance is generated in under a second.
  • Pronunciation on your domain vocabulary. Brand names, SKU codes, medical or legal terms. This is where generic benchmarks mislead you.
  • Voice stability across sessions. Does the same voice sound the same on Monday and Friday?
  • Vendor survivability. The voice AI market is still churning. PlayHT was acquired by Meta and shut down in December 2025, and older comparison articles still recommend it. Pick vendors with runway.
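Time-to-first-byte is straightforward to measure once you have a streaming client. This sketch times the gap between the request and the first audio chunk; `fake_tts_stream` is a stand-in generator, not a real provider API:

```py
import time

def fake_tts_stream(text):
    """Stand-in for a streaming TTS response; yields audio chunks."""
    time.sleep(0.05)   # simulated time-to-first-byte
    yield b"chunk-1"
    time.sleep(0.02)
    yield b"chunk-2"

def measure_ttfb(stream):
    start = time.monotonic()
    first = next(stream)                 # block until the first chunk lands
    return time.monotonic() - start, first

ttfb, chunk = measure_ttfb(fake_tts_stream("Hello!"))
print(f"TTFB: {ttfb * 1000:.0f} ms")
```

Run the same harness against each vendor on your shortlist, with your actual prompts and voices, and compare the first-chunk latency rather than the total synthesis time.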

Two gotchas worth knowing. Whisper, in any hosted form, including Groq's fast version, is not true streaming; it transcribes in chunks and is fine for post-call but bad for real-time. And ElevenLabs v3 is content-generation only, not real-time yet; for voice agents on ElevenLabs, use Flash v2.5.

What Makes a Voice Chatbot Feel Natural (Or Not)?

Voice AI's uncanny valley is about timing, not accuracy. The failures that matter in production are responses that land one beat off, interruptions at the wrong moment, or the missing "mm-hmm" a person would have made there. Benchmarks don't catch most of this.

The specific problems to watch for:

  • Turn detection. Silence-based endpointing creates perpetual lag and still cuts users off when they pause to think. Modern semantic turn-detection models are a night-and-day improvement over even models a year old.
  • Interruption handling. The agent has to stop the moment the user starts, without treating every cough or “yeah, okay” as an interruption. Adaptive models reject about half of the false positives a naive VAD catches.
  • Function-calling degradation. Even frontier models reliably call tools on turn 3 but fail on turn 20. Production voice calls routinely run 30+ turns.
  • Backchanneling. The “uh-huh”s humans produce without thinking. Attempts to add it as a system feature have failed catastrophically; most production agents skip it.
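To make the turn-detection problem concrete, here's what naive silence-based endpointing looks like: it fires after N consecutive silent frames, which is exactly why it both adds lag (you always wait N frames) and cuts off thinking pauses. The frame values are hypothetical energy levels:

```py
def silence_endpoint(frames, threshold=0.1, silent_frames_needed=3):
    """Return the frame index where the turn is declared over,
    or None if the user never stopped. Naive: pure silence counting."""
    silence_run = 0
    for i, energy in enumerate(frames):
        silence_run = silence_run + 1 if energy < threshold else 0
        if silence_run >= silent_frames_needed:
            return i  # fires here: lag = silent_frames_needed frames
    return None

# Speech, then a pause to think, then more speech: the naive
# detector fires mid-utterance at the pause and cuts the user off.
frames = [0.8, 0.7, 0.02, 0.03, 0.01, 0.6, 0.9]
print(silence_endpoint(frames))
```

Semantic turn detectors replace the silence counter with a model that looks at the partial transcript, so a mid-sentence pause ("my account number is... hang on") doesn't trigger the endpoint.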

Voice is a brand choice in a way chat isn't. The specific voice, its speaking rate, and its breathing cadence all become part of your product identity within the first 10 seconds of a call. Budget time for picking and iterating on the voice, not just the TTS provider.

What Should I Actually Build a Voice Chatbot With?

The open-source orchestration layer has genuinely won. A technical team building a voice-only agent should default to LiveKit Agents or Pipecat, paired with Deepgram Nova-3 for speech-to-text and Cartesia Sonic-3 or ElevenLabs Flash v2.5 for text-to-speech, with whatever LLM is winning the week behind it.

If video is in scope - agents that see the user via camera, analyze live feeds, or run computer vision alongside voice - Vision Agents is the framework built for that problem.

Managed platforms like Vapi and Retell are still the right call if you need something live this week or if voice is secondary to your core product. Speech-to-speech APIs are worth watching, but not yet the default. And the transport layer between your users and your agent, whether that's WebRTC for in-app or SIP for telephony, is worth treating as its own decision rather than a detail you let the orchestrator pick for you.
