Building a voice agent that feels responsive is hard. Users expect conversational AI to respond instantly, but the realities of LLM processing, tool execution, and text-to-speech synthesis introduce unavoidable latency.
The result? An awkward 3-second silence that makes your voice agent feel broken.
Speculative tool calling is the architectural pattern that solves this problem.
Why Does My Voice Agent Have Awkward Silences?
In a standard voice loop, latency stacks serially:
| Stage | What Happens | Typical Latency |
|---|---|---|
| ASR | Transcribe user speech | 300ms |
| LLM (decision) | Model decides to call a tool | 200-1000ms |
| Tool Execution | API call runs | 100-2000ms |
| LLM (response) | Model processes result | 300ms |
| TTS | Convert text to speech | 250-300ms |
These steps happen one after another. When a user asks "What's the weather in Boulder, Colorado?", they wait in silence while your system transcribes, thinks, calls an API, thinks again, and finally speaks. That's seconds of dead air.
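In code, the naive version of this loop awaits each stage before starting the next. The sketch below is illustrative only; `transcribe`, `call_llm`, `run_tool`, and `synthesize_speech` are hypothetical placeholders for your ASR, LLM, tool, and TTS clients.

```python
async def naive_voice_loop(user_audio):
    # Placeholder helpers stand in for real ASR/LLM/tool/TTS clients.
    # Every stage blocks the next, so the latencies in the table add up end to end.
    text = await transcribe(user_audio)                     # ASR: ~300ms
    decision = await call_llm(text)                         # LLM decides on a tool: 200-1000ms
    tool_result = await run_tool(decision)                  # API call: 100-2000ms
    reply = await call_llm(text, tool_result=tool_result)   # LLM processes result: ~300ms
    return await synthesize_speech(reply)                   # TTS: 250-300ms; the user hears nothing until here
```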
The core insight: users perceive latency only when silence occurs. If you fill the processing gap with speech, users don't notice the wait.
What is Speculative Tool Calling?
Speculative tool calling breaks the serial chain by running processes in parallel and "optimistically" executing tools before you're certain they're needed.
Instead of a single pipeline, you split your voice loop into two parallel tracks the moment the user finishes speaking:
- Track A (The Filler): Immediate conversational acknowledgement sent to TTS.
- Track B (The Speculation): Silent tool prediction and execution happening in the background.
Here's how it plays out:
- User says: "What's the weather like in Boulder, Colorado?"
- Track A fires immediately: The LLM generates conversational filler, such as "Checking the forecast for Boulder, Colorado...", and streams it to TTS.
- Track B fires simultaneously: A parallel process analyzes intent and calls get_weather(city="Boulder, CO").
- Synchronization: While TTS reads the filler (buying 1.5-2 seconds), the tool executes. By the time "...Colorado" finishes playing, the result is ready.
- Seamless continuation: The LLM appends "It looks like it's cloudy and 62 degrees."
The user hears continuous speech. The 3-second tool latency is hidden behind 3 seconds of filler.
How Do I Implement This With a Single LLM?
If you're using one model for everything, the key is prompt engineering. The problem lies in how standard tool-calling flows work.
When you make an API call to an LLM with tools enabled, you send your messages and a list of available tools. The model then responds with either text content or a tool call request. In most APIs (OpenAI, Anthropic, etc.), the model's response contains a tool_calls array when it decides to use a tool. Your application must then execute that tool, send the result back, and wait for another response.
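In sketch form, that standard round trip looks something like the following. The client usage is OpenAI-style; the model name, the `check_balance` tool schema, and the `execute_tool` dispatcher are illustrative assumptions, not part of any specific application.

```python
from openai import OpenAI

client = OpenAI()
messages = [{"role": "user", "content": "What's my checking account balance?"}]

# Illustrative tool schema for this example.
tools = [{
    "type": "function",
    "function": {
        "name": "check_balance",
        "description": "Look up the user's current account balance.",
        "parameters": {"type": "object", "properties": {}},
    },
}]

# First request: the model may answer with text or with a tool_calls array.
response = client.chat.completions.create(
    model="gpt-4o",      # illustrative model name
    messages=messages,
    tools=tools,
)
msg = response.choices[0].message

if msg.tool_calls:
    call = msg.tool_calls[0]
    # Nothing speakable yet -- the user hears silence while this runs.
    result = execute_tool(call.function.name, call.function.arguments)  # hypothetical dispatcher returning a string

    # Send the result back and wait for a second response before TTS gets any text.
    messages.append(msg)
    messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    final = client.chat.completions.create(model="gpt-4o", messages=messages)
    speech_text = final.choices[0].message.content
```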
The problem: The model outputs the tool call first, before any speech. Your application receives something like this:
123{ "tool": "check_balance" } // ... silence while tool executes ... "Your balance is $50."
At this point, there's nothing to send to TTS. Your user waits in silence while you execute the tool, send the result back to the model, and wait for the final text response: "Your balance is $50."
The fix: Instruct the LLM to output speech before requesting the tool. Instead of using the native tool-calling format, have the model output structured text that your application parses.
XML-style tags work better than JSON here because they're easier to parse while streaming: you can start piping text to TTS the moment the <speech> content begins arriving, without waiting for the full response.

```
<speech>Let me look up your current balance.</speech>
<tool>check_balance()</tool>
```
Your voice engine streams the text within <speech> to the TTS engine immediately. Your execution engine parses <tool> and runs it in the background while the voice plays. Structure your system prompt to enforce this ordering:
If you need to call a tool, ALWAYS output a brief spoken acknowledgment first.
Format: <speech>[filler]</speech><tool>[tool_call]</tool>
Do not output the tool call without preceding speech.
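On the application side, here is a minimal sketch of the streaming parser, assuming the model follows that format. `llm_stream` is an async iterator of text chunks, and `speak` and `run_tool` are hypothetical hooks into your TTS and tool-execution layers.

```python
import asyncio
import re

TOOL_PATTERN = re.compile(r"<tool>(.*?)</tool>", re.DOTALL)

async def speak_then_call(llm_stream, speak, run_tool):
    """Pipe <speech> text to TTS as it streams in; fire <tool> in the background."""
    buffer = ""
    spoken = 0          # number of speech characters already sent to TTS
    tool_task = None

    async for chunk in llm_stream:
        buffer += chunk

        # Stream any new text that sits inside the <speech> tag.
        start = buffer.find("<speech>")
        if start != -1:
            body = buffer[start + len("<speech>"):]
            end = body.find("</speech>")
            if end != -1:
                safe = body[:end]
            else:
                # Hold back a few characters in case "</speech>" is split across chunks.
                safe = body[:max(0, len(body) - len("</speech>"))]
            if len(safe) > spoken:
                await speak(safe[spoken:])      # send only the not-yet-spoken part
                spoken = len(safe)

        # Fire the tool as soon as a complete <tool> block has streamed in.
        if tool_task is None:
            match = TOOL_PATTERN.search(buffer)
            if match:
                tool_task = asyncio.create_task(run_tool(match.group(1).strip()))

    return await tool_task if tool_task else None
```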
What If I Want Even Lower Latency?
For sub-second responsiveness, use a two-stage architecture with a fast router model.
| Stage | Model | Latency | Purpose |
|---|---|---|---|
| Router | Small classifier or 8B model | ~50-100ms | Detect intent, predict tool |
| Main LLM | Your primary model | ~500ms+ | Generate natural response |
When the user audio ends, both fire simultaneously:
- The main LLM begins generating the conversational response.
- The Router Model predicts whether a tool is needed and, if so, which one.
If the router predicts a tool, it executes immediately. The result gets injected into the main LLM's context mid-generation or queues for the next sentence. The router can use regex-based pattern matching for common phrases ("what's the weather", "set a timer", "check my balance"), a fine-tuned small classifier, or a lightweight LLM like Llama-3-8B.
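Here is a sketch of the regex flavor of that router, plus the parallel kickoff. The patterns, tool names, and the `main_llm_response`/`run_tool` helpers are illustrative assumptions.

```python
import asyncio
import re

# Illustrative intent patterns -> tool names. A fine-tuned classifier or a small
# LLM (e.g. Llama-3-8B) would replace this table for fuzzier phrasing.
ROUTES = [
    (re.compile(r"\b(weather|forecast)\b", re.I), "get_weather"),
    (re.compile(r"\bset (a )?timer\b", re.I), "set_timer"),
    (re.compile(r"\bbalance\b", re.I), "check_balance"),
]

def route(transcript):
    """Return the predicted tool name, or None if no tool seems needed."""
    for pattern, tool_name in ROUTES:
        if pattern.search(transcript):
            return tool_name
    return None

async def on_user_turn(transcript):
    # Both tracks start the moment the transcript arrives.
    main_task = asyncio.create_task(main_llm_response(transcript))   # hypothetical main-LLM call
    tool_name = route(transcript)                                    # regex routing costs microseconds
    tool_task = asyncio.create_task(run_tool(tool_name)) if tool_name else None

    response = await main_task
    tool_result = await tool_task if tool_task else None
    return response, tool_result
```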
Can I Start Tool Calls Before the User Finishes Speaking?
Yes. This is called eager execution or pre-computation.
If the user's intent is highly probable (they're in a banking flow and likely asking about their balance), you can start the tool call during the voice activity detection (VAD) phase or transcription phase, before the sentence is fully transcribed.
This approach carries risk: you might call the wrong tool. Mitigate this by only eagerly executing "safe" read-only tools. Good candidates for eager execution include weather lookups, stock prices, account balances, and calendar queries. Avoid eager execution for anything that changes state, like transfers, purchases, message sending, or deletions.
If you speculate wrong on a read-only call, the only cost is wasted compute. If you speculate wrong on a write operation, you've potentially taken an irreversible action.
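A sketch of that guardrail: tools may fire eagerly only if they appear in an explicit read-only set. The set contents, `predict_tool`, and `run_tool` are illustrative assumptions; the function would hang off your ASR's partial-transcript callback.

```python
import asyncio

# Only tools listed here may be fired before the user finishes speaking.
# Anything that changes state (transfers, purchases, sends, deletes) must wait
# for the confirmed, fully transcribed request.
READ_ONLY_TOOLS = {"get_weather", "get_stock_price", "get_balance", "get_calendar"}

def maybe_fire_eagerly(partial_transcript, predict_tool, run_tool):
    """Called from the ASR partial-results callback, before transcription finishes."""
    tool_name = predict_tool(partial_transcript)        # e.g. the router sketched above
    if tool_name in READ_ONLY_TOOLS:
        # Worst case for a wrong guess: one wasted read-only API call.
        return asyncio.create_task(run_tool(tool_name))
    return None
```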
How Do I Handle Streaming Token Detection?
You can shave additional latency by parsing the LLM's output stream for tool call tokens and firing early:
```python
accumulated_text = ""

async for chunk in llm.stream(prompt):
    accumulated_text += chunk
    if "<tool>" in accumulated_text:
        # Don't wait for full generation
        tool_name = extract_partial_tool_name(accumulated_text)
        if confidence(tool_name) > threshold:
            fire_tool_early(tool_name)
```
This eliminates the gap between "model decided to call a tool" and "model finished generating the full call." Instead of waiting for the complete <tool>get_weather(city="Boulder, CO")</tool> to appear, you fire as soon as you see <tool>get_weather... with high confidence.
What Happens When Speculation Fails?
Sometimes you'll speculatively call the wrong tool or call a tool when none was needed. The good news: if your speculation happens on Track B (silent background execution), the user never knows.
Example scenario:
- User says: "I don't want to check the weather."
- System hears "check the weather" early and fires the API call.
- Once the full transcription arrives, the LLM realizes the context was negative.
- LLM simply ignores the tool output in its response.
The user hears the correct response. They never knew you "mistakenly" checked the weather because that happened silently. The only cost was a wasted API call.
For this to work cleanly, your architecture must keep speculative tool results separate from the response generation until the LLM explicitly decides to use them.
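One way to enforce that separation is a small buffer that response generation can only read from when the LLM explicitly names a tool. The class below is an illustrative sketch, not a fixed API.

```python
class SpeculationBuffer:
    """Holds speculative tool results without exposing them to response generation."""

    def __init__(self):
        self._results = {}

    def store(self, tool_name, result):
        # Track B writes here silently; nothing is added to the LLM context yet.
        self._results[tool_name] = result

    def take_if_requested(self, requested_tool):
        # Only hand a result over when the LLM explicitly asked for this tool;
        # wrongly speculated results simply stay here and are never spoken.
        return self._results.pop(requested_tool, None)
```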
What Does the Code Architecture Look Like for Speculative Tool Calling?
You cannot use a simple await llm_response() pattern. You need an event-driven, streaming architecture with parallel task management.
A simple Python implementation might look like:
```python
import asyncio

async def handle_voice_input(user_audio):
    # 1. Transcribe
    text = await transcribe(user_audio)

    # 2. Fire parallel tasks
    filler_task = asyncio.create_task(generate_filler(text))
    tool_task = asyncio.create_task(predict_and_execute_tool(text))

    # 3. Stream filler to audio immediately
    async for chunk in stream_from_task(filler_task):
        yield to_audio(chunk)

    # 4. Await tool result (likely already done)
    tool_result = await tool_task

    # 5. Generate final answer with tool context
    final_answer = await generate_answer(text, tool_result)
    yield to_audio(final_answer)
```
A more aggressive speculative execution:
```python
async def handle_utterance(text):
    # Start likely tools speculatively based on context
    speculative_tasks = {
        "weather": asyncio.create_task(get_weather(user_location)),
        "calendar": asyncio.create_task(get_calendar(user_id)),
        "balance": asyncio.create_task(get_balance(user_id)),
    }

    # Simultaneously get LLM decision
    llm_response = await get_llm_response(text)

    if llm_response.tool_call in speculative_tasks:
        # Tool already executed, result ready
        result = await speculative_tasks[llm_response.tool_call]
    else:
        result = None

    # Cancel speculative calls that weren't used
    for name, task in speculative_tasks.items():
        if name != llm_response.tool_call:
            task.cancel()

    return generate_final_response(llm_response, result)
```
The tradeoff with aggressive speculation is that you're burning API calls and compute on tools you may not need.
Which Approach Should I Use?
It depends on your constraints:
| Approach | Best For | Tradeoffs |
|---|---|---|
| Speech-first prompting | Single-LLM setups; simplest to implement | Model spends ~20-50 extra tokens generating filler before the tool call |
| Fast router model | High-volume production systems | Requires maintaining two models |
| Streaming token detection | When you control the LLM output format | Parsing complexity, potential for false positives |
| Eager/speculative execution | Predictable, constrained domains | Wasted compute, must limit to read-only operations |
For most applications, combine speech-first prompting with streaming token detection. The filler buys you 1.5-2 seconds of perceived responsiveness, and early tool firing cuts the remaining latency. Add speculative execution if you have a small, predictable set of tools and can tolerate the extra cost.
The main principle is to hide latency behind speech. Users don't mind waiting if they hear continuous, relevant audio. They only notice delays when silence occurs.
To implement speculative tool calling when building a voice agent:
- Decouple speech from action by running filler generation and tool execution in parallel.
- Prompt your LLM to output conversational acknowledgments before tool calls.
- Parse streams aggressively to fire tools before the full call is generated.
- Speculate safely by pre-executing read-only tools when intent is predictable.
- Handle misses gracefully by keeping speculative results silent until explicitly used.
The 3-second silence disappears when you stop treating your voice loop as a serial pipeline.