AI won’t stay online. It won’t stay on your laptop. It won’t stay centralized. It will move to every device and to the edge of every network, into your earbuds, your car, your factory floor, and your doorbell.
This opens up a remarkable number of use cases. A fitness coach who listens continuously, counts your reps, and shouts encouragement when you're flagging. An accessibility device that detects non-verbal cues and synthesizes speech for someone who can't. A wearable that translates speech in real-time, preserving the timing and emotion of the original speaker.
These applications share a common requirement: they need edge-optimized speech workflows with speech-to-text (STT) and text-to-speech (TTS) that are fast, accurate, and capable of running with minimal cloud dependency. An architecture that combines Deepgram Nova-3 for STT with Fish Speech V1.5 for TTS fits these constraints well.
In this build, we want not just to show you how to wire up APIs, but to give you a foundation you can adapt for coaching apps, accessibility tools, voice dubbing pipelines, or whatever edge-native speech workflow you're building.
An Edge-Optimized Speech Workflow from Microphone to Speaker
A speech workflow is the pipeline that converts spoken input into useful output. At minimum: microphone to speech-to-text to some processing logic to text-to-speech to speaker.
The classic architecture sends audio to cloud APIs at every step. This works fine when you have reliable connectivity and can tolerate round-trip latency.
Edge optimization means pushing as much of this pipeline as possible closer to the user. Sometimes that means running models locally on the device. Sometimes it means a hybrid approach where you use cloud APIs when available, but fall back to local models when offline. The goal is responsiveness and reliability, not ideological purity about where compute happens.
The architecture in this build uses a hybrid approach. Deepgram handles STT in the cloud (fast, accurate, streaming). Fish Speech handles TTS locally or via their API. Voice activity detection (VAD) runs locally via Silero VAD. The result is a system that can respond in real-time when online and degrade gracefully when offline.
Deepgram Nova-3 STT
Deepgram Nova-3 is a speech-to-text model optimized for real-time streaming. The key features for edge workflows:
- Streaming transcription. Results arrive as the user speaks, not after they finish. The SDK maintains a WebSocket connection and emits both interim (partial) and final transcripts.
- Endpoint detection. Nova-3 detects when a user has finished speaking based on a configurable silence duration. No need to build your own voice activity detection for turn-taking.
- Smart formatting. Automatic punctuation, capitalization, and numeral formatting. "I need three hundred dollars" becomes "I need $300."
- Low word error rate. Deepgram claims Nova-3 has 54% lower WER than competitors. In practice, it handles accents, background noise, and casual speech well.
The tradeoff is cloud dependency. Deepgram requires an API key and an internet connection. For offline fallback, this build uses faster-whisper, a CTranslate2-optimized version of OpenAI's Whisper that runs locally.
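If you want a feel for the offline path, here is a minimal faster-whisper sketch. The model size, device, and compute type are assumptions you would tune for your hardware, and the audio file name is a placeholder.

```python
from faster_whisper import WhisperModel

# Assumed settings: "base" model, CPU inference with int8 quantization
model = WhisperModel("base", device="cpu", compute_type="int8")

# transcribe() accepts a file path or a 16kHz float32 numpy array
segments, info = model.transcribe("utterance.wav", language="en")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```

This is batch transcription rather than streaming, which is the main ergonomic difference you accept when you drop offline.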
Fish Speech V1.5 TTS
Fish Speech is an open-source text-to-speech model that can run locally or via Fish Audio's cloud API. The key features:
- Emotion tags. Prefix text with tags like [happy] or [serious] to control the emotional tone of the output. The model supports the following emotions: neutral, happy, sad, angry, fearful, surprised, disgusted, calm, serious, and excited.
- Voice cloning. Provide a reference audio sample, and Fish Speech will synthesize speech in that voice. Useful for consistent character voices or personalized assistants.
- Local inference. The model runs on consumer-grade GPUs, enabling fully offline TTS. For lower-powered devices, the Fish Audio API provides the same quality without local compute.
The combination of emotion control and voice cloning makes Fish Speech particularly useful for applications where the voice needs personality, not just intelligibility.
Building a Speech Pipeline
The full implementation is available in this repository, but let's walk through the key pieces so you understand how they fit together. We'll start at the system boundaries and work inward to the orchestration layer.
The Data Types That Flow Between Components
Before diving into components, it helps to see what flows between them. The pipeline passes around three simple structures.
An AudioChunk carries raw audio samples. The data is normalized to the float32 range -1.0 to 1.0, which is the standard format for audio processing libraries. The sample rate travels with the data because different parts of the pipeline expect different rates: STT typically wants 16kHz, while TTS outputs at 44.1kHz.
```python
from dataclasses import dataclass, field
from datetime import datetime

import numpy as np
from numpy.typing import NDArray


@dataclass
class AudioChunk:
    data: NDArray[np.float32]  # Audio samples, mono, -1.0 to 1.0
    sample_rate: int
    timestamp: datetime = field(default_factory=datetime.now)

    @property
    def duration_seconds(self) -> float:
        return len(self.data) / self.sample_rate

    def resample(self, target_rate: int) -> "AudioChunk":
        if self.sample_rate == target_rate:
            return self
        ratio = target_rate / self.sample_rate
        new_length = int(len(self.data) * ratio)
        indices = np.linspace(0, len(self.data) - 1, new_length)
        resampled = np.interp(indices, np.arange(len(self.data)), self.data)
        return AudioChunk(data=resampled.astype(np.float32), sample_rate=target_rate)
```
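Resampling is then a one-liner at call sites. A quick illustrative use, assuming the dataclass above: downsampling a second of 44.1kHz audio to the 16kHz rate STT expects.

```python
# Illustrative only: one second of 44.1kHz audio, resampled for a 16kHz consumer
tts_chunk = AudioChunk(data=np.zeros(44_100, dtype=np.float32), sample_rate=44_100)
stt_ready = tts_chunk.resample(16_000)
print(stt_ready.duration_seconds)  # still ~1.0s; only the sample count changes
```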
STT produces TranscriptResult objects. The is_final flag is crucial for understanding the streaming model: as you speak, the STT engine emits interim results that may change as it hears more context. "I want to" might become "I want to go" and then "I want to go to the store." Only when the engine detects you've finished speaking does it emit a final result.
```python
@dataclass
class TranscriptResult:
    text: str
    is_final: bool
    confidence: float = 1.0
    words: list[WordInfo] = field(default_factory=list)
```
TTS takes a TTSRequest and returns audio. The emotion field is Fish Speech-specific, but structuring it this way keeps the interface generic enough that you could swap in a different TTS engine without changing calling code.
```python
@dataclass
class TTSRequest:
    text: str
    voice_id: str = "default"
    emotion: Optional[str] = None  # "happy", "sad", "serious", etc.
    speed: float = 1.0
```
Capturing Audio Across Thread Boundaries
Audio capture seems simple until you realize it involves two different concurrency models colliding. The sounddevice library uses a callback architecture: it runs a background thread that invokes your callback whenever a new audio buffer is ready. But the rest of our pipeline is async Python, which runs on a single thread with cooperative multitasking.
The bridge between these worlds is call_soon_threadsafe. When sounddevice's thread receives new audio, it can't just enqueue it directly into an asyncio queue (that would create a race condition). Instead, it schedules the put operation to run on the asyncio thread at the next opportunity.
```python
class MicrophoneInput:
    async def stream(self) -> AsyncIterator[AudioChunk]:
        self._queue = asyncio.Queue()
        self._running = True
        self._loop = asyncio.get_running_loop()  # captured so the callback thread can reach it

        def callback(indata, frames, time_info, status):
            if self._running and not self._muted:
                data = indata.copy().flatten().astype(np.float32)
                # Hop from sounddevice's thread onto the asyncio event loop
                self._loop.call_soon_threadsafe(self._queue.put_nowait, data)

        with sd.InputStream(
            samplerate=self.sample_rate,
            channels=1,
            dtype=np.float32,
            blocksize=self.chunk_samples,
            callback=callback,
        ):
            while self._running:
                try:
                    data = await asyncio.wait_for(self._queue.get(), timeout=0.5)
                except asyncio.TimeoutError:
                    continue  # no audio arrived (e.g. muted); recheck _running and keep waiting
                yield AudioChunk(data=data, sample_rate=self.sample_rate)
```
The mute() and unmute() methods exist for a reason that becomes clear when you run the full pipeline: when the assistant speaks through the speakers, the microphone picks up that audio. Without muting, the system would transcribe its own output and potentially respond to itself in an infinite loop.
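The muting itself can be as simple as a flag the capture callback checks. A minimal sketch, assuming _muted is initialized to False in __init__ (the repo's implementation may differ):

```python
class MicrophoneInput:
    # ... stream() as above ...

    def mute(self) -> None:
        # The capture callback drops buffers while this flag is set
        self._muted = True

    def unmute(self) -> None:
        self._muted = False
```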
Voice Activity Detection to Filter Silence Without Clipping Speech
Raw microphone input is a continuous stream of audio chunks regardless of whether anyone is speaking. Sending all of this to the STT engine wastes bandwidth and can confuse the transcription (long silences sometimes produce hallucinated text). Voice Activity Detection filters the stream, passing only chunks containing speech.
Silero VAD is a small neural network that takes an audio chunk and outputs a probability that the audio contains speech. But we can't just threshold each chunk independently. Consider what happens when someone pauses mid-sentence: a naive filter would cut off the audio during the pause, splitting one utterance into two fragments. And at the start of speech, the first few milliseconds might fall below the threshold even though they're the beginning of a word.
The solution is buffering with hysteresis. We accumulate chunks until we're confident speech has started (several consecutive chunks above threshold), then we yield all the buffered chunks plus everything that follows. We stop yielding only when we observe sustained silence (several consecutive chunks below the threshold). The padding ensures we don't clip the beginnings and endings of words.
```python
class SileroVAD:
    def __init__(self, config: VADConfig):
        self.config = config
        self._model, _ = torch.hub.load("snakers4/silero-vad", "silero_vad", onnx=True)

    def get_speech_probability(self, audio: np.ndarray, sample_rate: int) -> float:
        audio_tensor = torch.from_numpy(audio)
        return self._model(audio_tensor, sample_rate).item()

    async def filter_speech(
        self, audio_stream: AsyncIterator[AudioChunk]
    ) -> AsyncIterator[AudioChunk]:
        # min_speech_chunks / max_silence_chunks are illustrative config names
        speech_buffer: list[AudioChunk] = []
        in_speech = False
        silence_chunks = 0

        async for chunk in audio_stream:
            is_speech = (
                self.get_speech_probability(chunk.data, chunk.sample_rate)
                > self.config.threshold
            )
            if is_speech:
                silence_chunks = 0
                if in_speech:
                    yield chunk
                else:
                    speech_buffer.append(chunk)
                    # Require several consecutive speech chunks before declaring speech
                    if len(speech_buffer) >= self.config.min_speech_chunks:
                        in_speech = True
                        for buffered in speech_buffer:
                            yield buffered
                        speech_buffer = []
            elif in_speech:
                # Yield through short pauses; end the utterance after sustained silence
                silence_chunks += 1
                yield chunk
                if silence_chunks >= self.config.max_silence_chunks:
                    in_speech = False
                    speech_buffer = []
            else:
                speech_buffer = []
```
Streaming Audio to Deepgram While Receiving Transcripts Back
Deepgram's streaming API creates a persistent WebSocket connection. Unlike a REST API, where you send a complete audio file and receive a complete transcript, here two things happen simultaneously: you continuously send audio chunks, and Deepgram continuously returns transcript updates.
This bidirectional flow requires two concurrent tasks. One task reads from our audio stream and writes to the WebSocket. The other task reads transcript events from Deepgram and yields them to our caller. The asyncio.create_task call starts the sender running in the background while the main function body handles receiving.
```python
class DeepgramSTT:
    async def transcribe_stream(
        self, audio_stream: AsyncIterator[AudioChunk]
    ) -> AsyncIterator[TranscriptResult]:
        result_queue = asyncio.Queue()

        async def on_message(result, **kwargs):
            transcript_text = result["channel"]["alternatives"][0]["transcript"]
            is_final = result.get("is_final", False) or result.get("speech_final", False)
            if transcript_text.strip():
                await result_queue.put(TranscriptResult(
                    text=transcript_text,
                    is_final=is_final,
                ))

        socket = await self._client.transcription.live({
            "model": "nova-3",
            "interim_results": True,
            "utterance_end_ms": 1000,
            "smart_format": True,
        })
        socket.registerHandler(socket.event.TRANSCRIPT_RECEIVED, on_message)

        async def send_audio():
            async for chunk in audio_stream:
                # Deepgram expects 16-bit PCM, so convert from float32
                audio_bytes = (chunk.data * 32767).astype(np.int16).tobytes()
                socket.send(audio_bytes)
            await socket.finish()

        send_task = asyncio.create_task(send_audio())

        while True:
            # A close handler (not shown) enqueues None once the connection ends
            result = await result_queue.get()
            if result is None:
                break
            yield result
```
The utterance_end_ms parameter controls Deepgram's endpoint detection. When Deepgram sees 1000ms of silence, it finalizes the current transcript and emits a result with is_final=True. This is separate from VAD: VAD filters what we send to Deepgram, while utterance_end_ms controls when Deepgram decides an utterance is complete. For coaching applications where you want to let users pause to think, you might increase this to 1500ms. For rapid back-and-forth, 500ms feels more responsive.
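Tuning it is nothing more than changing one value in the live-transcription options. The dictionaries below mirror the options shape from the snippet above and are illustrative, not prescriptive.

```python
# Coaching: tolerate thinking pauses before finalizing an utterance
coaching_options = {
    "model": "nova-3",
    "interim_results": True,
    "smart_format": True,
    "utterance_end_ms": 1500,
}

# Rapid back-and-forth: finalize quickly for snappy turn-taking
rapid_chat_options = {
    "model": "nova-3",
    "interim_results": True,
    "smart_format": True,
    "utterance_end_ms": 500,
}
```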
Text-to-Speech with Fish Audio
Fish Speech can run locally or via Fish Audio's cloud API. The cloud API is more straightforward to set up: no model downloads, no GPU required. You POST text and get back audio.
The emotion tags are a simple but effective feature. Prepend [happy] or [serious] to your text, and Fish Speech adjusts the prosody accordingly. For a coaching application, this lets you match tone to content: encouragement might use [excited], while corrections use [calm].
```python
class FishAudioAPI:
    EMOTIONS = ["neutral", "happy", "sad", "angry", "fearful",
                "surprised", "calm", "serious", "excited"]

    async def synthesize(self, request: TTSRequest) -> TTSResult:
        text = request.text.strip()

        # Prepend emotion tag if specified
        if request.emotion and request.emotion in self.EMOTIONS:
            text = f"[{request.emotion}]{text}"

        response = await self._client.post("/v1/tts", json={
            "text": text,
            "reference_id": self.config.reference_id,  # For voice cloning
            "format": "mp3",
        })
        audio = self._decode_mp3(response.content)
        return TTSResult(audio_data=audio, sample_rate=44100, request=request)
```
The reference_id parameter enables voice cloning. If you've uploaded a reference audio sample to Fish Audio, you can pass its ID here, and the synthesized speech will match that voice. This is useful for creating a consistent persona, or for accessibility applications where you want the device to speak in the user's own voice.
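Putting emotion tags and voice cloning together, a call might look like the sketch below. FishSpeechConfig is an assumed name for the wrapper's config object, and the reference ID is a placeholder for one you've created in Fish Audio.

```python
async def speak_coaching_line() -> None:
    # Assumed config class; reference_id is a placeholder for your uploaded voice
    config = FishSpeechConfig(reference_id="your-reference-id")
    tts = FishAudioAPI(config)
    result = await tts.synthesize(TTSRequest(
        text="Nice answer. Let's tighten up the opening sentence.",
        emotion="calm",
    ))
    # result.audio_data holds the decoded samples at result.sample_rate (44.1kHz here)
```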
Switching to Local TTS Without Changing the Rest of the Code
The architecture makes it straightforward to swap Fish Audio's cloud API for local inference. Both implementations share the same TTSProvider interface, so the rest of the pipeline doesn't need to know which one is running.
```python
class TTSProvider(ABC):
    @abstractmethod
    async def synthesize(self, request: TTSRequest) -> TTSResult: ...

    async def synthesize_stream(self, request: TTSRequest) -> AsyncIterator[AudioChunk]:
        result = await self.synthesize(request)
        yield result.to_chunk()

    @property
    @abstractmethod
    def sample_rate(self) -> int: ...
```
The local provider loads the model on first use and runs inference in a thread pool. The thread pool matters: inference is CPU-intensive, and running it directly on the async event loop would block every other task (including audio capture) until it completes.
```python
class FishSpeechTTS(TTSProvider):
    def _ensure_model_loaded(self):
        if self._model is not None:
            return
        device = "cuda" if torch.cuda.is_available() else "cpu"
        model_path = self.config.model_path or Path("models/fish-speech")
        self._model = load_fish_speech_model(model_path, device)

    async def synthesize(self, request: TTSRequest) -> TTSResult:
        self._ensure_model_loaded()

        text = request.text.strip()
        if request.emotion and request.emotion in self.EMOTIONS:
            text = f"[{request.emotion}]{text}"

        loop = asyncio.get_event_loop()
        audio = await loop.run_in_executor(None, self._synthesize_impl, text, request.speed)
        return TTSResult(audio_data=audio, sample_rate=44100, request=request)
```
A factory function selects the provider based on configuration:
```python
def create_tts_provider(config: VoiceLoopConfig) -> TTSProvider:
    if config.tts_provider == "fish_audio_api":
        return FishAudioAPI(config.fish_speech)
    elif config.tts_provider == "fish_speech":
        return FishSpeechTTS(config.fish_speech)
    elif config.tts_provider == "edge_tts":
        return EdgeTTS()
```
Switching from cloud to local is a one-line config change:
```yaml
# Cloud TTS (requires API key, works on any hardware)
tts_provider: fish_audio_api

# Local TTS (requires model download, needs GPU for reasonable speed)
tts_provider: fish_speech
```
This same pattern applies throughout the codebase. Deepgram can be replaced with local Whisper; OpenAI can be replaced with Anthropic or local Ollama. The voice loop doesn't know or care which providers are running.
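A sketch of what the STT side of that swap could look like, mirroring the TTS factory above. STTProvider, FasterWhisperSTT, and the provider strings are assumptions for illustration, not identifiers from the repo.

```python
def create_stt_provider(config: VoiceLoopConfig) -> STTProvider:
    # Assumed names throughout; the repo's actual identifiers may differ
    if config.stt_provider == "deepgram":
        return DeepgramSTT(config.deepgram)
    elif config.stt_provider == "faster_whisper":
        return FasterWhisperSTT(config.whisper)  # local offline fallback
    raise ValueError(f"Unknown STT provider: {config.stt_provider}")
```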
Tying It All Together in the Voice Loop
The VoiceLoop class ties everything together. It creates the async generator chain (microphone → VAD → STT), processes transcripts as they arrive, generates responses, and speaks them.
The flow is worth tracing through. When the loop starts, mic.stream() begins yielding audio chunks. Those chunks flow through vad.filter_speech(), which only passes through chunks containing speech. Those filtered chunks flow into stt.transcribe_stream(), which sends them to Deepgram and yields transcript results as they arrive.
Most transcript results are partial and get ignored. When a final transcript arrives, we generate a response and speak it. While speaking, the microphone is muted to prevent feedback.
```python
class VoiceLoop:
    async def run(self):
        audio_stream = self.mic.stream()
        if self.vad:
            audio_stream = self.vad.filter_speech(audio_stream)

        async for transcript in self.stt.transcribe_stream(audio_stream):
            if not transcript.is_final:
                continue

            if self._speaking and self.config.interruption_enabled:
                self.speaker.stop()
                self._speaking = False

            response = await self._generate_response(transcript.text)
            await self._speak_response(response)

    async def _speak_response(self, text: str):
        self._speaking = True
        self.mic.mute()
        try:
            request = TTSRequest(text=text, emotion=None, speed=1.0)
            audio_stream = self.tts.synthesize_stream(request)
            await self.speaker.play_stream(audio_stream)
        finally:
            self._speaking = False
            self.mic.unmute()
```
The interruption handling addresses a common annoyance with voice assistants: you ask a question, realize you misspoke, and have to wait for the assistant to finish its response before you can correct yourself. With interruption enabled, speaking while the assistant is talking immediately stops playback. The assistant processes what you said instead of finishing its previous thought.
The sequence matters here. When we detect speech while _speaking is true, we call speaker.stop() to halt playback, set _speaking to false, then continue to process the transcript normally. The microphone was muted during playback, but the STT buffer may still contain audio from before playback started. That's the audio we're responding to.
Putting It Together With a Coaching Assistant
The coaching_assistant.py example shows these pieces working together. It configures the pipeline for continuous listening with an LLM that provides domain-specific coaching feedback.
```python
COACHING_PROMPTS = {
    "interview": """You are a professional interview coach.
Ask common interview questions, evaluate answers for structure and confidence,
and suggest stronger ways to phrase experiences.
After each answer, give brief feedback then ask the next question.""",
}

class CoachingAssistant:
    def __init__(self, mode: str = "speaking", llm_provider: str = "openai"):
        self.config = VoiceLoopConfig(
            mode="hybrid",
            stt_provider="deepgram",
            tts_provider="fish_audio_api",
        )
        self.config.deepgram.utterance_end_ms = 1500  # Longer pause tolerance
        self.config.fish_speech.reference_id = "b545c585f631496c914815291da4e893"
        self.config.llm.system_prompt = COACHING_PROMPTS[mode]
```
The interview coaching mode demonstrates the full loop in action. The assistant asks a question, waits for your answer (tolerating pauses as you think), provides feedback, and then asks the next question. In the terminal, it looks like this:
```
# LLM client initializes first
2026-01-09 14:56:16 [info  ] OpenAI client initialized  model=None

# Microphone mutes while the coach speaks its opening greeting
2026-01-09 14:56:17 [debug ] Microphone muted
2026-01-09 14:56:17 [info  ] Fish Audio API client initialized
2026-01-09 14:56:20 [debug ] Fish Audio synthesis complete  audio_duration=9.926530612244898 text_length=176

# Greeting finishes playing, microphone unmutes, system ready for user
2026-01-09 14:56:30 [debug ] Microphone unmuted
Listening... (speak now)

# Deepgram WebSocket connection opens, audio streaming begins
2026-01-09 14:56:30 [info  ] Deepgram client initialized (SDK 2.x)
2026-01-09 14:56:30 [info  ] Starting Deepgram streaming transcription  diarize=False language=en-US model=nova-3
2026-01-09 14:56:31 [info  ] Starting microphone input  chunk_samples=512 device=None sample_rate=16000

# User speaks, Deepgram returns final transcript, LLM generates response
You: A marketing manager at a SaaS company.
Coach: Great, let's dive into some questions that you might encounter...

# Microphone mutes during TTS playback to prevent feedback loop
2026-01-09 14:56:39 [debug ] Microphone muted
2026-01-09 14:56:41 [debug ] Fish Audio synthesis complete  audio_duration=11.049795918367346 text_length=195

# Response finishes playing, microphone unmutes for next turn
2026-01-09 14:56:53 [debug ] Microphone unmuted

# Second exchange: user gives brief answer, coach asks for more detail
You: I was the leader of the campaign
Coach: That's a good start. Could you elaborate more on the specifics?...

# Same mute/synthesize/unmute cycle
2026-01-09 14:56:59 [debug ] Microphone muted
2026-01-09 14:57:04 [debug ] Fish Audio synthesis complete  audio_duration=20.297142857142855 text_length=327
2026-01-09 14:57:25 [debug ] Microphone unmuted
```
The latency between finishing your answer and receiving feedback is typically under a second. Fast enough that the interaction feels like a conversation rather than a call-and-response with a server.
Taking the Voice Loop From Demo to Product
The pipeline we've built handles the core speech loop: listen, understand, respond, speak. It runs in hybrid mode by default, using cloud APIs when available and falling back to local models when not. The abstraction boundaries make it straightforward to swap components as your requirements change.
This is the foundation on which we can build any speech loop. Consider the three applications from the introduction:
- The fitness coach needs to run locally on a wearable with intermittent connectivity. The offline configuration handles this, but you'd want to add a wake word so it isn't always listening and draining battery. The LLM would need state to track rep counts across utterances, and you'd use Fish Speech's emotion tags to match energy to the moment: [excited] for encouragement, [calm] for form corrections (see the sketch after this list).
- The accessibility device has stricter latency requirements than conversational applications. When someone is waiting for their words to be spoken, even 500ms feels slow. You'd want to stream TTS playback while synthesis is still in progress, rather than waiting for the full audio to finish. The synthesize_stream interface is already in place for this. You'd also need robust error recovery: if Deepgram disconnects mid-utterance, the device can't just fail silently.
- The real-time translator introduces a challenge that the current architecture doesn't address: preserving timing and prosody from the source speech. You'd extract word-level timestamps from Deepgram (the words field in TranscriptResult), translate in chunks rather than waiting for complete sentences, and use Fish Speech's speed parameter to match the original pacing. Interruption handling becomes essential here since the speaker won't pause neatly between sentences.
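For the fitness coach, the emotion-matching idea can be as small as a lookup from coaching intent to tag. A hedged sketch; the intent categories and mapping are illustrative, built on the TTSRequest type from earlier.

```python
# Illustrative mapping from coaching intent to Fish Speech emotion tags
EMOTION_BY_INTENT = {
    "encouragement": "excited",
    "form_correction": "calm",
    "rep_count": "neutral",
}

def coach_tts_request(text: str, intent: str) -> TTSRequest:
    return TTSRequest(text=text, emotion=EMOTION_BY_INTENT.get(intent, "neutral"))
```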
AI moving to the edge isn't a prediction. It's already happening in hearing aids, cars, and factory equipment. The constraint has always been the software: stitching together speech recognition, language understanding, and synthesis in a way that's fast enough to feel natural and robust enough to work without constant connectivity.
The tools have finally caught up. Deepgram streams transcripts in real-time. Fish Speech runs on consumer hardware. The pipeline in this repo is a few hundred lines of Python. The hard part now is deciding what to build with it.
