Voice Assistant: A Smart Voice Interface with Local Command Execution
Voice Assistant is a lightweight middleware layer that connects modern AI voice models to a dynamic function-calling engine — letting you run those calls against your own local subsystems.
In practice, this means you can spin up a fully functional AI voice chat, wire it to hardware clients (ESP32 with a mic and speaker, for example) or a desktop app, and define exactly which functions the model is allowed to invoke.
A home automation setup is the obvious use case. Flash your ESP32, point it at Voice Assistant, and you've got your own smart speaker — ask it to toggle a light or adjust the thermostat. What makes this different from off-the-shelf solutions is that you own the function registry. The assistant can also push audio back to your devices: play WAV files or synthesize text-to-speech notifications in the model's voice.
Security & Privacy
The core design principle: the AI model is the brain — the hands are yours, running locally.
- No hidden telemetry. Voice Assistant does not log your conversations in the background.
- Network isolation. The model never touches your LAN, MQTT broker, or GPIO directly. It receives function descriptors, decides which to call, and waits for your local server to report results — nothing more.
- Auth & encryption. Every client device authenticates with a unique `API_KEY`, which is bound to a specific, scoped set of functions. Compromised device? Revoke one key, done.
- Zero NAT headaches. No port forwarding, no router config required.
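The per-key scoping described above can be sketched as a simple lookup. Key strings, function names, and the in-memory table here are all hypothetical; the real token format and storage are up to Voice Assistant's configuration:

```python
# Minimal sketch of API-key scoping: each key maps to the set of
# functions it may invoke. Keys and function names are made up.
SCOPES = {
    "esp32-livingroom-key": {"toggle_light", "set_thermostat"},
    "desktop-app-key": {"play_wav", "speak_text"},
}

def is_allowed(api_key: str, function: str) -> bool:
    """Return True only if the key exists and covers the function."""
    return function in SCOPES.get(api_key, set())

# Revoking a compromised device is a single deletion:
# del SCOPES["esp32-livingroom-key"]
```

The point of the design is that revocation is local and instant: no round trip to the model provider is needed to cut off a device.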
Cross-Platform, Single Binary
Voice Assistant ships as a self-contained binary with no external runtime dependencies. Supported targets:
- Windows (x64)
- Linux (amd64 / x86_64)
- Raspberry Pi (aarch64)
- macOS (aarch64 / Apple Silicon)
Drop the binary, run it — it daemonizes itself and starts listening for WebSocket connections from your ESP32s or other clients.
How Function Calling Works
Voice Assistant acts as a local execution layer. The model decides what to call; Voice Assistant decides how to run it.
Responsibility split:
Voice Assistant (local):
- WebSocket session management (`/ws/voice`)
- Audio pipeline: buffering, VAD, echo cancellation, resampling
- Parsing the `functions.json` configuration
- Executing actions: HTTP requests, MQTT publish, GPIO commands, OS shell calls
- Connection orchestration, keepalives, heartbeat, and the audio playback API
AI model (remote):
- Speech recognition and natural language understanding (STT / NLP)
- Voice response synthesis (TTS / audio generation)
- Intent routing — deciding whether to invoke a function and which one
Request lifecycle:
1. User speaks; the audio stream arrives at Voice Assistant.
2. Voice Assistant forwards the stream to the AI model (e.g. Gemini Live).
3. The model parses intent and emits a `toolCall` event (function name, arguments, ID).
4. Voice Assistant looks up the function in the local `functions.json`.
5. The bound handler executes the action.
6. The result is sent back to the model as a `toolResponse`.
7. The model evaluates the outcome and speaks the result to the user.
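The lookup-execute-respond portion of this lifecycle can be sketched as a small dispatcher. The event shape, handler, and registry below are assumptions for illustration, not Voice Assistant's actual wire format:

```python
# Hypothetical handler standing in for a WEBHOOK/MQTT/GPIO action.
def toggle_light(room: str) -> dict:
    return {"room": room, "state": "on"}

# Stand-in for a parsed functions.json: function name -> bound handler.
REGISTRY = {"toggle_light": toggle_light}

def handle_tool_call(event: dict) -> dict:
    """Look up the function locally, run it, and build a toolResponse."""
    name = event["name"]
    handler = REGISTRY.get(name)
    if handler is None:
        return {"id": event["id"], "error": f"unknown function: {name}"}
    result = handler(**event.get("args", {}))
    return {"id": event["id"], "result": result}

# A model-emitted toolCall (shape is illustrative):
call = {"id": "tc-1", "name": "toggle_light", "args": {"room": "kitchen"}}
response = handle_tool_call(call)
```

Note that the model only ever sees the `toolResponse` dict; the handler itself, and whatever it touches on the LAN, stays local.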
Built-in Action Handlers
Out of the box, Voice Assistant supports the following integration types:
- `WEBHOOK`: fire HTTP requests to local services. Pairs well with n8n, Node-RED, or Home Assistant.
- `MQTT`: publish messages to a local MQTT broker.
- `GPIO`: send pin control commands directly over the active client connection.
- `EXEC`: run shell commands on the host machine where Voice Assistant is running.
- `SYSTEM`: internal actions such as `start_session` or `close_session`.
- `MCP`: MCP integration in both directions: act as an MCP client calling external MCP servers, or as an MCP server exposing TTS to external MCP clients.
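The exact `functions.json` schema isn't documented in this section, so the following is an illustrative shape (field names, URL, and topic are assumptions) showing how a `WEBHOOK` and an `MQTT` function might be registered:

```json
{
  "functions": [
    {
      "name": "toggle_light",
      "type": "WEBHOOK",
      "description": "Turn a light on or off",
      "url": "http://127.0.0.1:8123/api/webhook/toggle_light"
    },
    {
      "name": "set_thermostat",
      "type": "MQTT",
      "description": "Set the target temperature",
      "topic": "home/thermostat/set"
    }
  ]
}
```

Whatever the real schema, the `description` text matters: it is what the model sees when deciding which function to invoke.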
TTS and Audio Output
Voice Assistant supports two feedback channels for pushing audio back to your devices:
- WAV playback — stream a pre-recorded audio file to a specific speaker client.
- Text-to-speech — synthesize arbitrary text using the model's voice, with optional emotional inflection based on message context.
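For the WAV playback channel, the device ultimately receives ordinary PCM audio. A stdlib-only sketch that writes a short 16-bit mono test tone of the kind you could hand to the assistant (the filename, sample rate, and tone are arbitrary choices, not requirements of Voice Assistant):

```python
import math
import struct
import wave

RATE = 16000      # 16 kHz mono, a common rate for small voice hardware
SECONDS = 0.25
FREQ = 440.0      # A4 test tone

with wave.open("chime.wav", "wb") as w:
    w.setnchannels(1)     # mono
    w.setsampwidth(2)     # 16-bit PCM samples
    w.setframerate(RATE)
    for i in range(int(RATE * SECONDS)):
        sample = int(32767 * 0.3 * math.sin(2 * math.pi * FREQ * i / RATE))
        w.writeframes(struct.pack("<h", sample))
```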
Bottom Line
If you want a reliable "say it → it happens locally" pipeline, Voice Assistant's architecture is built exactly for that. You get the full capability of a state-of-the-art AI as your voice interface, while keeping execution, security, and network access entirely within your own infrastructure.