Voice Assistant: A Smart Voice Interface with Local Command Execution
Voice Assistant is a lightweight middleware layer that connects modern AI voice models to a dynamic function-calling engine — letting you run those calls against your own local subsystems.
In practice, this means you can spin up a fully functional AI voice chat, wire it to hardware clients (ESP32 with a mic and speaker, for example) or a desktop app, and define exactly which functions the model is allowed to invoke.
A home automation setup is the obvious use case. Flash your ESP32, point it at Voice Assistant, and you've got your own smart speaker — ask it to toggle a light or adjust the thermostat. What makes this different from off-the-shelf solutions is that you own the function registry. The assistant can also push audio back to your devices: play WAV files or synthesize text-to-speech notifications in the model's voice.
Security & Privacy
The core design principle: the AI model is the brain — the hands are yours, running locally.
- No hidden telemetry. Voice Assistant does not log your conversations in the background.
- Network isolation. The model never touches your LAN, MQTT broker, or GPIO directly. It receives function descriptors, decides which to call, and waits for your local server to report results — nothing more.
- Auth & encryption. Every client device authenticates with a unique `API_KEY`, which is bound to a specific, scoped set of functions. Compromised device? Revoke one key, done.
- Zero NAT headaches. No port forwarding, no router config required.
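The per-key scoping described above can be sketched as a simple lookup. Key strings, function names, and the in-memory table here are all hypothetical; the real token format and storage are up to Voice Assistant's configuration:

```python
# Minimal sketch of API-key scoping: each key maps to the set of
# functions it may invoke. Keys and function names are made up.
SCOPES = {
    "esp32-livingroom-key": {"toggle_light", "set_thermostat"},
    "desktop-app-key": {"play_wav", "speak_text"},
}

def is_allowed(api_key: str, function: str) -> bool:
    """Return True only if the key exists and covers the function."""
    return function in SCOPES.get(api_key, set())

# Revoking a compromised device is a single deletion:
# del SCOPES["esp32-livingroom-key"]
```

The point of the design is that revocation is local and instant: no round trip to the model provider is needed to cut off a device.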
Cross-Platform, Single Binary
Voice Assistant ships as a self-contained binary with no external runtime dependencies. Supported targets:
- Windows (x64)
- Linux (amd64 / x86_64)
- Raspberry Pi (aarch64)
- macOS (aarch64 / Apple Silicon)
Drop the binary, run it — it daemonizes itself and starts listening for WebSocket connections from your ESP32s or other clients.
How Function Calling Works
Voice Assistant acts as a local execution layer. The model decides what to call; Voice Assistant decides how to run it.
Responsibility split:
Voice Assistant (local):
- WebSocket session management (`/ws/voice`)
- Audio pipeline: buffering, VAD, echo cancellation, resampling
- Parsing the `functions.json` configuration
- Executing actions: HTTP requests, MQTT publish, GPIO commands, OS shell calls
- Connection orchestration, keepalives, heartbeat, and the audio playback API
AI model (remote):
- Speech recognition and natural language understanding (STT / NLP)
- Voice response synthesis (TTS / audio generation)
- Intent routing — deciding whether to invoke a function and which one
Request lifecycle:
1. User speaks; the audio stream arrives at Voice Assistant.
2. Voice Assistant forwards the stream to the AI model (e.g. Gemini Live).
3. The model parses intent and emits a `toolCall` event (function name, arguments, ID).
4. Voice Assistant looks up the function in the local `functions.json`.
5. The bound handler executes the action.
6. The result is sent back to the model as a `toolResponse`.
7. The model evaluates the outcome and speaks the result to the user.
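The lookup-execute-respond portion of this lifecycle can be sketched as a small dispatcher. The event shape, handler, and registry below are assumptions for illustration, not Voice Assistant's actual wire format:

```python
# Hypothetical handler standing in for a WEBHOOK/MQTT/GPIO action.
def toggle_light(room: str) -> dict:
    return {"room": room, "state": "on"}

# Stand-in for a parsed functions.json: function name -> bound handler.
REGISTRY = {"toggle_light": toggle_light}

def handle_tool_call(event: dict) -> dict:
    """Look up the function locally, run it, and build a toolResponse."""
    name = event["name"]
    handler = REGISTRY.get(name)
    if handler is None:
        return {"id": event["id"], "error": f"unknown function: {name}"}
    result = handler(**event.get("args", {}))
    return {"id": event["id"], "result": result}

# A model-emitted toolCall (shape is illustrative):
call = {"id": "tc-1", "name": "toggle_light", "args": {"room": "kitchen"}}
response = handle_tool_call(call)
```

Note that the model only ever sees the `toolResponse` dict; the handler itself, and whatever it touches on the LAN, stays local.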
Built-in Action Handlers
Out of the box, Voice Assistant supports the following integration types:
- `WEBHOOK`: fire HTTP requests to local services. Pairs well with n8n, Node-RED, or Home Assistant.
- `MQTT`: publish messages to a local MQTT broker.
- `GPIO`: send pin control commands directly over the active client connection.
- `EXEC`: run shell commands on the host machine where Voice Assistant is running.
- `SYSTEM`: internal actions such as `start_session` or `close_session`.
- `MCP`: MCP integration in both directions: act as an MCP client calling external MCP servers, or as an MCP server exposing TTS to external MCP clients.
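The exact `functions.json` schema isn't documented in this section, so the following is an illustrative shape (field names, URL, and topic are assumptions) showing how a `WEBHOOK` and an `MQTT` function might be registered:

```json
{
  "functions": [
    {
      "name": "toggle_light",
      "type": "WEBHOOK",
      "description": "Turn a light on or off",
      "url": "http://127.0.0.1:8123/api/webhook/toggle_light"
    },
    {
      "name": "set_thermostat",
      "type": "MQTT",
      "description": "Set the target temperature",
      "topic": "home/thermostat/set"
    }
  ]
}
```

Whatever the real schema, the `description` text matters: it is what the model sees when deciding which function to invoke.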
TTS and Audio Output
Voice Assistant supports two feedback channels for pushing audio back to your devices:
- WAV playback — stream a pre-recorded audio file to a specific speaker client.
- Text-to-speech — synthesize arbitrary text using the model's voice, with optional emotional inflection based on message context.
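For the WAV playback channel, the device ultimately receives ordinary PCM audio. A stdlib-only sketch that writes a short 16-bit mono test tone of the kind you could hand to the assistant (the filename, sample rate, and tone are arbitrary choices, not requirements of Voice Assistant):

```python
import math
import struct
import wave

RATE = 16000      # 16 kHz mono, a common rate for small voice hardware
SECONDS = 0.25
FREQ = 440.0      # A4 test tone

with wave.open("chime.wav", "wb") as w:
    w.setnchannels(1)     # mono
    w.setsampwidth(2)     # 16-bit PCM samples
    w.setframerate(RATE)
    for i in range(int(RATE * SECONDS)):
        sample = int(32767 * 0.3 * math.sin(2 * math.pi * FREQ * i / RATE))
        w.writeframes(struct.pack("<h", sample))
```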
Bottom Line
If you want a reliable "say it → it happens locally" pipeline, Voice Assistant's architecture is built exactly for that. You get the full capability of a state-of-the-art AI as your voice interface, while keeping execution, security, and network access entirely within your own infrastructure.