DIY Voice Input: Navigating Trade-Offs and Engineering Solutions
When building a custom voice interface, audio input is the hardest part, and it is full of engineering trade-offs. This is where you come face-to-face with full-duplex versus half-duplex operation, digital signal processing (DSP), and acoustic echo cancellation (AEC).
The Feedback Loop: Why Can't You Just Interrupt the AI?
In the early stages of the project, we quickly got Wake Word detection up and running and established a working full-duplex audio stream between the ESP32-S3 and our server. We even implemented interruption across the entire stack, from the microcontroller up to the AI model and back.
However, without robust echo cancellation, a full-duplex system degrades as soon as the AI starts talking. When the speaker plays the AI's response, the microphone picks up that same audio and feeds it right back to the model. The AI effectively hears itself, treats its own voice as new input, and starts "stuttering."
Our research showed that a bare ESP32-S3 cannot handle real-time echo cancellation adequately on its own; it needs either a specialized microphone array or a dedicated hardware DSP. An attempt to solve the problem by integrating WebRTC also fell short.
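To make concrete what echo cancellation has to do, here is a minimal sketch of a normalized LMS (NLMS) adaptive filter, the textbook building block behind many AEC designs. It is not the project's code, and it is not what ESP-SR or WebRTC run internally. The signals, the four-tap echo path, and every name in it are hypothetical, chosen only to show the idea: the filter learns how the speaker signal reaches the microphone and subtracts that estimate.

```python
import random


def nlms_echo_cancel(far_end, mic, taps=8, mu=0.5, eps=1e-6):
    """Subtract an adaptive estimate of the speaker echo from the mic signal."""
    weights = [0.0] * taps   # current estimate of the speaker-to-mic echo path
    history = [0.0] * taps   # most recent far-end samples, newest first
    cleaned = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate            # mic signal with the echo removed
        norm = sum(h * h for h in history) + eps
        step = mu * error / norm             # normalized step size
        weights = [w + step * h for w, h in zip(weights, history)]
        cleaned.append(error)
    return cleaned


def energy(samples):
    return sum(s * s for s in samples) / len(samples)


random.seed(0)
# Toy far-end signal: white noise stands in for the AI's speech.
far_end = [random.uniform(-1.0, 1.0) for _ in range(4000)]
# Hypothetical echo path: the mic hears a delayed, attenuated copy of the speaker.
echo_path = [0.0, 0.6, 0.3, -0.1]
mic = [
    sum(echo_path[k] * far_end[n - k] for k in range(len(echo_path)) if n - k >= 0)
    for n in range(len(far_end))
]

cleaned = nlms_echo_cancel(far_end, mic)
raw_energy = energy(mic[-1000:])
residual_energy = energy(cleaned[-1000:])
print("echo energy before:", raw_energy)
print("echo energy after: ", residual_energy)
```

Even this toy version does several multiply-accumulate operations per tap for every audio sample. A real room echo needs a far longer filter plus handling for simultaneous speech and background noise, which gives a sense of why this is a heavy real-time load for a small microcontroller.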
The Bottom Line: For now, the system operates strictly in half-duplex mode. While the model is speaking, it cannot hear you. This is a conscious trade-off we made in favor of stability. For those who absolutely need full-duplex, we've developed a lightweight Mac app: with headphones on, the speaker output never reaches the microphone, so the echo problem disappears and the system works flawlessly.
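The half-duplex rule can be sketched as a small gate between the microphone and the uplink. This is an illustration only, not the project's firmware: the class, method, and event names are hypothetical, and a real implementation would also have to deal with buffered audio that is still playing after the last chunk arrives.

```python
from enum import Enum, auto


class Mode(Enum):
    LISTENING = auto()   # microphone frames are forwarded to the server
    SPEAKING = auto()    # AI audio is playing, microphone frames are dropped


class HalfDuplexGate:
    """Hypothetical gate that sits between the microphone and the uplink."""

    def __init__(self):
        self.mode = Mode.LISTENING

    def playback_started(self):
        self.mode = Mode.SPEAKING

    def playback_finished(self):
        self.mode = Mode.LISTENING

    def filter_mic_frame(self, frame):
        """Return the frame to forward, or None while the AI is speaking."""
        return frame if self.mode is Mode.LISTENING else None


gate = HalfDuplexGate()
forwarded = []

# Simulated timeline: the user speaks, the AI answers, the user speaks again.
timeline = [
    ("mic", b"user-1"),
    ("play_start", None),
    ("mic", b"echo-of-ai"),   # speaker audio leaking into the microphone
    ("play_end", None),
    ("mic", b"user-2"),
]
for event, payload in timeline:
    if event == "play_start":
        gate.playback_started()
    elif event == "play_end":
        gate.playback_finished()
    else:
        frame = gate.filter_mic_frame(payload)
        if frame is not None:
            forwarded.append(frame)

print(forwarded)
```

Dropping frames at the source is the simplest way to guarantee the model never hears its own voice. The cost is exactly the one described above: anything the user says during playback is lost, so interruption is not possible in this mode.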
Hardware: Keep It Simple and Accessible
The current smart speaker architecture is radically simplified so anyone can replicate it. You only need three components:
- ESP32-S3 (the computing core)
- Speaker amplifier
- Microphone
These are standard off-the-shelf DIY modules that are incredibly easy to wire up on a breadboard or pack into a custom enclosure.
For Advanced Makers: We've included Gerber files in the repository for a custom printed circuit board (PCB). You can easily have them fabricated at JLCPCB or any other board house, giving you a professional-grade device for pennies.
Software: Skip the IDE Headaches
Typically, ESP32 projects require installing heavy toolchains, configuring environments, and sitting through agonizing compile times. We took a different route.
Since our Wake Word and noise filtering rely on specific Espressif libraries (ESP-SR) that are notoriously tricky to integrate into a standard Arduino sketch, we built a streamlined auto-configuration system instead.
- Terminal-Based Config: The configure_settings script launches a user-friendly, text-based UI directly in your console. Wi-Fi credentials, model selection, API tokens: everything is set up right there.
- Two-Command Flashing: You don't need an IDE at all. The run_upload script handles a clean build from scratch, while run_upload_firmware updates only your specific settings.
All dependencies are pulled in automatically. The entire process boils down to three simple steps: clone the repo, run the config script, and flash the board. Schematics, comprehensive documentation, and pre-compiled binaries are already waiting for you in the repository.