Playing Audio on Your Device
The voice-assistant server gives you two ways to trigger audio playback on a connected device:
- Play a pre-recorded WAV file directly.
- Synthesize speech from text using the model's built-in TTS (via Gemini).
Option 1: Playing a WAV File (No TTS)
Endpoint:
POST /api/device/playback/wav
Authorization: Bearer <device_token>
Content-Type: multipart/form-data
file=@your_audio.wav
Example:
curl -k -X POST https://<voice-assistant-ip-address>:8100/api/device/playback/wav \
-H "Authorization: Bearer YOUR_DEVICE_TOKEN" \
-F "file=@alert.wav"
What happens under the hood:
- The server identifies your device by its token.
- Your WAV file is read and converted locally to the target format: PCM 16-bit, mono, 16 kHz.
- The session is automatically closed once playback finishes.
Note: In this mode, the model is never involved. Apart from the local format conversion described above, your original audio is played unchanged.
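To see whether a file already matches the target format before uploading it, a check along the following lines may help. This is a minimal sketch using Python's standard `wave` module. The helper names `matches_target_format` and `write_silence` are hypothetical and not part of the server. The server converts the file either way, so the check is optional.

```python
import wave

# Target format stated above: PCM 16-bit, mono, 16 kHz.
TARGET_CHANNELS = 1
TARGET_SAMPLE_WIDTH_BYTES = 2  # 16-bit samples
TARGET_SAMPLE_RATE_HZ = 16000


def matches_target_format(path: str) -> bool:
    """Return True if the WAV file is already 16-bit mono at 16 kHz."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == TARGET_CHANNELS
            and wav.getsampwidth() == TARGET_SAMPLE_WIDTH_BYTES
            and wav.getframerate() == TARGET_SAMPLE_RATE_HZ
        )


def write_silence(path: str, seconds: float = 0.1,
                  rate: int = TARGET_SAMPLE_RATE_HZ,
                  channels: int = TARGET_CHANNELS) -> None:
    """Write a short silent 16-bit WAV file, handy for trying out the check."""
    frame_count = int(seconds * rate)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(TARGET_SAMPLE_WIDTH_BYTES)
        wav.setframerate(rate)
        # One zero-valued 16-bit sample per channel per frame.
        wav.writeframes(b"\x00\x00" * frame_count * channels)
```

A file that fails the check can still be uploaded; it only means the server has to resample or downmix it.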
Option 2: Text-to-Speech Playback (TTS via Model)
Endpoint:
POST /api/device/playback/text
Authorization: Bearer <device_token>
Content-Type: application/json
Example:
curl -k -X POST https://<voice-assistant-ip-address>:8100/api/device/playback/text \
-H "Authorization: Bearer YOUR_DEVICE_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"message": "Warning, the door is open",
"specialSession": "READER",
"closeClientAfterCompletion": true
}'
What happens under the hood:
- The server spins up a temporary one-time special session for your device.
- It sends a prompt to the model along the lines of "Read the following text aloud, word for word..."
- The model synthesizes the audio and streams it directly to your device.
- Once the speech finishes, the session is automatically torn down.
The voice used for TTS is pulled from your FunctionSet.
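The same request can be built from a script rather than curl. The sketch below uses only the Python standard library and mirrors the endpoint, headers, and JSON fields from the curl example above. The function name `build_tts_request` and its parameters are hypothetical. It builds the request without sending it.

```python
import json
import urllib.request


def build_tts_request(base_url: str, device_token: str,
                      message: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for text playback."""
    # Field names and values are copied from the curl example.
    payload = {
        "message": message,
        "specialSession": "READER",
        "closeClientAfterCompletion": True,
    }
    return urllib.request.Request(
        url=f"{base_url}/api/device/playback/text",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {device_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it. The curl example uses `-k`, which skips TLS certificate verification, so a server with a self-signed certificate would likely need a matching `ssl` context passed through the `context` argument of `urlopen`.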
Tip: You can embed emotional cues such as (urgent), (questioning), or (cheerful) directly in the text to add vocal expression to the synthesized speech. For example:
"message": "Warning! (urgent) Severe weather alert! Black ice and hazardous driving conditions are expected tomorrow."
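If messages are assembled in code, a small helper can keep the cue syntax consistent. This is a sketch only. The name `with_cue` is hypothetical, and placing the cue directly before the sentence it should affect is an assumption based on the example above, not documented behavior.

```python
def with_cue(text: str, cue: str) -> str:
    """Prefix text with a parenthesized cue such as '(urgent)'."""
    # Accept the cue with or without its surrounding parentheses.
    cue = cue.strip().strip("()")
    if not cue:
        return text
    return f"({cue}) {text}"


# Cue placement mirrors the example: after the opener, before the alert.
message = "Warning! " + with_cue("Severe weather alert!", "urgent")
```

Here `message` holds "Warning! (urgent) Severe weather alert!", which can be used as the "message" field of the request body.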
What Runs Locally vs. What Hits the Model
| Local | Model |
|---|---|
| Auth & routing | Voice synthesis |
| Session management | Audio stream output |
| Audio queuing & delivery | |
| Session teardown | |
Everything except the actual voice generation happens on your own machine. The model's only job is to turn text into audio and hand it back.