Playing Audio on Your Device

The voice-assistant server gives you two ways to trigger audio playback on a connected device:

1. Play a pre-recorded WAV file directly.
2. Synthesize speech from text using the model's built-in TTS (via Gemini).

Option 1: Playing a WAV File (No TTS)

Endpoint:

POST /api/device/playback/wav
Authorization: Bearer <device_token>
Content-Type: multipart/form-data
Body: one form field named file containing the WAV file (curl syntax: file=@your_audio.wav)

Example:

curl -k -X POST https://<voice-assistant-ip-address>:8100/api/device/playback/wav \
  -H "Authorization: Bearer YOUR_DEVICE_TOKEN" \
  -F "file=@alert.wav"
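
For callers that are not using curl, the same upload can be sketched in Python with only the standard library. This is a minimal sketch, not the server's official client: the function name build_wav_upload_request is hypothetical, and the audio/wav part type is an assumption (the source does not say which part type the server expects). The function only builds the request and does not send it.

```python
import urllib.request
import uuid


def build_wav_upload_request(
    base_url: str, token: str, filename: str, wav_bytes: bytes
) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST with one WAV file.

    Hypothetical helper: the endpoint path and the "file" field name come
    from the curl example; everything else is an illustrative assumption.
    """
    boundary = uuid.uuid4().hex  # random string separating the form parts
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: audio/wav\r\n"  # assumed part type
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/device/playback/wav",
        data=head + wav_bytes + tail,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

To send the request, pass it to urllib.request.urlopen. The curl example uses -k, which skips TLS certificate verification, so a server with a self-signed certificate would need an ssl.SSLContext configured to match.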

What happens under the hood:

The server identifies your device by its token.
Your WAV file is read and converted locally to the target format: PCM 16-bit, mono, 16 kHz.
The converted audio is queued and delivered to your device.
The session is automatically closed once playback finishes.
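
Since the server converts every upload to PCM 16-bit, mono, 16 kHz, it can be useful to check beforehand whether a file already matches that format. The sketch below uses Python's standard wave module. The function name needs_conversion and the constants are hypothetical, and the check runs on the client side; it says nothing about how the server performs its own conversion.

```python
import io
import wave

# Target format as stated in the steps above.
TARGET_CHANNELS = 1        # mono
TARGET_SAMPLE_WIDTH = 2    # bytes per sample, i.e. 16-bit
TARGET_RATE_HZ = 16_000    # 16 kHz


def needs_conversion(wav_bytes: bytes) -> bool:
    """Return True if the WAV data differs from the server's target format.

    Note: the wave module reads uncompressed PCM WAV files; other encodings
    raise wave.Error, which this sketch does not catch.
    """
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        actual = (wav.getnchannels(), wav.getsampwidth(), wav.getframerate())
    return actual != (TARGET_CHANNELS, TARGET_SAMPLE_WIDTH, TARGET_RATE_HZ)
```

A file that already matches should pass through the server's conversion step essentially unchanged, which makes the played result easier to predict.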

Note: In this mode, the model is never involved. Apart from the local format conversion, your audio is played back unchanged.

Option 2: Text-to-Speech Playback (TTS via Model)

Endpoint:

POST /api/device/playback/text
Authorization: Bearer <device_token>
Content-Type: application/json

Example:

curl -k -X POST https://<voice-assistant-ip-address>:8100/api/device/playback/text \
  -H "Authorization: Bearer YOUR_DEVICE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "message": "Warning, the door is open",
    "specialSession": "READER",
    "closeClientAfterCompletion": true
  }'
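
The same call can be sketched in Python using only the standard library. The function name build_tts_request is hypothetical; the JSON fields and their values are copied from the curl example, and the source does not document which other values the server accepts. The function builds the request without sending it.

```python
import json
import urllib.request


def build_tts_request(base_url: str, token: str, message: str) -> urllib.request.Request:
    """Build (but do not send) the JSON POST for text-to-speech playback.

    Hypothetical helper: field names and values mirror the curl example.
    """
    payload = {
        "message": message,
        "specialSession": "READER",
        "closeClientAfterCompletion": True,
    }
    return urllib.request.Request(
        url=f"{base_url}/api/device/playback/text",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

To send it, pass the request to urllib.request.urlopen. As with the curl example's -k flag, a server with a self-signed certificate would need an ssl.SSLContext that relaxes verification.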

What happens under the hood:

The server spins up a temporary one-time special session for your device.
It sends a prompt to the model along the lines of "Read the following text aloud, word for word..."
The model synthesizes the audio and streams it directly to your device.
Once the speech finishes, the session is automatically torn down.

The voice used for TTS is pulled from your FunctionSet.

Tip: You can embed emotional cues such as (urgent), (questioning), or (cheerful) directly in the text to add vocal expression to the synthesized speech. For example: "message": "Warning! (urgent) Severe weather alert! Black ice and hazardous driving conditions are expected tomorrow."

What Runs Locally vs. What Hits the Model

Local	Model
Auth & routing	Voice synthesis
Session management	Audio stream output
Audio queuing & delivery
Session teardown

Everything except the actual voice generation happens on your own machine. In TTS mode, the model's only job is to turn text into audio and hand it back; in WAV mode, the model is not involved at all.