Knowledge Base

Playing Audio on Your Device

The voice-assistant server gives you two ways to trigger audio playback on a connected device:

  1. Play a pre-recorded WAV file directly.
  2. Synthesize speech from text using the model's built-in TTS (via Gemini).


Option 1: Playing a WAV File (No TTS)


Endpoint:

POST /api/device/playback/wav
Authorization: Bearer <device_token>
Content-Type: multipart/form-data
Form field: file (the WAV file to upload)

Example:

curl -k -X POST https://<voice-assistant-ip-address>:8100/api/device/playback/wav \
  -H "Authorization: Bearer YOUR_DEVICE_TOKEN" \
  -F "file=@alert.wav"
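
For callers that prefer Python over curl, the same upload can be sketched with the standard library alone. This is a minimal sketch, not an official client: the function name build_wav_upload, the example address, and the application/octet-stream part type are assumptions. Only the path, the Bearer header, and the "file" form field come from the endpoint description above.

```python
import urllib.request
import uuid


def build_wav_upload(base_url, device_token, filename, wav_bytes):
    """Build (but do not send) a multipart/form-data POST carrying one WAV file.

    build_wav_upload is a hypothetical helper name, not part of the server.
    """
    boundary = uuid.uuid4().hex
    # One form part named "file", mirroring curl's -F "file=@alert.wav".
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        # Assumption: a generic binary part type is acceptable to the server.
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/device/playback/wav",
        data=head + wav_bytes + tail,
        method="POST",
        headers={
            "Authorization": f"Bearer {device_token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

To send the request, pass it to urllib.request.urlopen. Note that the curl example uses -k, which skips TLS certificate verification; urlopen verifies certificates by default, so a server with a self-signed certificate needs either a trusted CA bundle or an explicitly relaxed ssl context.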

What happens under the hood:

  • The server identifies your device by its token.
  • Your WAV file is read and converted locally to the target format: PCM 16-bit, mono, 16 kHz.
  • The session is automatically closed once playback finishes.

Note: The model is never involved in this mode. The device plays your own recording; the only change made to it is the local format conversion described above.
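
Because the server converts uploads to PCM 16-bit, mono, 16 kHz, it can be useful to check what format a file is in before sending it. The following is a minimal sketch using Python's standard wave module. The helper names describe_wav and matches_target are hypothetical, and a file that does not match is not necessarily rejected, since the server performs the conversion itself.

```python
import wave

# Target playback format named in this article: PCM 16-bit, mono, 16 kHz.
TARGET = {"channels": 1, "sample_width_bytes": 2, "sample_rate_hz": 16000}


def describe_wav(path):
    """Return the channel count, sample width, and sample rate of a WAV file.

    wave.open raises wave.Error for files it cannot parse (for example,
    compressed WAV variants).
    """
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
        }


def matches_target(path):
    """True if the file is already in the target playback format."""
    return describe_wav(path) == TARGET
```

A file that already matches the target format should need little or no resampling on the server side, which makes it a reasonable default for short alert sounds.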



Option 2: Text-to-Speech Playback (TTS via Model)


Endpoint:

POST /api/device/playback/text
Authorization: Bearer <device_token>
Content-Type: application/json

Example:

curl -k -X POST https://<voice-assistant-ip-address>:8100/api/device/playback/text \
  -H "Authorization: Bearer YOUR_DEVICE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "message": "Warning, the door is open",
    "specialSession": "READER",
    "closeClientAfterCompletion": true
  }'
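
The same request can be built from Python with the standard library. This is a sketch under stated assumptions: build_text_playback and send are hypothetical helper names, the default values simply copy the curl example above, and the shape of the server's response is not documented here, so send returns only the HTTP status code.

```python
import json
import ssl
import urllib.request


def build_text_playback(base_url, device_token, message,
                        special_session="READER", close_client=True):
    """Build (but do not send) the JSON POST for text-to-speech playback."""
    payload = {
        "message": message,
        "specialSession": special_session,
        "closeClientAfterCompletion": close_client,
    }
    return urllib.request.Request(
        f"{base_url}/api/device/playback/text",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {device_token}",
            "Content-Type": "application/json",
        },
    )


def send(request, verify_tls=True):
    """Send a prepared request and return the HTTP status code.

    verify_tls=False mirrors curl's -k flag; use it only for self-signed
    certificates on a network you trust.
    """
    context = ssl.create_default_context()
    if not verify_tls:
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context) as response:
        return response.status
```

Keeping request construction separate from sending makes the payload easy to inspect or log before anything reaches the device.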

What happens under the hood:

  • The server spins up a temporary one-time special session for your device.
  • It sends a prompt to the model along the lines of "Read the following text aloud, word for word..."
  • The model synthesizes the audio and streams it directly to your device.
  • Once the speech finishes, the session is automatically torn down.

The voice used for TTS is pulled from your FunctionSet.

Tip: You can embed emotional cues such as (urgent), (questioning), or (cheerful) directly in the text to add vocal expression to the synthesized speech. For example:

"message": "Warning! (urgent) Severe weather alert! Black ice and hazardous driving conditions are expected tomorrow."
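
When messages are assembled in code, a small helper can keep the cue formatting consistent. This sketch assumes a cue is simply a word in parentheses placed before the sentence it should color, as in the example above. The function name with_cue is hypothetical, and which cue words the model actually responds to is not specified here.

```python
import json


def with_cue(cue, text):
    """Prefix text with a parenthesized cue such as (urgent)."""
    # Accept "urgent" or "(urgent)" and normalize to the bare word.
    name = cue.strip().strip("()").strip()
    if not name:
        raise ValueError("cue must not be empty")
    return f"({name}) {text}"


message = "Warning! " + with_cue("urgent", "Severe weather alert!")
# message is now: Warning! (urgent) Severe weather alert!

# The message then goes into the JSON body of the text playback request.
body = json.dumps({"message": message})
```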


What Runs Locally vs. What Hits the Model


Local                       Model
Auth & routing              Voice synthesis
Session management          Audio stream output
Audio queuing & delivery
Session teardown

Everything except the actual voice generation happens on your own machine — the model's only job is to turn text into audio and hand it back.