Template — architecture pattern, not a starter kit
1. Inbound call via Twilio/Vonage

Telephony provider routes call to your osModa server.

2. STT + LLM reasoning

Transcribe speech, process with LLM, generate response.

3. TTS + respond

Convert LLM response to speech and deliver to caller.

Deploy This Pattern

Recommended: Pro · $34.99/mo + Twilio/Vonage subscription

Voice Agent Template

This template describes the architecture for a voice/telephony agent on osModa. An inbound call arrives via Twilio, Vonage, or a similar telephony provider, is routed to your voice server running on osModa, transcribed to text (STT), processed by an LLM for reasoning and response generation, converted back to speech (TTS), and delivered to the caller. osmoda-watch keeps the voice server running 24/7, and every call is logged to the SHA-256 audit ledger.

This is an architecture pattern, not a downloadable voice assistant. It describes how call audio flows through your osModa server, which daemons handle each concern, and how to integrate with external telephony and LLM APIs. You bring your own STT/TTS engines (Whisper, Deepgram, ElevenLabs, or similar) and telephony provider.

Important

osModa does not include telephony infrastructure. You need a separate Twilio, Vonage, or similar subscription for phone numbers and call routing. osModa provides the server that runs your voice processing logic -- STT, LLM reasoning, TTS, and call state management.

TL;DR

  • Telephony via Twilio/Vonage (separate subscription) -- osModa hosts the processing server
  • Data flow: Inbound call → STT → LLM reasoning → TTS → Response
  • osmoda-egress allowlists Twilio/Vonage API and your LLM provider endpoint
  • osmoda-watch keeps voice server running 24/7 with auto-restart on crash
  • SHA-256 audit ledger logs every call: timestamps, duration, transcript, response
  • osmoda-routines for scheduled outbound campaigns (reminders, follow-ups)
  • Recommended plan: Pro ($34.99/mo) for real-time audio processing

Architecture Diagram

The data flow for a voice/telephony agent on osModa.

┌──────────────────────────────────────────┐
│         INBOUND CALL                     │
│  caller dials your Twilio/Vonage number  │
│  telephony provider routes to osModa     │
└──────────────────┬───────────────────────┘
                   ▼
┌──────────────────────────────────────────┐
│         VOICE SERVER (your code)         │
│  running on osModa                       │
│  supervised by osmoda-watch (24/7)       │
│  receives audio stream from provider     │
└──────────────────┬───────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────┐
│         SPEECH-TO-TEXT (STT)             │
│  Whisper, Deepgram, or similar           │
│  transcribe caller audio to text         │
└──────────────────┬───────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────┐
│         LLM REASONING                    │
│  via osmoda-egress allowlisted API       │
│  Claude / GPT-4o / o3-mini               │
│  understand intent, generate response    │
└──────────────────┬───────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────┐
│         TEXT-TO-SPEECH (TTS)             │
│  ElevenLabs, Coqui, or similar           │
│  convert LLM response to audio           │
└──────────────────┬───────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────┐
│         RESPONSE                         │
│  stream synthesized audio back to caller │
│  via Twilio/Vonage connection            │
└──────────────────────────────────────────┘

┌──────────────────────────────────────────┐
│  osmoda-routines: scheduled outbound     │
│  campaigns (reminders, follow-ups)       │
└──────────────────────────────────────────┘
┌──────────────────────────────────────────┐
│  AUDIT LEDGER (SHA-256)                  │
│  logs every call: time, transcript, resp │
└──────────────────────────────────────────┘
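
The flow in the diagram can be sketched as a single function per caller utterance. This is a minimal illustration, not osModa code: the names CallSession and handle_turn and the three stage signatures are assumptions made for the example, and the stand-in stages at the bottom replace real STT, LLM, and TTS clients so the sketch runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical stage signatures; swap in your real STT, LLM, and TTS clients.
SttFn = Callable[[bytes], str]                  # caller audio -> transcript
LlmFn = Callable[[List[Tuple[str, str]]], str]  # (role, text) history -> reply
TtsFn = Callable[[str], bytes]                  # reply text -> audio


@dataclass
class CallSession:
    """Conversation state kept for the duration of one call."""
    call_id: str
    history: List[Tuple[str, str]] = field(default_factory=list)


def handle_turn(session: CallSession, audio: bytes,
                stt: SttFn, llm: LlmFn, tts: TtsFn) -> bytes:
    """Run one caller utterance through STT -> LLM -> TTS."""
    transcript = stt(audio)
    session.history.append(("caller", transcript))
    reply = llm(session.history)  # the LLM sees the whole conversation so far
    session.history.append(("agent", reply))
    return tts(reply)


if __name__ == "__main__":
    # Stand-in stages so the sketch runs without any external service.
    session = CallSession(call_id="demo-call")
    audio_out = handle_turn(
        session,
        b"book a table",
        stt=lambda audio: audio.decode("utf-8"),
        llm=lambda history: "You said: " + history[-1][1],
        tts=lambda text: text.encode("utf-8"),
    )
    print(audio_out)
```

Passing the stages in as plain callables keeps the orchestration independent of any one vendor, so a local Whisper process and a cloud STT API can be swapped without touching the call flow.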

Components

The building blocks of this voice agent architecture.

Telephony Provider

Twilio, Vonage, or similar service that provides phone numbers and call routing. Not included in osModa -- requires a separate subscription. When a call arrives, the provider requests the webhook URL you configured for the number, then delivers call audio to your voice server over WebSocket or SIP.
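
With Twilio, one common way to hand a call to your own server is to answer the webhook with TwiML that connects the call to a WebSocket media stream. The sketch below builds that response with the standard library. The function name connect_stream_twiml and the wss:// URL are placeholders, and other providers such as Vonage use a different call-control format.

```python
import xml.etree.ElementTree as ET


def connect_stream_twiml(websocket_url: str) -> str:
    """Build TwiML that connects the call to a bidirectional media stream."""
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    ET.SubElement(connect, "Stream", {"url": websocket_url})
    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>' + body


if __name__ == "__main__":
    # Placeholder hostname; use the public address of your voice server.
    print(connect_stream_twiml("wss://voice.example.com/media"))
```

Your webhook handler would return this string with a Content-Type of text/xml.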

Voice Server

Your code running on osModa that orchestrates the call flow. Receives audio from the telephony provider, coordinates STT, LLM, and TTS stages, and streams the response back. Supervised by osmoda-watch for 24/7 uptime.
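
As an illustration of the "receives audio" step, the sketch below consumes the JSON messages Twilio Media Streams sends over the WebSocket: a "start" event carrying the stream identifier, "media" events carrying base64-encoded 8 kHz mu-law audio, and a "stop" event. The class name AudioBuffer is an assumption for the example, the WebSocket server itself is left out, and other providers frame their audio differently.

```python
import base64
import json
from typing import Optional


class AudioBuffer:
    """Collects caller audio from Twilio Media Streams messages."""

    def __init__(self) -> None:
        self.stream_sid: Optional[str] = None
        self.audio = bytearray()
        self.finished = False

    def feed(self, raw_message: str) -> None:
        """Handle one text frame received on the media WebSocket."""
        msg = json.loads(raw_message)
        event = msg.get("event")
        if event == "start":
            self.stream_sid = msg["start"]["streamSid"]
        elif event == "media":
            # Payload is base64-encoded mu-law audio sampled at 8 kHz.
            self.audio.extend(base64.b64decode(msg["media"]["payload"]))
        elif event == "stop":
            self.finished = True
        # Other events (for example "connected" or "mark") are ignored here.
```

The accumulated bytes are what you would pass to the STT stage, usually in chunks split on silence rather than at the end of the call.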

STT Engine

Speech-to-text engine that transcribes caller audio. You can run Whisper locally on the server, or use a cloud STT API (Deepgram, Google, etc.) through osmoda-egress. Local STT avoids per-minute API charges but uses more server CPU.

LLM Processor

Sends the transcript to an LLM (Claude Opus/Sonnet/Haiku, GPT-4o, o3-mini) via osmoda-egress for intent understanding and response generation. The LLM interprets what the caller wants and generates a natural language response.

TTS Engine

Text-to-speech engine that converts the LLM response to audio. ElevenLabs, Coqui TTS (local), or cloud TTS APIs via osmoda-egress. The synthesized audio is streamed back to the caller through the telephony provider connection.

Call Logger

Every call is logged to the SHA-256 audit ledger: timestamps, call duration, caller ID, transcript, LLM response, and a tamper-evident hash. Provides a complete audit trail of all voice interactions.
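
To show what "tamper-evident" means in practice, here is a generic hash-chain sketch: each entry's SHA-256 hash covers the previous entry's hash plus the record, so editing any past record invalidates every later hash. The actual on-disk format of the osModa ledger is not described here; the field names, the GENESIS value, and both functions are assumptions for illustration only.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # assumed starting value for the first entry's prev_hash


def _digest(prev_hash: str, record: Dict[str, Any]) -> str:
    # Canonical JSON so the same record always hashes to the same value.
    body = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(ledger: List[Dict[str, Any]], record: Dict[str, Any]) -> Dict[str, Any]:
    """Append a call record, chaining it to the previous entry's hash."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    entry = {"record": record, "prev_hash": prev_hash,
             "hash": _digest(prev_hash, record)}
    ledger.append(entry)
    return entry


def verify(ledger: List[Dict[str, Any]]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in ledger:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True
```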

osModa Features Used

The specific daemons and platform capabilities this template relies on.

osmoda-egress

Allowlists the Twilio/Vonage API, your LLM provider endpoint (Anthropic, OpenAI), and any cloud STT/TTS APIs you use. Blocks all other outbound connections. Prevents the voice server from reaching unauthorized services.
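
The allowlist is enforced by osmoda-egress itself, and its configuration syntax is not shown here. As a complement, an application-side pre-check like the sketch below can fail fast with a clear error before a blocked request is even attempted. The host set mirrors the endpoints named in this template; the function name is_allowed is an assumption.

```python
from urllib.parse import urlsplit

# Hosts named in this template; add any cloud STT/TTS hosts you use.
ALLOWED_HOSTS = frozenset({
    "api.twilio.com",
    "api.vonage.com",
    "api.anthropic.com",
    "api.openai.com",
})


def is_allowed(url: str, allowed: frozenset = ALLOWED_HOSTS) -> bool:
    """Return True only for HTTPS URLs whose exact host is on the allowlist."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return parts.scheme == "https" and host in allowed
```

Matching the exact hostname, rather than a prefix or substring, prevents a lookalike such as api.twilio.com.evil.example from passing the check.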

osmoda-watch

Keeps the voice server running 24/7. If the process crashes, osmoda-watch restarts it immediately. Critical for telephony -- a down server means missed calls. Register the voice server with an immediate-restart policy and no backoff delay.
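
The restart policy described above can be pictured as the loop below. It is not how osmoda-watch is implemented and is not a replacement for it; it is a local sketch of "restart immediately, no backoff" that can be handy for rehearsing crash recovery during development. The function name supervise and the max_restarts cap are assumptions.

```python
import subprocess
import sys
from typing import List, Optional


def supervise(cmd: List[str], max_restarts: Optional[int] = None) -> int:
    """Run cmd and restart it immediately whenever it exits.

    Returns the number of restarts performed. With max_restarts=None the
    loop never returns, which is the behaviour a 24/7 service wants.
    """
    restarts = 0
    while True:
        exit_code = subprocess.call(cmd)
        if max_restarts is not None and restarts >= max_restarts:
            return restarts
        restarts += 1
        # No sleep here: the next start follows the exit with no backoff.
        print(f"exit code {exit_code}, restart #{restarts}", file=sys.stderr)


if __name__ == "__main__":
    # A stand-in "server" that crashes at once, restarted twice.
    crashing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(supervise(crashing, max_restarts=2))
```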

SHA-256 Audit Ledger

Logs every call with tamper-evident SHA-256 hashes. Records call timestamps, duration, transcripts, LLM responses, and outcomes. Essential for compliance, quality assurance, and debugging voice interaction issues.

osmoda-routines

Schedules outbound call campaigns via cron. Appointment reminders, follow-up calls, survey campaigns -- any recurring outbound calling pattern. Your agent initiates calls through the Twilio/Vonage API at the scheduled times.
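
As a sketch of what the scheduled job might run, the function below builds the HTTP request that creates an outbound call through Twilio's REST API (a form-encoded POST to the account's Calls resource, authenticated with HTTP Basic). It only builds the request and does not send it. The function name, the placeholder credentials, and the cron lines in the comments are assumptions; the osmoda-routines configuration format is not shown here, so standard five-field cron syntax is assumed.

```python
import base64
import urllib.parse
import urllib.request

# Assumed standard cron syntax for the schedule:
#   0 9 * * *   -> every day at 09:00 (appointment reminders)
#   0 9 * * 2   -> every Tuesday at 09:00 (follow-up calls)


def build_call_request(account_sid: str, auth_token: str, to_number: str,
                       from_number: str, twiml_url: str) -> urllib.request.Request:
    """Build (but do not send) a Twilio 'create call' request."""
    url = f"https://api.twilio.com/2010-04-01/Accounts/{account_sid}/Calls.json"
    body = urllib.parse.urlencode(
        {"To": to_number, "From": from_number, "Url": twiml_url}
    ).encode("ascii")
    token = base64.b64encode(f"{account_sid}:{auth_token}".encode("utf-8"))
    request = urllib.request.Request(url, data=body, method="POST")
    request.add_header("Authorization", "Basic " + token.decode("ascii"))
    request.add_header("Content-Type", "application/x-www-form-urlencoded")
    return request


if __name__ == "__main__":
    req = build_call_request("ACxxxxxxxx", "secret", "+15550100",
                             "+15550199", "https://voice.example.com/twiml")
    print(req.get_method(), req.full_url)
    # To place the call: urllib.request.urlopen(req)
```

The request targets api.twilio.com, so it passes through osmoda-egress only if that host is on the allowlist.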

Step-by-Step Setup

How to implement this architecture pattern on your osModa server.

  1. Set up Twilio or Vonage account

    Create an account with Twilio, Vonage, or your preferred telephony provider. Purchase a phone number and configure it to route calls to a webhook URL. This is separate from osModa and billed by the telephony provider.

  2. Spawn a Pro server and SSH in

    Go to spawn.os.moda and create a Pro server ($34.99/mo). SSH in with your key. All 9 Rust daemons are already running. Pro provides 4 CPU and 8 GB RAM for real-time audio processing.

  3. Configure osmoda-egress allowlist

    Add your telephony provider API (api.twilio.com, api.vonage.com), your LLM provider (api.anthropic.com, api.openai.com), and any cloud STT/TTS APIs to the osmoda-egress allowlist.

  4. Build the voice server

    Write the voice server that receives audio from the telephony provider, runs STT (local Whisper or cloud API), sends transcript to LLM, converts response via TTS, and streams audio back. Handle conversation state between turns.

  5. Register with osmoda-watch for 24/7 uptime

    Register the voice server with osmoda-watch using immediate restart policy. The voice server must be available whenever a call comes in. Test the crash recovery by verifying the process restarts within seconds.

  6. Configure outbound campaigns (optional)

    If you need scheduled outbound calls, set up osmoda-routines with cron expressions. Your agent code initiates calls via the Twilio/Vonage API at the scheduled times. Each call is logged to the audit ledger.

Recommended Plan

Voice agents need more compute than text-based agents. Real-time audio processing -- STT, LLM, and TTS running in sequence with low latency -- requires adequate CPU and RAM.

Pro — $34.99/mo

4 CPU · 8 GB RAM · 80 GB disk

Handles real-time STT, LLM reasoning, and TTS for voice calls. 4 CPU cores manage concurrent audio streams, and 8 GB RAM accommodates speech models if running STT/TTS locally. For high-volume concurrent calls, consider Team ($62.99/mo, 8 CPU, 16 GB RAM).

Additional costs

Twilio/Vonage: ~$1/mo per phone number + per-minute call charges. LLM API: varies by provider and token volume. Cloud STT/TTS: varies by provider if not running locally.
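
The cost components above add up as in the small calculator below. Only the Pro price and the roughly $1/mo number fee come from this page. The per-minute rate, LLM spend, and speech API spend are inputs you supply from your own providers' pricing, and the values in the usage example are placeholders, not quotes.

```python
from decimal import Decimal

OSMODA_PRO = Decimal("34.99")  # Pro plan, per month
NUMBER_FEE = Decimal("1.00")   # approximate fee per phone number, per month


def monthly_cost(phone_numbers: int, call_minutes: int,
                 per_minute_rate: Decimal,
                 llm_spend: Decimal = Decimal("0"),
                 speech_api_spend: Decimal = Decimal("0")) -> Decimal:
    """Estimate total monthly cost in USD from the components listed above."""
    telephony = NUMBER_FEE * phone_numbers + per_minute_rate * call_minutes
    return OSMODA_PRO + telephony + llm_spend + speech_api_spend


if __name__ == "__main__":
    # Placeholder inputs: 1 number, 1000 minutes at a hypothetical
    # $0.01/min, and $20 of LLM usage.
    print(monthly_cost(1, 1000, Decimal("0.01"), llm_spend=Decimal("20")))
```

Decimal is used instead of float so that currency amounts add up exactly.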

Frequently Asked Questions

Is this a downloadable voice assistant?

No. This is an architecture pattern describing how to design a voice/telephony agent on osModa. It maps the audio processing pipeline from inbound call through speech-to-text, LLM reasoning, and text-to-speech back to spoken response. You build the voice agent yourself using a telephony provider (Twilio, Vonage, or similar) for call handling, and deploy the processing logic on your osModa server.

Does osModa include telephony infrastructure?

No. osModa does not include telephony infrastructure -- no phone numbers, no SIP trunks, no call routing. You need a separate subscription with Twilio, Vonage, or a similar telephony provider for the actual call handling. osModa hosts the voice server that processes calls: running your STT engine, LLM reasoning, and TTS engine. The telephony provider handles the phone network; osModa handles the compute.

What does the total cost look like?

You need both osModa hosting and a telephony provider subscription. osModa Pro is $34.99/mo for the server that runs your voice processing. Twilio or Vonage costs vary by usage -- typically $1/mo per phone number plus per-minute charges for calls. You also pay for LLM API usage (Anthropic, OpenAI, etc.) based on token volume. The total depends on your call volume and LLM provider pricing.

How does osmoda-watch help with voice agents?

Voice agents must be available 24/7 -- if a customer calls and your server is down, the call fails. osmoda-watch supervises your voice server process and restarts it immediately if it crashes. This is critical for telephony applications where downtime directly means missed calls. The crash and restart are logged to the audit ledger.

Can I run scheduled outbound call campaigns?

Yes. osmoda-routines supports cron-based scheduling, so you can trigger outbound call campaigns at specific times -- for example, appointment reminders every morning at 9 AM, or follow-up calls every Tuesday. Your agent code initiates the calls via the Twilio/Vonage API through osmoda-egress, and osmoda-routines handles the scheduling.

What plan is recommended for a voice agent?

Pro ($34.99/mo, 4 CPU, 8 GB RAM, 80 GB disk) is recommended. Real-time audio processing -- running STT, LLM inference, and TTS with low latency -- requires more compute than text-based agents. The 4 CPU cores handle concurrent audio streams, and the 8 GB RAM accommodates speech processing models. For high-volume concurrent calls, consider Team ($62.99/mo).

Build Your Voice Agent on osModa

Spawn a dedicated server with osmoda-egress for API access control, osmoda-watch for 24/7 uptime, and osmoda-routines for outbound campaigns. From $34.99/month plus your telephony provider.

Last updated: March 2026