The Voice AI Pipeline
Every voice AI call follows a real-time loop. Each iteration of this loop incurs costs from three services:
1. STT: Human speaks → audio streamed to Speech-to-Text
2. LLM: Transcribed text + history sent to language model
3. TTS: LLM response text synthesized to speech audio
This loop repeats on every turn (one human→AI exchange). A typical call has 5–8 turns per minute, meaning all three services are invoked 5–8 times per minute of call time.
Context Accumulation — The #1 Cost Driver
The single most important concept in voice AI pricing is context accumulation. On every turn, the LLM receives:
LLM Input = System Prompt + Full Conversation History + Current User Input
The conversation history grows with every turn. Since you pay for all input tokens on every call, the cumulative cost grows quadratically:
| Turn | History Tokens | Total Input This Turn | Cumulative Input Tokens |
| 1 | 0 | 1,220 | 1,220 |
| 5 | 280 | 1,500 | 6,800 |
| 10 | 630 | 1,850 | 15,350 |
| 20 | 1,330 | 2,550 | 37,700 |
| 30 | 2,030 | 3,250 | 67,050 |
Based on: 1,200-token system prompt, 20 tokens of human input and 50 tokens of AI output per turn
Key Insight: A 10-minute call (60 turns) doesn't cost 2× a 5-minute call (30 turns) in LLM input. With the defaults above it processes roughly 3× the input tokens (197,100 vs. 67,050), because the ever-growing history is re-processed on each turn. This is why context caching and history trimming are critical for profitability.
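The table above can be reproduced with a short sketch. The function names and parameters here are illustrative, not part of any library; the defaults are the example values from the table (1,200-token system prompt, 20 human tokens and 50 AI tokens per turn), and caching is ignored.

```python
def input_tokens_on_turn(turn: int, system: int = 1200, human: int = 20, ai: int = 50) -> int:
    """Tokens sent to the LLM on a given 1-indexed turn (caching ignored)."""
    history = (turn - 1) * (human + ai)  # all earlier human + AI messages
    return system + history + human


def cumulative_input_tokens(turns: int, system: int = 1200, human: int = 20, ai: int = 50) -> int:
    """Total input tokens processed across turns 1..turns."""
    return sum(input_tokens_on_turn(t, system, human, ai) for t in range(1, turns + 1))


for n in (1, 10, 30, 60):
    print(n, input_tokens_on_turn(n), cumulative_input_tokens(n))

ratio = cumulative_input_tokens(60) / cumulative_input_tokens(30)
print(f"60 turns vs 30 turns: {ratio:.2f}x the input tokens")
```

With these defaults the ratio comes out at about 2.94: the history term roughly quadruples when the turn count doubles, while the fixed system prompt term only doubles.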
Context Caching — Your Savings Multiplier
OpenAI caches the prompt prefix (system prompt + old conversation history) across sequential API calls. When the same prefix is seen again, those tokens are billed at the cached rate:
Full Input Rate: $2.00 / 1M tokens
Cached Input Rate: $0.50 / 1M tokens
Savings: 75%
Since the system prompt and previous history are always the same prefix on subsequent turns, nearly all accumulated input tokens benefit from caching. Only the brand-new user input (~20 tokens/turn) is billed at the full rate.
In our default 5-minute call example, caching reduces total LLM input cost from ~$0.134 to ~$0.034 — a 4× reduction.
Caching Math — Per-Turn Breakdown
Here's exactly what gets billed as cached vs new on each turn (using S = 1,200 system prompt, H = 20 human tokens, A = 50 AI tokens):
| Turn | LLM Receives | Cached Portion @ $0.50/1M | New Portion @ $2.00/1M | Total Input |
| 1 | [SysPrompt] + [User₁] | 1,200 tk (sys prompt)* | 20 tk (user input) | 1,220 tk |
| 2 | [SysPrompt] + [H₁+A₁] + [User₂] | 1,270 tk (sys + 1 turn history) | 20 tk | 1,290 tk |
| 3 | [SysPrompt] + [H₁+A₁+H₂+A₂] + [User₃] | 1,340 tk (sys + 2 turns) | 20 tk | 1,360 tk |
| 10 | [SysPrompt] + [9 turns history] + [User₁₀] | 1,830 tk | 20 tk | 1,850 tk |
| 30 | [SysPrompt] + [29 turns history] + [User₃₀] | 3,230 tk | 20 tk | 3,250 tk |
* On turn 1 nothing is cached yet, so OpenAI bills the system prompt at the full rate and caches the prefix for subsequent turns. This estimate counts it as cached for simplicity, which understates the cost by about 1,200 × $1.50/1M ≈ $0.0018 per call.
Key point: The cached portion grows from 1,200 → 3,230 tokens over 30 turns, but you only pay $0.50/1M for all of it. The "new" portion stays constant at just 20 tokens/turn at $2.00/1M. Without caching, ALL 3,250 tokens on turn 30 would be billed at the full $2.00/1M rate.
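The per-turn split in the table can be sketched as follows. This is a minimal illustration using the rates quoted above; the function names are hypothetical, and it keeps the table's simplification that the whole prefix (including the turn-1 system prompt) is counted as cached.

```python
FULL_RATE = 2.00 / 1_000_000    # USD per token, full input rate quoted above
CACHED_RATE = 0.50 / 1_000_000  # USD per token, cached input rate quoted above


def turn_split(turn: int, system: int = 1200, human: int = 20, ai: int = 50) -> tuple[int, int]:
    """Return (cached_tokens, new_tokens) for a 1-indexed turn."""
    cached = system + (turn - 1) * (human + ai)  # system prompt + prior history
    return cached, human                          # only the new user input is uncached


def turn_input_cost(turn: int, caching: bool = True) -> float:
    """Input cost in USD for one turn, with or without prefix caching."""
    cached, new = turn_split(turn)
    if caching:
        return cached * CACHED_RATE + new * FULL_RATE
    return (cached + new) * FULL_RATE


for t in (1, 2, 3, 10, 30):
    cached, new = turn_split(t)
    print(t, cached, new, f"${turn_input_cost(t):.6f}", f"${turn_input_cost(t, caching=False):.6f}")
```

On turn 30 this gives about $0.001655 with caching versus $0.0065 without.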
The Summation Formula — Step by Step
You might see the formula: Cached tokens = Σ(S + (i-1)×T). Here's what that means and how to compute it:
Setup
S = System Prompt tokens (e.g. 1,200)
H = Human tokens per turn (e.g. 20)
A = AI output tokens per turn (e.g. 50)
T = H + A = tokens added to history per turn = 70
N = Total turns (e.g. 30)
What happens on each turn?
On turn i, the cached portion (everything the LLM has already seen) is:
Cached portion on turn i = S + (i − 1) × T
This is because there are (i − 1) previous turns, each contributing T = 70 tokens to the history.
Total cached tokens across ALL turns
We sum up the cached portion for every turn from 1 to N:
Total = Σ [ S + (i−1) × T ] for i = 1 to N
Step 1: Split the sum
= Σ(S) + Σ((i−1) × T)
= N × S + T × Σ(i − 1)
Step 2: Evaluate Σ(i − 1) for i = 1 to N
= 0 + 1 + 2 + 3 + ... + (N − 1)
= N × (N − 1) / 2
Step 3: Plug in
= N × S + T × N × (N − 1) / 2
Worked Example (S=1,200, T=70, N=30)
= 30 × 1,200 + 70 × 30 × 29 / 2
= 36,000 + 70 × 435
= 36,000 + 30,450
= 66,450 cached tokens
Same formula with S=5,000
= 30 × 5,000 + 70 × 435
= 150,000 + 30,450
= 180,450 cached tokens
Notice: The T × N(N−1)/2 term is quadratic in N (grows as N²). Doubling the number of turns roughly quadruples the history re-processing cost. This is why longer calls are disproportionately expensive, and why caching is so critical.
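As a sanity check on the derivation, the closed form can be compared against a direct sum. The sketch below uses hypothetical function names; S, T, and N have the same meaning as in the Setup list.

```python
def cached_tokens_closed_form(n: int, s: int, t: int) -> int:
    """N*S + T*N*(N-1)/2, the result of Step 3."""
    return n * s + t * n * (n - 1) // 2  # n*(n-1) is always even, so // is exact


def cached_tokens_by_loop(n: int, s: int, t: int) -> int:
    """Direct sum of S + (i-1)*T for i = 1..N."""
    return sum(s + (i - 1) * t for i in range(1, n + 1))


for s in (1200, 5000):
    closed = cached_tokens_closed_form(30, s, 70)
    assert closed == cached_tokens_by_loop(30, s, 70)
    print(s, closed)

# History term only (S = 0): doubling N from 30 to 60.
history_30 = cached_tokens_closed_form(30, 0, 70)
history_60 = cached_tokens_closed_form(60, 0, 70)
print(history_30, history_60, round(history_60 / history_30, 2))
```

The last line shows the history term going from 30,450 to 123,900 tokens, a factor of about 4.07.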
Tool Calling — Schema, Costs & The Double-Call Penalty
What is Tool Schema Overhead?
When you give the LLM access to tools (functions), you must send the tool definitions as part of every API call. These are JSON schemas describing each tool — its name, description, parameter types, and constraints. OpenAI includes these alongside your system prompt.
Tool Schema Overhead = Number of Tools × Avg Schema Tokens per Tool
Example: 5 tools × 80 tokens/tool = 400 tokens
These 400 tokens are added to your effective system prompt on every single turn, regardless of whether a tool is actually called on that turn.
A typical tool definition looks like this (simplified):
{
  "type": "function",
  "function": {
    "name": "search_knowledge_base",
    "description": "Search the KB for relevant articles",
    "parameters": {
      "type": "object",
      "properties": {
        "query": {"type": "string", "description": "Search query"},
        "limit": {"type": "integer", "description": "Max results"}
      }
    }
  }
}
A schema like this is roughly 60–100 tokens. Complex tools with many parameters, enums, or nested objects can reach 150–200 tokens each.
Impact: With 5 tools at 80 tk each = 400 extra tokens on every turn. Over 30 turns, that's 400 × 30 = 12,000 additional input tokens processed. With caching, most of this is billed at the cheaper $0.50/1M rate (since tool definitions are part of the cached prefix), but it still adds up.
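The overhead arithmetic can be sketched in a few lines. The function names are illustrative, the rates are the input rates quoted earlier, and the sketch assumes one LLM call per turn (no tool invocations).

```python
CACHED_RATE = 0.50 / 1_000_000  # USD per token, cached input rate from the text
FULL_RATE = 2.00 / 1_000_000    # USD per token, full input rate from the text


def schema_overhead_tokens(num_tools: int, avg_schema_tokens: int) -> int:
    """Tool-definition tokens attached to every LLM call."""
    return num_tools * avg_schema_tokens


def schema_tokens_per_call(num_tools: int, avg_schema_tokens: int, turns: int) -> int:
    """Schema tokens processed over a whole call, one LLM call per turn."""
    return schema_overhead_tokens(num_tools, avg_schema_tokens) * turns


total = schema_tokens_per_call(num_tools=5, avg_schema_tokens=80, turns=30)
print(total)
print(f"all cached: ${total * CACHED_RATE:.4f}")
print(f"no caching: ${total * FULL_RATE:.4f}")
```

For the example above this is 12,000 tokens, or roughly $0.006 per call if everything is billed at the cached rate and $0.024 if nothing is.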
The Double-Call Penalty
When the LLM decides to invoke a tool (e.g., booking an appointment, querying a knowledge base), it triggers two LLM calls for that single turn:
Call 1 — Decide & Call Tool
Input: System Prompt + Tool Schemas + History + User Input
Output: Tool call arguments (~30 tokens)
Tool executes & returns result
Call 2 — Generate Response
Input: System Prompt + Tool Schemas + History + User Input + Tool Args + Tool Result
Output: Final conversational response to the human
This means tool-call turns process approximately double the input context compared to normal turns. The "Tool Result Size" variable represents the average tokens returned by the tool execution (KB search results, API responses, etc.).
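A rough sketch of the two-call accounting follows. The function name is hypothetical, the 30-token argument size comes from the description above, and the 300-token tool result is an assumed value chosen purely for illustration.

```python
def tool_turn_input_tokens(prefix: int, user: int, tool_args: int = 30,
                           tool_result: int = 300) -> tuple[int, int]:
    """Return (call_1_input, call_2_input) for a turn that invokes one tool.

    prefix = system prompt + tool schemas + conversation history so far.
    tool_result=300 is an assumed size, not a measured one.
    """
    call_1 = prefix + user                       # decide and emit the tool call
    call_2 = call_1 + tool_args + tool_result    # same context plus args and result
    return call_1, call_2


# Turn 10 of the running example: 1,200 system + 400 schema + 630 history tokens.
call_1, call_2 = tool_turn_input_tokens(prefix=1200 + 400 + 630, user=20)
print(call_1, call_2, call_1 + call_2)
print(round((call_1 + call_2) / call_1, 2))
```

Under these assumptions the tool-call turn processes 4,830 input tokens versus 2,250 for a normal turn, a factor of about 2.15.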
TTS Cost — Characters, Speed & Wastage
AI Speaking Speed (WPM)
This variable controls how many characters the TTS engine must generate. It represents the words-per-minute rate of synthesized speech. Normal conversational speed is 130–150 WPM. The formula:
Characters = Call Duration × AI Speaking Ratio × Words/Min × 5.5 chars/word
At 140 WPM with 40% AI ratio in a 5-min call: 5 × 0.4 × 140 × 5.5 = 1,540 characters.
At 120 WPM: only 1,320 characters (14% less TTS cost).
At 160 WPM: 1,760 characters (14% more TTS cost).
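The character formula is easy to check in code. This is a minimal sketch with an illustrative function name; the 5.5 characters-per-word average is the one used in the formula above.

```python
CHARS_PER_WORD = 5.5  # average used in the formula above


def tts_characters(call_minutes: float, ai_ratio: float, wpm: float) -> float:
    """Characters the TTS engine must synthesize, before any interruption wastage."""
    return call_minutes * ai_ratio * wpm * CHARS_PER_WORD


for wpm in (120, 140, 160):
    print(wpm, round(tts_characters(5, 0.40, wpm)))
```

This reproduces the 1,320, 1,540, and 1,760 character figures for a 5-minute call at a 40% AI speaking ratio.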
Interruption Wastage
When a human interrupts the AI mid-sentence, the TTS engine has already generated (and billed) the full response. Only a fraction is actually played to the caller.
Example: AI generates a 200-character response. Human interrupts after 50 characters. You paid for all 200 characters but only used 50. With 15% wastage, for every 1,000 useful characters, you generate and pay for ~1,150 characters.
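In code, the wastage adjustment is a simple markup. The sketch below (hypothetical function name) follows the example above in treating wastage as extra characters on top of the useful ones, so 15% wastage means multiplying by 1.15.

```python
def billed_tts_characters(useful_chars: float, wastage: float = 0.15) -> float:
    """Characters generated and paid for, given the characters actually played.

    Wastage is modeled as a markup on useful characters, matching the
    1,000 -> ~1,150 example in the text.
    """
    return useful_chars * (1 + wastage)


print(round(billed_tts_characters(1000)))
print(round(billed_tts_characters(1540)))
```

For the 1,540 useful characters of the 5-minute example this gives 1,771 billed characters.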
STT Billing Models
Soniox (Token-Based)
- Input audio tokens: Billed for entire streaming session duration (including silence and AI speech). 1 hour of audio ≈ 30,000 tokens → 500 tokens/min.
- Output text tokens: Billed for transcribed text only. 1 character ≈ 0.3 tokens. Scales with how much the human speaks.
Deepgram Nova-3 (Per-Minute)
- Flat rate: $0.0058 per connected minute.
- Bills for full call duration regardless of speech activity.
- Simpler to estimate, but typically more expensive than Soniox for longer calls.
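The two billing models can be compared with a small sketch. The function names are illustrative. The Soniox per-token rates ($2.00 and $4.00 per 1M tokens) are the ones used in this document's worked example and are treated here as assumptions rather than current list prices.

```python
SONIOX_AUDIO_TOKENS_PER_MIN = 500        # 30,000 tokens per hour of audio
SONIOX_TEXT_TOKENS_PER_CHAR = 0.3        # 1 character ≈ 0.3 tokens
SONIOX_INPUT_RATE = 2.00 / 1_000_000     # USD per audio token (from the worked example)
SONIOX_OUTPUT_RATE = 4.00 / 1_000_000    # USD per text token (from the worked example)
DEEPGRAM_PER_MINUTE = 0.0058             # USD per connected minute


def soniox_cost(call_minutes: float, human_chars: float) -> float:
    """Audio tokens for the whole session plus text tokens for the transcript."""
    audio = call_minutes * SONIOX_AUDIO_TOKENS_PER_MIN * SONIOX_INPUT_RATE
    text = human_chars * SONIOX_TEXT_TOKENS_PER_CHAR * SONIOX_OUTPUT_RATE
    return audio + text


def deepgram_cost(call_minutes: float) -> float:
    """Flat per-minute billing, independent of speech activity."""
    return call_minutes * DEEPGRAM_PER_MINUTE


# 5-minute call with 2,400 transcribed human characters.
print(f"Soniox:   ${soniox_cost(5, 2400):.4f}")
print(f"Deepgram: ${deepgram_cost(5):.4f}")
```

For this call the sketch gives about $0.0079 for Soniox and $0.0290 for Deepgram.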
Variable Reference
| Variable | What It Controls | Typical Range |
| Call Duration | Total connected time. Longer calls = more context accumulation = quadratically higher LLM cost | 2–15 min |
| Turns / Minute | Number of human→AI exchange cycles per minute. More turns = faster history growth | 5–8 |
| System Prompt | Base instruction set sent on every LLM call. Includes agent persona, rules, guidelines, flows | 800–10,000 tokens |
| Human Input / Turn | Avg. LLM tokens from transcribed human speech per turn. 20 tokens ≈ one short sentence | 15–60 tokens |
| AI Output / Turn | Avg. LLM tokens per AI response. 50 tokens ≈ 2–3 spoken sentences | 30–120 tokens |
| Context Caching | Enables OpenAI prompt prefix caching at 75% discount on cached tokens | On / Off |
| Tool Calls / Call | Number of function calls per call (KB lookup, booking, API calls). Each triggers 2 LLM calls | 0–10 |
| Number of Tools | How many function/tool definitions the LLM has access to. Each tool's schema is sent on every turn | 2–10 |
| Avg Schema / Tool | Avg. tokens per tool definition (name + description + parameters JSON). Total overhead = numTools × avgSchema | 60–150 tokens |
| Tool Result Size | Avg. tokens returned by tool executions (KB results, API responses) | 100–500 tokens |
| AI Speaking Ratio | % of call duration where AI is actively speaking. Rest is human speech + silence | 30–50% |
| AI Speaking Speed | Words per minute of synthesized speech. Determines characters generated by TTS | 120–160 WPM |
| Interruption Waste | Extra TTS characters generated but never played because the human interrupts, as a % of the characters actually played | 10–25% |
Worked Example — Standard 5-Minute Call
Settings: 5 min, 6 turns/min, Soniox STT, GPT-4.1 with caching, ElevenLabs Flash v2.5, 40% AI ratio, 140 WPM, 15% interruption waste, no tool calls
Step 1: STT Cost (Soniox)
Input audio tokens = 5 min × 500 tokens/min = 2,500 tokens
Input cost = (2,500 / 1,000,000) × $2.00 = $0.0050
Human characters = 30 turns × 20 tk/turn × 4 chars/tk = 2,400 chars
Output text tokens = 2,400 × 0.3 = 720 tokens
Output cost = (720 / 1,000,000) × $4.00 = $0.0029
Total STT = $0.0079
Step 2: TTS Cost (ElevenLabs)
AI speaking time = 5 × 40% = 2 minutes
Words spoken = 2 × 140 = 280 words
Raw characters = 280 × 5.5 = 1,540 chars
With 15% wastage = 1,540 × 1.15 = 1,771 chars
Cost = (1,771 / 1,000) × $0.05 = $0.0886
Step 3: LLM Cost (GPT-4.1, cached)
S = 1,200 (system prompt), H = 20, A = 50, T = H + A = 70 tokens/turn, N = 30
Cached tokens = N×S + T×N(N−1)/2
= 30 × 1,200 + 70 × 30 × 29 / 2
= 36,000 + 70 × 435 = 36,000 + 30,450 = 66,450 tokens
Cached cost = (66,450 / 1,000,000) × $0.50 = $0.0332
New input (only current user speech, billed at full rate):
= N × H = 30 × 20 = 600 tokens
New cost = (600 / 1,000,000) × $2.00 = $0.0012
Output tokens = N × A = 30 × 50 = 1,500 tokens
Output cost = (1,500 / 1,000,000) × $8.00 = $0.0120
Total LLM = $0.0332 + $0.0012 + $0.0120 = $0.0464
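Step 3 can be expressed as a small function. This is a sketch rather than a definitive implementation: the function name and the `caching` flag are illustrative, and the per-1M-token rates are the ones used in the step above.

```python
def llm_cost(n: int, s: int = 1200, h: int = 20, a: int = 50, caching: bool = True,
             full_rate: float = 2.00, cached_rate: float = 0.50,
             output_rate: float = 8.00) -> dict[str, float]:
    """LLM cost in USD for a call of n turns; rates are USD per 1M tokens."""
    t = h + a
    prefix_tokens = n * s + t * n * (n - 1) // 2  # system prompt + history, summed over turns
    new_tokens = n * h                             # current user input, always full rate
    output_tokens = n * a
    prefix_rate = cached_rate if caching else full_rate
    input_cost = (prefix_tokens * prefix_rate + new_tokens * full_rate) / 1_000_000
    output_cost = output_tokens * output_rate / 1_000_000
    return {"input": input_cost, "output": output_cost, "total": input_cost + output_cost}


cached = llm_cost(30)
uncached = llm_cost(30, caching=False)
print(f"cached:   input ${cached['input']:.4f}, total ${cached['total']:.4f}")
print(f"uncached: input ${uncached['input']:.4f}, total ${uncached['total']:.4f}")
```

With caching the input cost is about $0.0344 and the LLM total about $0.0464; without caching the input cost alone rises to about $0.1341.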
Final Result
STT (Soniox): $0.0079
TTS (ElevenLabs): $0.0886
LLM Input (cached): $0.0344
LLM Output: $0.0120
Total per Call: $0.1429
Blended Cost / Min: $0.0286
Without caching, all 67,050 LLM input tokens would be billed at the full rate ($0.1341), bringing the same call to about $0.2426. Caching saves roughly 41% of total cost.
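The three steps of the worked example can be combined into one sketch of a per-call estimator for the cached configuration. The function name and parameter names are hypothetical, and every rate and conversion factor is hard-coded from the worked example (Soniox token rates, $0.05 per 1,000 TTS characters, GPT-4.1 token rates, 4 characters per token).

```python
def call_cost(minutes: int = 5, turns_per_min: int = 6, s: int = 1200, h: int = 20,
              a: int = 50, ai_ratio: float = 0.40, wpm: int = 140,
              wastage: float = 0.15) -> dict[str, float]:
    """Estimated USD cost of one call with prompt caching enabled."""
    n = minutes * turns_per_min

    # Step 1: STT (Soniox), audio tokens for the session + text tokens for the transcript
    audio_tokens = minutes * 500
    text_tokens = n * h * 4 * 0.3            # tokens -> characters -> Soniox text tokens
    stt = (audio_tokens * 2.00 + text_tokens * 4.00) / 1_000_000

    # Step 2: TTS, characters including interruption wastage
    chars = minutes * ai_ratio * wpm * 5.5 * (1 + wastage)
    tts = chars / 1000 * 0.05

    # Step 3: LLM, cached prefix + new user input + output
    cached_tokens = n * s + (h + a) * n * (n - 1) // 2
    llm_input = (cached_tokens * 0.50 + n * h * 2.00) / 1_000_000
    llm_output = n * a * 8.00 / 1_000_000

    total = stt + tts + llm_input + llm_output
    return {"stt": stt, "tts": tts, "llm_input": llm_input, "llm_output": llm_output,
            "total": total, "per_minute": total / minutes}


result = call_cost()
for name, value in result.items():
    print(f"{name:>10}: ${value:.4f}")
```

With the default settings this reproduces the cached totals in the Final Result list: about $0.1429 per call and $0.0286 per minute.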