Voice AI Pricing Calculator

The Voice AI Pipeline

Every voice AI call follows a real-time loop. Each iteration of this loop incurs costs from three services:

1. STT: Human speaks → audio streamed to Speech-to-Text

2. LLM: Transcribed text + history sent to language model

3. TTS: LLM response text synthesized to speech audio

This loop repeats on every turn (one human→AI exchange). A typical call has 5–8 turns per minute, meaning all three services are invoked 5–8 times per minute of call time.

Context Accumulation — The #1 Cost Driver

The single most important concept in voice AI pricing is context accumulation. On every turn, the LLM receives:

LLM Input = System Prompt + Full Conversation History + Current User Input

The conversation history grows with every turn. Since you pay for all input tokens on every LLM request, the cumulative input cost grows quadratically with the number of turns:

Turn	History Tokens	Total Input This Turn	Cumulative Input Tokens
1	0	1,220	1,220
5	280	1,500	6,800
10	630	1,850	15,350
20	1,330	2,550	37,700
30	2,030	3,250	67,050

Based on: 1,200-token system prompt, 20 tk human input, 50 tk AI output per turn
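The table rows can be regenerated with a short loop. This is a minimal sketch that assumes the same fixed per-turn sizes as the note above; the function name and its parameters are illustrative, not part of the calculator.

```python
def input_tokens_by_turn(n_turns, system=1200, human=20, ai=50):
    """Yield (turn, history, total_input, cumulative_input) for each turn."""
    cumulative = 0
    for turn in range(1, n_turns + 1):
        history = (turn - 1) * (human + ai)  # every earlier human + AI exchange
        total = system + history + human     # what the LLM receives this turn
        cumulative += total                  # running sum of billed input tokens
        yield turn, history, total, cumulative


rows = {row[0]: row for row in input_tokens_by_turn(30)}
for turn in (1, 10, 30):
    print(rows[turn])
# (1, 0, 1220, 1220)
# (10, 630, 1850, 15350)
# (30, 2030, 3250, 67050)
```

Changing `system`, `human`, or `ai` shows how quickly the cumulative column reacts to a longer prompt or chattier turns.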

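Key Insight: A 10-minute call (60 turns) doesn't cost 2× a 5-minute call (30 turns) in LLM input. With the defaults above it costs roughly 3× (197,100 vs 67,050 cumulative input tokens), because you're re-processing an ever-growing history on each turn, and the ratio approaches 4× as history outweighs the system prompt. This is why context caching and history trimming are critical for profitability.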

Context Caching — Your Savings Multiplier

OpenAI caches the prompt prefix (system prompt + old conversation history) across sequential API calls. When the same prefix is seen again, those tokens are billed at the cached rate:

Full Input Rate: $2.00 / 1M tokens

Cached Input Rate: $0.50 / 1M tokens

Savings on cached tokens: 75%

Since the system prompt and previous history are always the same prefix on subsequent turns, nearly all accumulated input tokens benefit from caching. Only the brand-new user input (~20 tokens/turn) is billed at the full rate.

In our default 5-minute call example, caching reduces total LLM input cost from ~$0.134 to ~$0.034 — a 4× reduction.

Caching Math — Per-Turn Breakdown

Here's exactly what gets billed as cached vs new on each turn (using S = 1,200 system prompt, H = 20 human tokens, A = 50 AI tokens):

Turn	LLM Receives	Cached Portion @ $0.50/1M	New Portion @ $2.00/1M	Total Input
1	[SysPrompt] + [User₁]	1,200 tk (sys prompt)*	20 tk (user input)	1,220 tk
2	[SysPrompt] + [H₁+A₁] + [User₂]	1,270 tk (sys + 1 turn history)	20 tk	1,290 tk
3	[SysPrompt] + [H₁+A₁+H₂+A₂] + [User₃]	1,340 tk (sys + 2 turns)	20 tk	1,360 tk
10	[SysPrompt] + [9 turns history] + [User₁₀]	1,830 tk	20 tk	1,850 tk
30	[SysPrompt] + [29 turns history] + [User₃₀]	3,230 tk	20 tk	3,250 tk

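* Turn 1 is the first call, so its prefix is not yet in the cache: OpenAI bills it at the full rate and stores it for subsequent turns. This estimate treats it as cached for simplicity, which understates the cost by 1,200 × ($2.00 − $0.50) / 1M ≈ $0.0018 per call.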

Key point: The cached portion grows from 1,200 → 3,230 tokens over 30 turns, but you only pay $0.50/1M for all of it. The "new" portion stays constant at just 20 tokens/turn at $2.00/1M. Without caching, ALL 3,250 tokens on turn 30 would be billed at the full $2.00/1M rate.
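The per-turn split can be expressed as a small function. This is a sketch under the same simplification as the table (the whole prefix, including turn 1, is priced at the cached rate); `turn_input_cost` and the rate constants are illustrative names, with the rates taken from the figures quoted above.

```python
FULL_RATE = 2.00 / 1_000_000    # dollars per uncached input token
CACHED_RATE = 0.50 / 1_000_000  # dollars per cached input token


def turn_input_cost(turn, system=1200, human=20, ai=50, caching=True):
    """Return (cached_tokens, new_tokens, dollars) for one turn's LLM input."""
    cached = system + (turn - 1) * (human + ai)  # prefix seen on earlier turns
    new = human                                  # only the current utterance
    if caching:
        cost = cached * CACHED_RATE + new * FULL_RATE
    else:
        cost = (cached + new) * FULL_RATE        # everything at the full rate
    return cached, new, cost


for turn in (1, 2, 3, 10, 30):
    cached, new, cost = turn_input_cost(turn)
    print(f"turn {turn}: cached={cached} new={new} cost=${cost:.6f}")

with_cache = sum(turn_input_cost(t)[2] for t in range(1, 31))
without_cache = sum(turn_input_cost(t, caching=False)[2] for t in range(1, 31))
print(f"30 turns: ${with_cache:.4f} cached vs ${without_cache:.4f} uncached")
# 30 turns: $0.0344 cached vs $0.1341 uncached
```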

The Summation Formula — Step by Step

You might see the formula: Cached tokens = Σ [ S + (i − 1) × T ], summed over turns i = 1 to N. Here's what that means and how to compute it:

Setup

S = System Prompt tokens (e.g. 1,200)
H = Human tokens per turn (e.g. 20)
A = AI output tokens per turn (e.g. 50)
T = H + A = tokens added to history per turn = 70
N = Total turns (e.g. 30)

What happens on each turn?

On turn i, the cached portion (everything the LLM has already seen) is:

Cached portion on turn i = S + (i − 1) × T

This is because there are (i − 1) previous turns, each contributing T = 70 tokens to the history.

Total cached tokens across ALL turns

We sum up the cached portion for every turn from 1 to N:

Total = Σ [ S + (i−1) × T ] for i = 1 to N

Step 1: Split the sum
= Σ(S) + Σ((i−1) × T)
= N × S + T × Σ(i − 1)

Step 2: Evaluate Σ(i − 1) for i = 1 to N
= 0 + 1 + 2 + 3 + ... + (N − 1)
= N × (N − 1) / 2

Step 3: Plug in
= N × S + T × N × (N − 1) / 2

Worked Example (S=1,200, T=70, N=30)

= 30 × 1,200 + 70 × 30 × 29 / 2
= 36,000 + 70 × 435
= 36,000 + 30,450
= 66,450 cached tokens

Same formula with S=5,000

= 30 × 5,000 + 70 × 435
= 150,000 + 30,450
= 180,450 cached tokens

Notice: The T × N(N−1)/2 term is quadratic in N (grows as N²). Doubling the number of turns roughly quadruples the history re-processing cost. This is why longer calls are disproportionately expensive, and why caching is so critical.
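As a sanity check on the derivation, the closed form can be compared against the direct sum. This is only an illustrative sketch; both function names are made up for this example.

```python
def total_cached_tokens(n_turns, system, per_turn):
    """Closed form from Step 3: N*S + T*N*(N-1)/2."""
    # N*(N-1) is always even, so integer division is exact here.
    return n_turns * system + per_turn * n_turns * (n_turns - 1) // 2


def total_cached_tokens_loop(n_turns, system, per_turn):
    """Direct sum of S + (i-1)*T for i = 1..N, used as a cross-check."""
    return sum(system + (i - 1) * per_turn for i in range(1, n_turns + 1))


print(total_cached_tokens(30, 1200, 70))  # 66450
print(total_cached_tokens(30, 5000, 70))  # 180450

# Doubling the turns: the history term alone goes from 30,450 to 123,900.
print(total_cached_tokens(60, 0, 70) / total_cached_tokens(30, 0, 70))
```

Setting `system` to 0 isolates the quadratic history term, which is the part that roughly quadruples when the number of turns doubles.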

Tool Calling — Schema, Costs & The Double-Call Penalty

What is Tool Schema Overhead?

When you give the LLM access to tools (functions), you must send the tool definitions as part of every API call. These are JSON schemas describing each tool — its name, description, parameter types, and constraints. OpenAI includes these alongside your system prompt.

Tool Schema Overhead = Number of Tools × Avg Schema Tokens per Tool

Example: 5 tools × 80 tokens/tool = 400 tokens

These 400 tokens are added to your effective system prompt on every single turn,
regardless of whether a tool is actually called on that turn.

A typical tool definition looks like this (simplified):

{"type": "function", "function": {
  "name": "search_knowledge_base",
  "description": "Search the KB for relevant articles",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {"type": "string", "description": "Search query"},
      "limit": {"type": "integer", "description": "Max results"}
    }
  }
}}

A schema like this is roughly 60–100 tokens. Complex tools with many parameters, enums, or nested objects can reach 150–200 tokens each.

Impact: With 5 tools at 80 tk each = 400 extra tokens on every turn. Over 30 turns, that's 400 × 30 = 12,000 additional input tokens processed. With caching, most of this is billed at the cheaper $0.50/1M rate (since tool definitions are part of the cached prefix), but it still adds up.
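The overhead arithmetic can be sketched as follows. The function name is hypothetical, and the cost figure assumes every schema token is billed at the cached rate, which is an upper bound on the caching benefit since the text above only says "most".

```python
def tool_schema_overhead(num_tools, avg_schema_tokens, n_turns,
                         cached_rate_per_million=0.50):
    """Return (tokens_per_turn, total_tokens, dollars) for tool schemas."""
    per_turn = num_tools * avg_schema_tokens  # sent on every turn
    total = per_turn * n_turns                # re-sent N times per call
    # Assumption: the whole schema block sits in the cached prefix.
    cost = total / 1_000_000 * cached_rate_per_million
    return per_turn, total, cost


per_turn, total, cost = tool_schema_overhead(5, 80, 30)
print(per_turn, total, f"${cost:.4f}")  # 400 12000 $0.0060
```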

The Double-Call Penalty

When the LLM decides to invoke a tool (e.g., booking an appointment, querying a knowledge base), it triggers two LLM calls for that single turn:

Call 1 — Decide & Call Tool

Input: System Prompt + Tool Schemas + History + User Input

Output: Tool call arguments (~30 tokens)

Tool executes & returns result

Call 2 — Generate Response

Input: System Prompt + Tool Schemas + History + User Input + Tool Args + Tool Result

Output: Final conversational response to the human

This means tool-call turns process approximately double the input context compared to normal turns. The "Tool Result Size" variable represents the average tokens returned by the tool execution (KB search results, API responses, etc.).
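To make the penalty concrete, here is a rough sketch of the input tokens for one tool-call turn. The function name is illustrative; the 30-token argument size comes from Call 1 above, while the 300-token tool result is an assumed mid-range value, not a measured one.

```python
def tool_turn_input_tokens(system, schemas, history, user,
                           tool_args=30, tool_result=300):
    """Return (call_1_input, call_2_input, combined) for a tool-call turn."""
    call_1 = system + schemas + history + user  # decide and emit the tool call
    call_2 = call_1 + tool_args + tool_result   # same context plus args and result
    return call_1, call_2, call_1 + call_2


# Turn 10 of the running example with 5 tools at 80 tokens each.
call_1, call_2, combined = tool_turn_input_tokens(
    system=1200, schemas=400, history=630, user=20
)
print(call_1, call_2, combined)      # 2250 2580 4830
print(round(combined / call_1, 2))   # 2.15 times a normal turn's input
```

The ratio sits slightly above 2 because the second call also carries the tool arguments and result, so larger tool results push it higher.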

TTS Cost — Characters, Speed & Wastage

AI Speaking Speed (WPM)

This variable controls how many characters the TTS engine must generate. It represents the words-per-minute rate of synthesized speech. Normal conversational speed is 130–150 WPM. The formula:

Characters = Call Duration × AI Speaking Ratio × Words/Min × 5.5 chars/word

At 140 WPM with 40% AI ratio in a 5-min call: 5 × 0.4 × 140 × 5.5 = 1,540 characters.
At 120 WPM: only 1,320 characters (14% less TTS cost).
At 160 WPM: 1,760 characters (14% more TTS cost).

Interruption Wastage

When a human interrupts the AI mid-sentence, the TTS engine has already generated (and billed) the full response. Only a fraction is actually played to the caller.

Example: AI generates a 200-character response. Human interrupts after 50 characters. You paid for all 200 characters but only used 50. With 15% wastage, for every 1,000 useful characters, you generate and pay for ~1,150 characters.
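Both TTS adjustments in this section (speaking speed and interruption wastage) can be combined in a few lines. This is a minimal sketch with illustrative function names; it models wastage as a flat percentage added to the useful characters, as in the example above.

```python
CHARS_PER_WORD = 5.5  # average used in this section's formula


def tts_characters(call_minutes, ai_ratio, wpm):
    """Characters the AI needs to speak during the call."""
    return call_minutes * ai_ratio * wpm * CHARS_PER_WORD


def billed_characters(useful_chars, waste_pct):
    """Characters paid for once interrupted (unplayed) audio is included."""
    return useful_chars * (1 + waste_pct / 100)


for wpm in (120, 140, 160):
    useful = tts_characters(5, 0.40, wpm)
    print(wpm, round(useful), round(billed_characters(useful, 15)))
# 120 1320 1518
# 140 1540 1771
# 160 1760 2024
```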

STT Billing Models

Soniox (Token-Based)

Input audio tokens: Billed for entire streaming session duration (including silence and AI speech). 1 hour of audio ≈ 30,000 tokens → 500 tokens/min.
Output text tokens: Billed for transcribed text only. 1 character ≈ 0.3 tokens. Scales with how much the human speaks.

Deepgram Nova-3 (Per-Minute)

Flat rate: $0.0058 per connected minute.
Bills for full call duration regardless of speech activity.
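Simpler to estimate, but more expensive than Soniox at the rates used here: a 5-minute call costs 5 × $0.0058 = $0.0290, versus about $0.0079 with Soniox in the worked example below.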
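The two billing models can be compared side by side with a sketch like the one below. The function names are illustrative, and the Soniox rates ($2.00 per 1M input tokens, $4.00 per 1M output tokens) are the ones used in the worked example later in this document rather than quoted vendor prices.

```python
def soniox_cost(call_minutes, human_chars,
                input_rate_per_million=2.00, output_rate_per_million=4.00):
    """Token-based: audio tokens for the whole session plus text tokens."""
    audio_tokens = call_minutes * 500  # 30,000 tokens per hour of audio
    text_tokens = human_chars * 0.3    # 1 character is about 0.3 tokens
    return (audio_tokens * input_rate_per_million
            + text_tokens * output_rate_per_million) / 1_000_000


def deepgram_cost(call_minutes, rate_per_minute=0.0058):
    """Flat per-minute: billed for connected time regardless of speech."""
    return call_minutes * rate_per_minute


# 5-minute call in which the human produces about 2,400 characters.
print(f"Soniox:   ${soniox_cost(5, 2400):.4f}")  # Soniox:   $0.0079
print(f"Deepgram: ${deepgram_cost(5):.4f}")      # Deepgram: $0.0290
```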

Variable Reference

Variable	What It Controls	Typical Range
Call Duration	Total connected time. Longer calls = more context accumulation = quadratically higher LLM cost	2–15 min
Turns / Minute	Number of human→AI exchange cycles per minute. More turns = faster history growth	5–8
System Prompt	Base instruction set sent on every LLM call. Includes agent persona, rules, guidelines, flows	800–10,000 tokens
Human Input / Turn	Avg. LLM tokens from transcribed human speech per turn. 20 tokens ≈ one short sentence	15–60 tokens
AI Output / Turn	Avg. LLM tokens per AI response. 50 tokens ≈ 2–3 spoken sentences	30–120 tokens
Context Caching	Enables OpenAI prompt prefix caching at 75% discount on cached tokens	On / Off
Tool Calls / Call	Number of function calls per call (KB lookup, booking, API calls). Each triggers 2 LLM calls	0–10
Number of Tools	How many function/tool definitions the LLM has access to. Each tool's schema is sent on every turn	2–10
Avg Schema / Tool	Avg. tokens per tool definition (name + description + parameters JSON). Total overhead = numTools × avgSchema	60–150 tokens
Tool Result Size	Avg. tokens returned by tool executions (KB results, API responses)	100–500 tokens
AI Speaking Ratio	% of call duration where AI is actively speaking. Rest is human speech + silence	30–50%
AI Speaking Speed	Words per minute of synthesized speech. Determines characters generated by TTS	120–160 WPM
Interruption Waste	% of TTS characters generated but never played due to human interrupting	10–25%

Worked Example — Standard 5-Minute Call

Settings: 5 min, 6 turns/min, Soniox STT, GPT-4.1 with caching, ElevenLabs Flash v2.5, 40% AI ratio, 140 WPM, 15% interruption waste, no tool calls

Step 1: STT Cost (Soniox)

Input audio tokens = 5 min × 500 tokens/min = 2,500 tokens
Input cost = (2,500 / 1,000,000) × $2.00 = $0.0050

Human characters = 30 turns × 20 tk/turn × 4 chars/tk = 2,400 chars
Output text tokens = 2,400 × 0.3 = 720 tokens
Output cost = (720 / 1,000,000) × $4.00 = $0.0029

Total STT = $0.0079

Step 2: TTS Cost (ElevenLabs)

AI speaking time = 5 × 40% = 2 minutes
Words spoken = 2 × 140 = 280 words
Raw characters = 280 × 5.5 = 1,540 chars
With 15% wastage = 1,540 × 1.15 = 1,771 chars
Cost = (1,771 / 1,000) × $0.05 = $0.0886

Step 3: LLM Cost (GPT-4.1, cached)

S = 1,200 (system prompt), H = 20, A = 50, T = H + A = 70 tokens/turn, N = 30

Cached tokens = N×S + T×N(N−1)/2
= 30 × 1,200 + 70 × 30 × 29 / 2
= 36,000 + 70 × 435 = 36,000 + 30,450 = 66,450 tokens
Cached cost = (66,450 / 1,000,000) × $0.50 = $0.0332

New input (only current user speech, billed at full rate):
= N × H = 30 × 20 = 600 tokens
New cost = (600 / 1,000,000) × $2.00 = $0.0012

Output tokens = N × A = 30 × 50 = 1,500 tokens
Output cost = (1,500 / 1,000,000) × $8.00 = $0.0120

Total LLM = $0.0332 + $0.0012 + $0.0120 = $0.0464

Final Result

STT (Soniox)	$0.0079
TTS (ElevenLabs)	$0.0886
LLM Input (cached)	$0.0344
LLM Output	$0.0120
Total per Call	$0.1429
Blended Cost / Min	$0.0286

Without caching, all 67,050 input tokens are billed at $2.00/1M ($0.1341), so the same call would cost about $0.2425. Caching saves about 41% on total cost.
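The three steps of the worked example can be strung together into one function. This is a sketch, not the calculator's actual implementation: the `CallSettings` class and `call_cost` function are hypothetical, the hard-coded rates are the ones used in Steps 1 to 3, and tool calls are left out.

```python
from dataclasses import dataclass


@dataclass
class CallSettings:
    minutes: float = 5
    turns_per_min: int = 6
    system: int = 1200      # system prompt tokens (S)
    human: int = 20         # human tokens per turn (H)
    ai: int = 50            # AI output tokens per turn (A)
    ai_ratio: float = 0.40  # share of the call the AI is speaking
    wpm: int = 140
    waste_pct: float = 15   # interruption wastage
    caching: bool = True


def call_cost(s: CallSettings) -> dict:
    n = round(s.minutes * s.turns_per_min)  # total turns (N)

    # Step 1: STT (Soniox): audio tokens plus text tokens at 4 chars per token.
    stt = (s.minutes * 500 * 2.00 + n * s.human * 4 * 0.3 * 4.00) / 1e6

    # Step 2: TTS: characters spoken, inflated by wastage, at $0.05 per 1K.
    chars = s.minutes * s.ai_ratio * s.wpm * 5.5 * (1 + s.waste_pct / 100)
    tts = chars / 1000 * 0.05

    # Step 3: LLM: prefix tokens (cached if enabled), new input, and output.
    prefix = n * s.system + (s.human + s.ai) * n * (n - 1) // 2
    prefix_rate = 0.50 if s.caching else 2.00
    llm_in = (prefix * prefix_rate + n * s.human * 2.00) / 1e6
    llm_out = n * s.ai * 8.00 / 1e6

    total = stt + tts + llm_in + llm_out
    return {"stt": stt, "tts": tts, "llm_in": llm_in, "llm_out": llm_out,
            "total": total, "per_min": total / s.minutes}


for name, value in call_cost(CallSettings()).items():
    print(f"{name}: ${value:.4f}")
```

Passing `CallSettings(caching=False)` prices the whole prefix at the full input rate, which makes it easy to see how much of the total depends on caching.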