Voice AI Pricing Calculator

Model your pipeline costs & optimize profit margins with real-world parameters

Calculator Inputs (defaults and slider ranges)

Call Metrics
  • Call Duration: 5.0 min (range 1–30 min)
  • Turns / Minute: 6 (range 2–15)

Speech-to-Text (STT)
  • 1 hr audio ≈ 30K input tokens · 1 char output ≈ 0.3 tokens

LLM (OpenAI GPT-4.1)
  • System Prompt Size, in tokens (range 200–6,000+)
  • Human Input / Turn, in tokens
  • AI Output / Turn, in tokens
  • Context Caching: cached input at $0.50/1M instead of $2.00/1M (saves 75%)
  • Tool / Function Calling: each tool call triggers 2 LLM calls (decide + respond)

Text-to-Speech (TTS)
  • AI Speaking Ratio: 40% (range 10%–70%)
  • Interruption Waste: 15% (range 0%–40%)
  • AI Speaking Speed: 140 WPM (range 100–180 WPM)

Pricing Strategy
  • Target Profit Markup: 100% (range 10%–300%)
  • Gross Margin at that markup: 50.0%

Calculator outputs: Total Cost / Call, Blended Cost / Min, Retail Price / Min, Gross Margin, and a Pipeline Cost Breakdown (LLM Input, LLM Output, TTS (ElevenLabs), STT, Total per Call, Cost per Minute) with Units, Cost and % columns. At the defaults the call has 30 turns and processes 67.0K total input tokens, because context grows each turn.

Chart: Cost vs Duration (Caching Impact)
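The link between the markup setting and the gross margin readout can be sketched in a few lines. This is a minimal sketch, assuming retail price is cost × (1 + markup) and gross margin is profit divided by retail price, which is consistent with a 100% markup showing a 50.0% margin. The function names and the $0.10 cost are hypothetical and not taken from the calculator.

```python
def retail_price(cost: float, markup: float) -> float:
    """Retail price for a given cost and markup (1.0 means 100%)."""
    return cost * (1 + markup)


def gross_margin(cost: float, price: float) -> float:
    """Share of the retail price that is profit."""
    return (price - cost) / price


cost_per_call = 0.10  # hypothetical cost in dollars, for illustration only
price = retail_price(cost_per_call, markup=1.0)
print(f"price ${price:.2f}, margin {gross_margin(cost_per_call, price):.1%}")
```

Under these assumptions the margin depends only on the markup, not on the cost itself: margin = markup / (1 + markup).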

The Voice AI Pipeline

Every voice AI call follows a real-time loop. Each iteration of this loop incurs costs from three services:

1. STT: Human speaks → audio streamed to Speech-to-Text
2. LLM: Transcribed text + history sent to language model
3. TTS: LLM response text synthesized to speech audio

This loop repeats on every turn (one human→AI exchange). A typical call has 5–8 turns per minute, meaning all three services are invoked 5–8 times per minute of call time.

Context Accumulation — The #1 Cost Driver

The single most important concept in voice AI pricing is context accumulation. On every turn, the LLM receives:

LLM Input = System Prompt + Full Conversation History + Current User Input

The conversation history grows with every turn. Since you pay for all input tokens on every LLM request, the cumulative input cost over a call grows quadratically with the number of turns:

Turn | History Tokens | Total Input This Turn | Cumulative Input Tokens
1 | 0 | 1,220 | 1,220
5 | 280 | 1,500 | 6,800
10 | 630 | 1,850 | 15,350
20 | 1,330 | 2,550 | 37,700
30 | 2,030 | 3,250 | 67,050

Based on: 1,200-token system prompt, 20 tk human input, 50 tk AI output per turn
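The accumulation can be reproduced with a short loop. This is a minimal sketch using the stated parameters (1,200-token system prompt, 20 tk human input, 50 tk AI output per turn); the function name is hypothetical.

```python
def input_tokens_by_turn(system_prompt=1200, human=20, ai=50, turns=30):
    """Return (turn, history, total_input, cumulative_input) for each turn."""
    rows = []
    cumulative = 0
    for turn in range(1, turns + 1):
        # History holds every earlier human input and AI response.
        history = (turn - 1) * (human + ai)
        total_input = system_prompt + history + human
        cumulative += total_input
        rows.append((turn, history, total_input, cumulative))
    return rows


rows = input_tokens_by_turn()
for turn, history, total_input, cumulative in rows:
    if turn in (1, 5, 10, 20, 30):
        print(f"{turn:>2} | {history:>5,} | {total_input:>5,} | {cumulative:>6,}")
```

The per-turn input grows linearly, so the running total grows with the square of the turn count.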

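Key Insight: A 10-minute call (60 turns) doesn't cost 2× a 5-minute call (30 turns). Its cumulative LLM input is roughly 3× larger (197,100 vs 67,050 tokens with the defaults above), because an ever-growing history is re-processed on each turn, while STT and TTS costs grow only linearly with duration. This is why context caching and history trimming are critical for profitability.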

Context Caching — Your Savings Multiplier

OpenAI caches the prompt prefix (system prompt + old conversation history) across sequential API calls. When the same prefix is seen again, those tokens are billed at the cached rate:

Full Input Rate: $2.00 / 1M tokens
Cached Input Rate: $0.50 / 1M tokens
Savings: 75%

Since the system prompt and previous history are always the same prefix on subsequent turns, nearly all accumulated input tokens benefit from caching. Only the brand-new user input (~20 tokens/turn) is billed at the full rate.

In our default 5-minute call example, caching reduces total LLM input cost from ~$0.134 to ~$0.034 — a 4× reduction.
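The two figures can be checked with a small calculation. This is a sketch under the document's simplifying assumption that the whole prefix (system prompt plus prior history) is billed at the cached rate on every turn and only the new user input is billed at the full rate; the function name is hypothetical.

```python
FULL_RATE = 2.00 / 1_000_000    # dollars per uncached input token
CACHED_RATE = 0.50 / 1_000_000  # dollars per cached input token


def llm_input_cost(system_prompt=1200, human=20, ai=50, turns=30, caching=True):
    """Total LLM input cost over a call, with or without prefix caching."""
    cached_tokens = 0
    new_tokens = 0
    for turn in range(1, turns + 1):
        prefix = system_prompt + (turn - 1) * (human + ai)
        if caching:
            cached_tokens += prefix
            new_tokens += human
        else:
            new_tokens += prefix + human
    return cached_tokens * CACHED_RATE + new_tokens * FULL_RATE


with_cache = llm_input_cost(caching=True)
without_cache = llm_input_cost(caching=False)
print(f"cached: ${with_cache:.4f}, uncached: ${without_cache:.4f}")
```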

Caching Math — Per-Turn Breakdown

Here's exactly what gets billed as cached vs new on each turn (using S = 1,200 system prompt, H = 20 human tokens, A = 50 AI tokens):

Turn | LLM Receives | Cached Portion @ $0.50/1M | New Portion @ $2.00/1M | Total Input
1 | [SysPrompt] + [User₁] | 1,200 tk (sys prompt)* | 20 tk (user input) | 1,220 tk
2 | [SysPrompt] + [H₁+A₁] + [User₂] | 1,270 tk (sys + 1 turn history) | 20 tk | 1,290 tk
3 | [SysPrompt] + [H₁+A₁+H₂+A₂] + [User₃] | 1,340 tk (sys + 2 turns) | 20 tk | 1,360 tk
10 | [SysPrompt] + [9 turns history] + [User₁₀] | 1,830 tk | 20 tk | 1,850 tk
30 | [SysPrompt] + [29 turns history] + [User₃₀] | 3,230 tk | 20 tk | 3,250 tk

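* On turn 1 nothing has been cached yet, so in practice the system prompt is billed at the full rate on that first call, and OpenAI caches the prefix for subsequent turns. The model counts it as cached to keep the formula uniform, which understates the cost by about $0.0018 for a 1,200-token prompt.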

Key point: The cached portion grows from 1,200 → 3,230 tokens over 30 turns, but you only pay $0.50/1M for all of it. The "new" portion stays constant at just 20 tokens/turn at $2.00/1M. Without caching, ALL 3,250 tokens on turn 30 would be billed at the full $2.00/1M rate.
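The per-turn split can be expressed as a small function. This is a minimal sketch using S = 1,200, H = 20 and A = 50 and the same simplification as the text (the entire prefix is treated as cached); `turn_breakdown` is a hypothetical name.

```python
FULL_RATE = 2.00 / 1_000_000    # dollars per uncached input token
CACHED_RATE = 0.50 / 1_000_000  # dollars per cached input token


def turn_breakdown(turn, system_prompt=1200, human=20, ai=50):
    """Return (cached, new, total) input tokens for a given turn."""
    cached = system_prompt + (turn - 1) * (human + ai)
    new = human
    return cached, new, cached + new


for turn in (1, 2, 3, 10, 30):
    cached, new, total = turn_breakdown(turn)
    cost_with_cache = cached * CACHED_RATE + new * FULL_RATE
    cost_without = total * FULL_RATE
    print(f"turn {turn:>2}: cached={cached:,} new={new} total={total:,} "
          f"${cost_with_cache:.6f} vs ${cost_without:.6f} uncached")
```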

The Summation Formula — Step by Step

You might see the formula: Cached tokens = Σ(S + (i-1)×T). Here's what that means and how to compute it:

Setup

S = System Prompt tokens (e.g. 1,200)
H = Human tokens per turn (e.g. 20)
A = AI output tokens per turn (e.g. 50)
T = H + A = tokens added to history per turn = 70
N = Total turns (e.g. 30)

What happens on each turn?

On turn i, the cached portion (everything the LLM has already seen) is:

Cached portion on turn i = S + (i − 1) × T

This is because there are (i − 1) previous turns, each contributing T = 70 tokens to the history.

Total cached tokens across ALL turns

We sum up the cached portion for every turn from 1 to N:

Total = Σ [ S + (i−1) × T ] for i = 1 to N

Step 1: Split the sum
= Σ(S) + Σ((i−1) × T)
= N × S + T × Σ(i − 1)

Step 2: Evaluate Σ(i − 1) for i = 1 to N
= 0 + 1 + 2 + 3 + ... + (N − 1)
= N × (N − 1) / 2

Step 3: Plug in
= N × S + T × N × (N − 1) / 2

Worked Example (S=1,200, T=70, N=30)

= 30 × 1,200 + 70 × 30 × 29 / 2
= 36,000 + 70 × 435
= 36,000 + 30,450
= 66,450 cached tokens

Same formula with S=5,000

= 30 × 5,000 + 70 × 435
= 150,000 + 30,450
= 180,450 cached tokens

Notice: The T × N(N−1)/2 term is quadratic in N (grows as N²). Doubling the number of turns roughly quadruples the history re-processing cost. This is why longer calls are disproportionately expensive, and why caching is so critical.
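A quick way to gain confidence in the closed form is to compare it with the turn-by-turn sum. The sketch below does that for the two worked examples and also measures the history term when the turn count doubles; the function names are hypothetical.

```python
def cached_tokens_loop(S, T, N):
    """Sum the cached portion S + (i - 1) * T over turns 1..N."""
    return sum(S + (i - 1) * T for i in range(1, N + 1))


def cached_tokens_closed_form(S, T, N):
    """Closed form N*S + T*N*(N-1)/2; N*(N-1) is always even, so // is exact."""
    return N * S + T * N * (N - 1) // 2


for S in (1200, 5000):
    assert cached_tokens_loop(S, 70, 30) == cached_tokens_closed_form(S, 70, 30)

print(cached_tokens_closed_form(1200, 70, 30))  # 66450
print(cached_tokens_closed_form(5000, 70, 30))  # 180450

# History term alone (S = 0) for 30 and 60 turns.
history_30 = cached_tokens_closed_form(0, 70, 30)
history_60 = cached_tokens_closed_form(0, 70, 60)
print(history_30, history_60, round(history_60 / history_30, 2))
```

With S set to 0, only the quadratic term remains, and going from 30 to 60 turns multiplies it by about 4.07.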

Tool Calling — Schema, Costs & The Double-Call Penalty

What is Tool Schema Overhead?

When you give the LLM access to tools (functions), you must send the tool definitions as part of every API call. These are JSON schemas describing each tool — its name, description, parameter types, and constraints. OpenAI includes these alongside your system prompt.

Tool Schema Overhead = Number of Tools × Avg Schema Tokens per Tool

Example: 5 tools × 80 tokens/tool = 400 tokens

These 400 tokens are added to your effective system prompt on every single turn,
regardless of whether a tool is actually called on that turn.

A typical tool definition looks like this (simplified):

{"type": "function", "function": {
  "name": "search_knowledge_base",
  "description": "Search the KB for relevant articles",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {"type": "string", "description": "Search query"},
      "limit": {"type": "integer", "description": "Max results"}
    }
  }
}}

A schema like this is roughly 60–100 tokens. Complex tools with many parameters, enums, or nested objects can reach 150–200 tokens each.

Impact: With 5 tools at 80 tk each = 400 extra tokens on every turn. Over 30 turns, that's 400 × 30 = 12,000 additional input tokens processed. With caching, most of this is billed at the cheaper $0.50/1M rate (since tool definitions are part of the cached prefix), but it still adds up.
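The overhead arithmetic is easy to script. This is a minimal sketch using the figures above (5 tools, 80 tokens each, 30 turns) and the GPT-4.1 input rates quoted earlier; `schema_overhead` is a hypothetical helper.

```python
def schema_overhead(num_tools, avg_schema_tokens, turns):
    """Return (tokens added per turn, tokens added over the whole call)."""
    per_turn = num_tools * avg_schema_tokens
    return per_turn, per_turn * turns


per_turn, per_call = schema_overhead(num_tools=5, avg_schema_tokens=80, turns=30)
cached_cost = per_call * 0.50 / 1_000_000  # if billed at the cached rate
full_cost = per_call * 2.00 / 1_000_000    # if billed at the full rate
print(per_turn, per_call)  # 400 12000
print(f"${cached_cost:.4f} cached vs ${full_cost:.4f} uncached")
```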

The Double-Call Penalty

When the LLM decides to invoke a tool (e.g., booking an appointment, querying a knowledge base), it triggers two LLM calls for that single turn:

Call 1 — Decide & Call Tool

Input: System Prompt + Tool Schemas + History + User Input

Output: Tool call arguments (~30 tokens)

Tool executes & returns result

Call 2 — Generate Response

Input: System Prompt + Tool Schemas + History + User Input + Tool Args + Tool Result

Output: Final conversational response to the human

This means tool-call turns process approximately double the input context compared to normal turns. The "Tool Result Size" variable represents the average tokens returned by the tool execution (KB search results, API responses, etc.).
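The input size of a tool-call turn can be estimated as below. This is a sketch, not the calculator's actual code: the ~30-token tool arguments come from the description above, while the 200-token tool result is a hypothetical value chosen for illustration, and the function name is invented.

```python
def tool_turn_input_tokens(system_prompt, schema_tokens, history, human,
                           tool_args=30, tool_result=200):
    """Return input tokens for (call 1, call 2) of a single tool-call turn."""
    # Call 1: the model decides to call a tool.
    call_1 = system_prompt + schema_tokens + history + human
    # Call 2: same context plus the tool arguments and the tool result.
    call_2 = call_1 + tool_args + tool_result
    return call_1, call_2


# Turn 10 of the default call: 630 history tokens, 5 tools at 80 tk each.
call_1, call_2 = tool_turn_input_tokens(1200, 400, 630, 20)
print(call_1, call_2, call_1 + call_2)
print(round((call_1 + call_2) / call_1, 2))  # ratio vs a normal turn
```

With these numbers the turn processes 4,730 input tokens instead of 2,250, about 2.1× a normal turn.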

TTS Cost — Characters, Speed & Wastage

AI Speaking Speed (WPM)

This variable controls how many characters the TTS engine must generate. It represents the words-per-minute rate of synthesized speech. Normal conversational speed is 130–150 WPM. The formula:

Characters = Call Duration × AI Speaking Ratio × Words/Min × 5.5 chars/word

At 140 WPM with 40% AI ratio in a 5-min call: 5 × 0.4 × 140 × 5.5 = 1,540 characters.
At 120 WPM: only 1,320 characters (14% less TTS cost).
At 160 WPM: 1,760 characters (14% more TTS cost).
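The character formula translates directly into code. This is a minimal sketch of the formula above with the same 5.5 chars/word assumption; `tts_characters` is a hypothetical name.

```python
def tts_characters(duration_min, ai_ratio, wpm, chars_per_word=5.5):
    """Characters the TTS engine must synthesize, before interruption waste."""
    return duration_min * ai_ratio * wpm * chars_per_word


for wpm in (120, 140, 160):
    chars = tts_characters(duration_min=5, ai_ratio=0.4, wpm=wpm)
    print(f"{wpm} WPM: {chars:,.0f} characters")
```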

Interruption Wastage

When a human interrupts the AI mid-sentence, the TTS engine has already generated (and billed) the full response. Only a fraction is actually played to the caller.

Example: AI generates a 200-character response. Human interrupts after 50 characters. You paid for all 200 characters but only used 50. With 15% wastage, for every 1,000 useful characters, you generate and pay for ~1,150 characters.
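The wastage adjustment can be sketched as a multiplier on the useful characters. This assumes, as in the example above, that waste is expressed relative to the characters actually played, and it uses the $0.05 per 1,000 characters rate from this document's worked example; the function names are hypothetical.

```python
def billed_tts_characters(useful_chars, waste=0.15):
    """Characters generated and paid for, including interrupted output."""
    return useful_chars * (1 + waste)


def tts_cost(billed_chars, rate_per_1k=0.05):
    """TTS cost in dollars at a per-1,000-character rate."""
    return billed_chars / 1000 * rate_per_1k


billed = billed_tts_characters(1540, waste=0.15)
print(f"{billed:,.0f} billed characters, ${tts_cost(billed):.4f}")
```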

STT Billing Models

Soniox (Token-Based)

  • Input audio tokens: Billed for entire streaming session duration (including silence and AI speech). 1 hour of audio ≈ 30,000 tokens → 500 tokens/min.
  • Output text tokens: Billed for transcribed text only. 1 character ≈ 0.3 tokens. Scales with how much the human speaks.

Deepgram Nova-3 (Per-Minute)

  • Flat rate: $0.0058 per connected minute.
  • Bills for full call duration regardless of speech activity.
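  • Simpler to estimate, but more expensive than Soniox under the rates used here: $0.0058 per minute versus about $0.0016 per minute for Soniox in the worked example below ($0.0079 for 5 minutes). Both scale linearly with duration, so the gap does not depend on call length.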
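The two billing models can be compared side by side. This is a sketch under the assumptions stated in this document: Soniox at 500 input tokens per minute and 0.3 output tokens per character, priced at the $2.00/1M and $4.00/1M rates used in the worked example, and Deepgram at a flat $0.0058 per minute. The function names are hypothetical.

```python
def soniox_cost(duration_min, human_chars, input_rate=2.00, output_rate=4.00):
    """Token-based STT cost in dollars (rates are per 1M tokens)."""
    input_tokens = duration_min * 500   # audio tokens for the whole session
    output_tokens = human_chars * 0.3   # text tokens for the transcript
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


def deepgram_cost(duration_min, per_minute=0.0058):
    """Per-minute STT cost in dollars."""
    return duration_min * per_minute


# 5-minute call with 2,400 transcribed human characters.
print(f"Soniox:   ${soniox_cost(5, 2400):.4f}")
print(f"Deepgram: ${deepgram_cost(5):.4f}")
```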

Variable Reference

Variable | What It Controls | Typical Range
Call Duration | Total connected time. Longer calls = more context accumulation = quadratically higher LLM input cost | 2–15 min
Turns / Minute | Number of human→AI exchange cycles per minute. More turns = faster history growth | 5–8
System Prompt | Base instruction set sent on every LLM call. Includes agent persona, rules, guidelines, flows | 800–10,000 tokens
Human Input / Turn | Avg. LLM tokens from transcribed human speech per turn. 20 tokens ≈ one short sentence | 15–60 tokens
AI Output / Turn | Avg. LLM tokens per AI response. 50 tokens ≈ 2–3 spoken sentences | 30–120 tokens
Context Caching | Enables OpenAI prompt prefix caching at 75% discount on cached tokens | On / Off
Tool Calls / Call | Number of function calls per call (KB lookup, booking, API calls). Each triggers 2 LLM calls | 0–10
Number of Tools | How many function/tool definitions the LLM has access to. Each tool's schema is sent on every turn | 2–10
Avg Schema / Tool | Avg. tokens per tool definition (name + description + parameters JSON). Total overhead = numTools × avgSchema | 60–150 tokens
Tool Result Size | Avg. tokens returned by tool executions (KB results, API responses) | 100–500 tokens
AI Speaking Ratio | % of call duration where AI is actively speaking. Rest is human speech + silence | 30–50%
AI Speaking Speed | Words per minute of synthesized speech. Determines characters generated by TTS | 120–160 WPM
Interruption Waste | Extra TTS characters generated but never played because the human interrupts, as a % of the characters actually played | 10–25%

Worked Example — Standard 5-Minute Call

Settings: 5 min, 6 turns/min, Soniox STT, GPT-4.1 with caching, ElevenLabs Flash v2.5, 40% AI ratio, 140 WPM, 15% interruption waste, no tool calls

Step 1: STT Cost (Soniox)

Input audio tokens = 5 min × 500 tokens/min = 2,500 tokens
Input cost = (2,500 / 1,000,000) × $2.00 = $0.0050

Human characters = 30 turns × 20 tk/turn × 4 chars/tk = 2,400 chars
Output text tokens = 2,400 × 0.3 = 720 tokens
Output cost = (720 / 1,000,000) × $4.00 = $0.0029

Total STT = $0.0079

Step 2: TTS Cost (ElevenLabs)

AI speaking time = 5 × 40% = 2 minutes
Words spoken = 2 × 140 = 280 words
Raw characters = 280 × 5.5 = 1,540 chars
With 15% wastage = 1,540 × 1.15 = 1,771 chars
Cost = (1,771 / 1,000) × $0.05 = $0.0886

Step 3: LLM Cost (GPT-4.1, cached)

S = 1,200 (system prompt), H = 20, A = 50, T = H + A = 70 tokens/turn, N = 30

Cached tokens = N×S + T×N(N−1)/2
= 30 × 1,200 + 70 × 30 × 29 / 2
= 36,000 + 70 × 435 = 36,000 + 30,450 = 66,450 tokens
Cached cost = (66,450 / 1,000,000) × $0.50 = $0.0332

New input (only current user speech, billed at full rate):
= N × H = 30 × 20 = 600 tokens
New cost = (600 / 1,000,000) × $2.00 = $0.0012

Output tokens = N × A = 30 × 50 = 1,500 tokens
Output cost = (1,500 / 1,000,000) × $8.00 = $0.0120

Total LLM = $0.0332 + $0.0012 + $0.0120 = $0.0464

Final Result

STT (Soniox): $0.0079
TTS (ElevenLabs): $0.0886
LLM Input (cached): $0.0344
LLM Output: $0.0120
Total per Call: $0.1429
Blended Cost / Min: $0.0286

Without caching, LLM input would cost $0.1341 instead of $0.0344, so the same call would cost about $0.2425. Caching saves about 41% of the total cost.
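The three steps of the worked example can be combined into one function. This is a minimal sketch that hard-codes the rates and conversion factors used in this section (Soniox $2.00/$4.00 per 1M tokens, ElevenLabs $0.05 per 1,000 characters, GPT-4.1 $2.00/$0.50/$8.00 per 1M tokens, no tool calls); it is not the calculator's actual source, and `call_cost` and its parameter names are hypothetical.

```python
def call_cost(duration_min=5, turns_per_min=6, S=1200, H=20, A=50,
              ai_ratio=0.40, wpm=140, waste=0.15, caching=True):
    """Per-call cost breakdown in dollars for a call without tool use."""
    N = duration_min * turns_per_min  # total turns
    T = H + A                         # tokens added to history per turn

    # STT (Soniox): 500 audio tokens/min, 4 chars per LLM token, 0.3 tk per char.
    stt = (duration_min * 500 * 2.00 + N * H * 4 * 0.3 * 4.00) / 1_000_000

    # TTS (ElevenLabs): characters spoken, inflated by interruption waste.
    chars = duration_min * ai_ratio * wpm * 5.5 * (1 + waste)
    tts = chars / 1000 * 0.05

    # LLM (GPT-4.1): prefix tokens via the closed form, plus new user input.
    prefix = N * S + T * N * (N - 1) // 2
    new = N * H
    prefix_rate = 0.50 if caching else 2.00
    llm_input = (prefix * prefix_rate + new * 2.00) / 1_000_000
    llm_output = N * A * 8.00 / 1_000_000

    total = stt + tts + llm_input + llm_output
    return {"stt": stt, "tts": tts, "llm_input": llm_input,
            "llm_output": llm_output, "total": total,
            "per_min": total / duration_min}


cached = call_cost(caching=True)
uncached = call_cost(caching=False)
for name, value in cached.items():
    print(f"{name:>10}: ${value:.4f}")
saving = 1 - cached["total"] / uncached["total"]
print(f"uncached total: ${uncached['total']:.4f}, saving {saving:.0%}")
```

Keeping each component separate makes it easy to see that TTS dominates the cached call, while LLM input dominates once caching is turned off.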