
How Much Does 1 Million AI Tokens Cost in 2026?

1 million AI tokens costs $0.06 to $75 in 2026 depending on the model and direction. Full pricing breakdown across OpenAI, Claude, Gemini, Llama, and DeepSeek.

7 min read · By AITOT Editorial

One million AI tokens costs between $0.06 and $75 in 2026 depending on the model and direction (input vs output). Output tokens cost 3–5× more than input tokens on most providers because generation is compute-bound. This guide shows exactly what 1M tokens costs across 22 models and helps you calculate your real bill before committing budget. For real-time pricing across every model, use our Token & Pricing Comparator.

The "1 million tokens" unit is the standard pricing reference because it's roughly the unit at which provider bills become non-trivial. A 100k-request/month chatbot at typical sizes burns through 200M+ tokens — multiply by your blended per-million rate and that's your monthly inference bill.

What does 1 million tokens cost across all major models?

The complete 2026 pricing table, with input and output rates per million tokens (output is typically the dominant cost):

Model | Input / 1M | Output / 1M
Amazon Nova Lite | $0.06 | $0.24
Mistral Small 3 | $0.20 | $0.60
Google Gemini 2.5 Flash | $0.30 | $2.50
DeepSeek V3 | $0.27 | $1.10
GPT-5 mini | $0.40 | $1.60
Cohere Command R | $0.15 | $0.60
Claude Haiku 4.5 | $0.80 | $4.00
Amazon Nova Pro | $0.80 | $3.20
DeepSeek R1 | $0.55 | $2.19
Llama 4 70B (Together) | $0.88 | $0.88
Mistral Large 2 | $2.00 | $6.00
Cohere Command R+ | $2.50 | $10.00
OpenAI GPT-4o | $2.50 | $10.00
Google Gemini 2.5 Pro | $2.50 | $15.00
Claude Sonnet 4.6 | $3.00 | $15.00
Llama 4 405B (Together) | $3.50 | $3.50
xAI Grok 4 | $5.00 | $25.00
OpenAI o3 | $10.00 | $40.00
OpenAI GPT-5 | $10.00 | $30.00
Claude Opus 4.7 | $15.00 | $75.00

That's a 1,250× spread between cheapest input (Nova Lite $0.06) and most expensive output (Opus 4.7 $75). The strategy that wins in 2026 is tiered model routing: cheap model for 80% of requests, premium model only when needed.
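A minimal sketch of that tiered-routing strategy in Python. The escalation flag (`needs_reasoning`) is a hypothetical stand-in for whatever classifier or heuristic your app actually uses; the prices come from the table above.

```python
# Tiered model routing: send most requests to a cheap model,
# escalate to a premium model only when a heuristic flags the request.

PRICES = {  # (input $/1M tokens, output $/1M tokens), from the table above
    "nova-lite": (0.06, 0.24),
    "sonnet-4.6": (3.00, 15.00),
}

def route(prompt: str, needs_reasoning: bool) -> str:
    """Pick a model tier. `needs_reasoning` stands in for your own
    classifier or heuristic (hypothetical)."""
    return "sonnet-4.6" if needs_reasoning else "nova-lite"

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the per-million rates above."""
    in_rate, out_rate = PRICES[model]
    return input_tokens * in_rate / 1e6 + output_tokens * out_rate / 1e6

# Blended per-request cost if 80% of traffic stays on the cheap tier:
cheap = request_cost("nova-lite", 2000, 400)
premium = request_cost("sonnet-4.6", 2000, 400)
blended = 0.8 * cheap + 0.2 * premium
```

At a typical 2,000-in / 400-out request, the 80/20 split costs a fraction of sending everything to the premium tier, which is the whole point of routing.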

Why does output cost more than input?

Three reasons output is structurally more expensive than input:

  1. Sequential generation. Output tokens are produced one at a time — token N depends on tokens 1 to N-1. Each output token requires a full forward pass through the model. Input tokens can be processed in parallel (one pass for the entire prompt).
  2. Memory bandwidth dominates. At inference time, the bottleneck is reading the model weights from GPU HBM for each output token, not the compute. Output is ~5× more bandwidth-intensive per token.
  3. GPU utilization patterns. Output generation underutilizes large GPU clusters (small batch = low parallelism). Providers price this opportunity cost.

Practical implication: if your workload is heavy on input (RAG, document analysis), you can use models that have a lower output-to-input price ratio. Claude has a 5:1 ratio on Sonnet; Llama on Together is 1:1 because they don't differentiate.
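To see how the ratio plays out, here is the blended-rate arithmetic for an input-heavy (95% input) workload, using the Sonnet 4.6 and Together Llama rates from the table above:

```python
# Effective blended $/1M rate as a function of the input share of tokens.

def blended_rate(in_rate: float, out_rate: float, input_share: float) -> float:
    """Weighted average of input and output rates, in $/1M tokens."""
    return in_rate * input_share + out_rate * (1 - input_share)

# Sonnet 4.6 (5:1 output:input ratio) vs Llama 4 70B on Together (1:1):
print(round(blended_rate(3.00, 15.00, 0.95), 2))  # 3.6
print(round(blended_rate(0.88, 0.88, 0.95), 2))   # 0.88
```

On a 95%-input RAG workload, Sonnet's effective rate falls close to its input price, which is why the output:input ratio matters more than the headline output rate.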

How big is 1 million tokens in real terms?

Practical sizing for 1 million tokens:

  • ~750,000 words in English text (1.33 tokens/word average)
  • ~4 average novels of prose (200k words each)
  • ~3 million characters of code (more tokens than English due to syntax)
  • ~80 hours of transcribed speech at 150 words/minute (750,000 words ÷ 150 wpm)
  • ~600 typical chatbot conversations of 10 turns each (1,700 tokens/conversation)
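The rules of thumb above are easy to encode. A rough sketch; the 1.33 tokens/word and 3 characters/token factors are the approximations from the list, not exact tokenizer output:

```python
# Back-of-envelope token sizing from the rules of thumb above.

def prose_tokens(words: int) -> int:
    """English prose: ~1.33 tokens per word on average."""
    return round(words * 1.33)

def code_tokens(characters: int) -> int:
    """Code: roughly 1 token per 3 characters (denser than prose)."""
    return round(characters / 3)

print(prose_tokens(750_000))   # 997500 (≈ 1M tokens)
print(code_tokens(3_000_000))  # 1000000
```

For real budgeting, run a sample of your actual prompts through your provider's tokenizer instead of these averages.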

For a real production calibration:

  • A customer-support chatbot: ~1.7M tokens per 1,000 conversations (at the 1,700-token-per-conversation average above)
  • A code-completion product (Copilot-style): 100k-500k tokens per active user per day
  • A research-agent product (Devin-style): 50k-200k tokens per task

Use these to calibrate your monthly forecast. Multiply daily tokens × 30 × your model's per-million rate.
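That forecast multiplication, as a one-liner. The user counts in the example are illustrative, not benchmarks:

```python
# Monthly forecast per the rule above: daily tokens × 30 × ($/1M rate).

def monthly_cost(daily_tokens: float, rate_per_million: float) -> float:
    """Projected monthly spend in dollars."""
    return daily_tokens * 30 * rate_per_million / 1e6

# e.g. a Copilot-style product: 300k tokens/user/day, 100 active users,
# at a $5/1M blended rate:
print(monthly_cost(300_000 * 100, 5.0))  # 4500.0
```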

What does the median app actually pay per month?

Industry survey of 2025-2026 AI startup AWS bills shows the median monthly LLM spend by application category:

Application | Monthly tokens (median) | Monthly spend (Sonnet 4.6 blended)
Internal AI tools | 10M | $90
B2B SaaS with AI features | 50M | $450
Customer support chatbot | 150M | $1,350
Coding assistant | 400M | $3,600
Consumer chat product | 2,000M (2B) | $18,000
AI agent platform | 10,000M (10B) | $90,000

The 40× spread between consumer chat and B2B SaaS reflects raw user volume difference, not architectural choice. Even an "expensive" model is fine at B2B scale.

How do I calculate my per-million-token cost?

For a chatbot with 2,000 input tokens + 400 output tokens per request, using Claude Sonnet 4.6 at $3 input + $15 output per million:

Per request:
  Input: 2000 × $3 / 1M = $0.006
  Output: 400 × $15 / 1M = $0.006
  Total: $0.012

Per 100,000 requests:
  $0.012 × 100k = $1,200

That's effectively $5 per million tokens blended (240M tokens used to generate $1,200 of bill).
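The same arithmetic, made explicit:

```python
# Worked example: Claude Sonnet 4.6 at $3/1M input, $15/1M output,
# with 2,000 input + 400 output tokens per request.

IN_RATE, OUT_RATE = 3.00, 15.00
in_tok, out_tok = 2000, 400

per_request = in_tok * IN_RATE / 1e6 + out_tok * OUT_RATE / 1e6
per_100k = per_request * 100_000
total_tokens = (in_tok + out_tok) * 100_000          # 240,000,000 tokens
blended = per_100k / (total_tokens / 1e6)            # $ per 1M blended tokens

print(per_request)  # 0.012
print(per_100k)     # 1200.0
print(blended)      # 5.0
```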

Notice the blended rate depends on your input-to-output ratio. A RAG-heavy workload with 95% input and 5% output sees a much cheaper effective rate on the same model. Plug your specific numbers into our Token & Pricing Comparator for real-time math.

How do I reduce per-million-token cost in 2026?

Three highest-leverage moves:

1. Switch models (5-50× reduction possible)

Most workloads run fine on Claude Haiku 4.5 ($0.80 input, $4 output) instead of Claude Sonnet 4.6 ($3 input, $15 output) — a ~4× cost cut. Or drop to Gemini 2.5 Flash ($0.30 / $2.50) for roughly another 2×. Always run a 100-example eval before switching; many workloads tolerate Haiku-class quality.

2. Prompt caching (40-80% input cost reduction)

For RAG workloads where the same context (system prompt + retrieved documents) gets reused across multiple queries, Anthropic charges only 10% of normal input price on cache hits. OpenAI charges 50%. Google 25%. Real-world cache hit rates are 50-70% steady-state.
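A quick way to estimate your effective input rate under caching, using the hit-price fractions above. This sketch ignores the cache-write surcharge some providers add on the first write:

```python
# Effective input $/1M under prompt caching.
# cache_price_fraction: what a cached token costs relative to a normal
# input token (0.10 for Anthropic, 0.50 for OpenAI, 0.25 for Google,
# per the rates cited above).

def effective_input_rate(base_rate: float, hit_rate: float,
                         cache_price_fraction: float) -> float:
    """Blend of full-price misses and discounted hits, in $/1M tokens."""
    return base_rate * ((1 - hit_rate) + hit_rate * cache_price_fraction)

# Sonnet 4.6 input at $3/1M with a 60% cache hit rate:
print(round(effective_input_rate(3.00, 0.60, 0.10), 4))  # 1.38
```

At the 50-70% steady-state hit rates cited above, Anthropic's 10%-price hits cut the effective input rate roughly in half.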

3. Batch APIs (50% discount on non-realtime)

OpenAI Batch API charges 50% of normal pricing for jobs that can wait up to 24 hours. Anthropic Batch API similar. Most providers offer some form of batch discount. Use for: nightly summarization, content moderation backfills, embedding generation, evaluation runs.

A typical mature production workload combines all three: tiered routing + caching + batch. Total reduction vs naive: 70-90%. Plug your numbers into the LLM Monthly Cost Estimator to see the projection.
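A rough way to model the stacked effect. Multiplying the three reductions is an approximation, since the levers overlap (caching only touches input, batch only applies to eligible jobs); the fractions below are illustrative:

```python
# Remaining fraction of the naive bill after stacking the three levers.

def stacked_cost_fraction(routing_saving: float, caching_saving: float,
                          batch_saving: float) -> float:
    """Multiply the surviving fractions; each arg is a saving in [0, 1]."""
    return (1 - routing_saving) * (1 - caching_saving) * (1 - batch_saving)

# e.g. 60% from routing, 30% from caching, 20% batch-eligible traffic:
remaining = stacked_cost_fraction(0.60, 0.30, 0.20)
print(round(1 - remaining, 3))  # 0.776, i.e. ~78% total reduction
```

That lands inside the 70-90% range quoted above for mature production workloads.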

What hidden costs make the per-million rate misleading?

Five line items that aren't in the headline rate:

  • Output rate limits. Some providers throttle output tokens/minute. Bursting traffic incurs queuing latency, not cost — but user experience suffers.
  • Failed generations. Safety refusals, malformed JSON outputs, mid-stream disconnects. Real-world wastage is 3-8% of token budget.
  • Speculative decoding. Some providers charge for speculatively-generated tokens that get rejected. Adds 5-15% to the bill.
  • Long context surcharges. Google Vertex charges 2× per token for contexts >128k. Anthropic has no surcharge but TTFT degrades.
  • Cross-region transfer. Self-hosted models incur egress fees that the per-token rate doesn't capture.

Budget a 15-20% buffer above your raw per-million-token math. The Agent Dev Cost Calculator bakes this in as the "inference tax" default of 30% — appropriate for agentic workloads.

What is the cheapest path to 1 million tokens in 2026?

If you only care about per-million-token cost (not quality or reliability):

  1. Amazon Nova Lite at $0.06 input, $0.24 output — 100M input tokens plus 100M output tokens for $30 total
  2. DeepSeek V3 at $0.27 input, $1.10 output — strong reasoning at cheap pricing
  3. Self-hosted Llama 4 8B on rented H100 — break-even at ~500M tokens/month
  4. Together Llama 4 8B at $0.22 / $0.22 — open-weight on hosted infra

For the cheapest flagship-class quality, Claude Sonnet 4.6 at $3 / $15 is the sweet spot. GPT-5 at $10 / $30 is premium-priced; rarely worth it over Sonnet unless you specifically need OpenAI ecosystem features.

The 2026 best practice is to track effective cost per resolved task, not per-million-token. A workload that uses 30% fewer tokens on Sonnet (because it gets answers right faster) can beat a workload that uses cheap Haiku tokens but loops 3× on bad outputs. Measure outcomes, not unit cost.

For comprehensive cost modeling across token + GPU + vector DB + everything else, use the Agent Dev Cost Calculator. For just the LLM piece across 22 models, use the Token & Pricing Comparator.